Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 109]
- cs.CV [Total: 130]
- cs.AI [Total: 48]
- cs.SD [Total: 5]
- cs.LG [Total: 189]
- cs.MA [Total: 10]
- cs.MM [Total: 2]
- eess.AS [Total: 9]
- eess.IV [Total: 8]
cs.CL
[1] Contextual Augmentation for Entity Linking using Large Language Models
Daniel Vollmers, Hamada M. Zahera, Diego Moussallem, Axel-Cyrille Ngonga Ngomo
Main category: cs.CL
TL;DR: Fine-tuned model for joint entity recognition and disambiguation using LLM context enrichment, achieving SOTA performance on out-of-domain datasets.
Details
Motivation: Traditional two-step entity linking methods are computationally intensive and less effective, motivating a unified approach.
Method: Joint integration of entity recognition and disambiguation in a unified framework with LLM context enrichment.
Result: Achieves state-of-the-art performance on benchmark datasets, especially on out-of-domain datasets.
Conclusion: Unified joint approach with context enrichment outperforms traditional two-step methods.
Abstract: Entity Linking involves detecting and linking entity mentions in natural language texts to a knowledge graph. Traditional methods use a two-step process with separate models for entity recognition and disambiguation, which can be computationally intensive and less effective. We propose a fine-tuned model that jointly integrates entity recognition and disambiguation in a unified framework. Furthermore, our approach leverages large language models to enrich the context of entity mentions, yielding better performance in entity disambiguation. We evaluated our approach on benchmark datasets and compared with several baselines. The evaluation results show that our approach achieves state-of-the-art performance on out-of-domain datasets.
[2] Small Language Models Offer Significant Potential for Science Community
Jian Zhang
Main category: cs.CL
TL;DR: A framework using small language models (MiniLMs) for efficient information retrieval from geoscience literature, offering precise, cost-effective alternatives to large language models with capabilities for semantic search, sentiment analysis, and trend tracking.
Details
Motivation: Address concerns about information biases and computational costs of large language models (LLMs) by developing a more precise, rapid, and cost-effective approach for geoscience literature analysis using freely available small language models.
Method: Constructed a curated corpus of ~77 million sentences from 95 leading geoscience journals (2000-2024), using MiniLMs for semantic search and sentence-level indexing, with additional sentiment analysis and unsupervised clustering for topical analysis.
Result: MiniLMs excel at identifying substantial amounts of expert-verified information with established sources, particularly quantitative findings, and enable tracking of research evolution, priorities, advancements, and emerging questions through emotional tone and topical cluster analysis.
Conclusion: MiniLMs hold significant potential for geoscience applications including fact/image retrieval, trend analysis, contradiction analysis, and education, providing a powerful alternative to LLMs for domain-specific information extraction.
Abstract: Recent advancements in natural language processing, particularly with large language models (LLMs), are transforming how scientists engage with the literature. While the adoption of LLMs is increasing, concerns remain regarding potential information biases and computational costs. Rather than relying on LLMs, I developed a framework to evaluate the feasibility of precise, rapid, and cost-effective information retrieval from extensive geoscience literature using freely available small language models (MiniLMs). A curated corpus of approximately 77 million high-quality sentences, extracted from 95 leading peer-reviewed geoscience journals such as Geophysical Research Letters and Earth and Planetary Science Letters published from 2000 to 2024, was constructed. MiniLMs enable a computationally efficient approach for extracting relevant domain-specific information from these corpora through semantic search techniques and sentence-level indexing. This approach, unlike LLMs such as ChatGPT-4 that often produce generalized responses, excels at identifying substantial amounts of expert-verified information with established, multi-disciplinary sources, especially for information with quantitative findings. Furthermore, by analyzing emotional tone via sentiment analysis and topical clusters through unsupervised clustering within sentences, MiniLMs provide a powerful tool for tracking the evolution of conclusions, research priorities, advancements, and emerging questions within geoscience communities. Overall, MiniLMs hold significant potential within the geoscience community for applications such as fact and image retrievals, trend analyses, contradiction analyses, and educational purposes.
[3] When Models Can’t Follow: Testing Instruction Adherence Across 256 LLMs
Richard J. Young, Brandon Gillins, Alice M. Matthews
Main category: cs.CL
TL;DR: A streamlined evaluation framework using 20 carefully designed prompts to assess LLM instruction-following capabilities across diverse task categories, tested on 256 models via large-scale empirical study.
Details
Motivation: Systematic evaluation of instruction-following capabilities remains challenging, and existing benchmarks may be memorized by newer models, requiring novel evaluation approaches to assess genuine capabilities.
Method: Developed a compact test suite with 20 prompts targeting distinct aspects of instruction following (format compliance, content constraints, logical sequencing, multi-step execution), tested on 256 verified working models from OpenRouter.
Result: Revealed consistent failure modes and identified specific instruction types posing particular challenges across models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations.
Conclusion: Provides both a practical evaluation tool and comprehensive empirical analysis of instruction-following capabilities across contemporary LLMs, offering a practical diagnostic alternative to resource-intensive benchmarks.
Abstract: Despite widespread deployment of Large Language Models, systematic evaluation of instruction-following capabilities remains challenging. While comprehensive benchmarks exist, focused assessments that quickly diagnose specific instruction adherence patterns are valuable. As newer models may be trained on existing benchmarks, novel evaluation approaches are needed to assess genuine capabilities rather than memorized performance. This paper presents a streamlined evaluation framework using twenty carefully designed prompts to assess LLM instruction-following across diverse task categories. We demonstrate this framework through a large-scale empirical study conducted on October 14, 2025, testing 256 verified working models from 331 available via OpenRouter. To ensure methodological rigor and prevent selection bias, we first verified each model’s basic functionality before inclusion. Unlike large-scale benchmarks requiring extensive computational resources, our approach offers a practical diagnostic tool researchers and practitioners can readily apply. Our methodology builds upon verifiable instructions while introducing a compact test suite balancing comprehensiveness with efficiency. Each prompt targets distinct aspects of instruction following, including format compliance, content constraints, logical sequencing, and multi-step task execution. We evaluate models from major providers (OpenAI, Anthropic, Google, Meta, Mistral) and emerging implementations (Qwen, DeepSeek, community models), providing comparative performance analysis. Our findings reveal consistent failure modes and identify specific instruction types posing particular challenges. This work contributes both a practical evaluation tool and one of the most comprehensive empirical analyses of instruction-following capabilities across the contemporary LLM landscape.
[4] Transformer-Based Low-Resource Language Translation: A Study on Standard Bengali to Sylheti
Mangsura Kabir Oni, Tabia Tanzin Prama
Main category: cs.CL
TL;DR: Fine-tuned multilingual Transformer models outperform zero-shot LLMs for Bengali-to-Sylheti translation, with mBART-50 achieving best adequacy and MarianMT best character-level fidelity.
Details
Motivation: Machine translation has advanced but low-resource languages like Sylheti remain underexplored, creating a need for inclusive language technologies.
Method: Fine-tuning multilingual Transformer models (mBART-50, MarianMT) and comparing with zero-shot large language models for Bengali-to-Sylheti translation.
Result: Fine-tuned models significantly outperform LLMs, with mBART-50 achieving highest translation adequacy and MarianMT showing strongest character-level fidelity.
Conclusion: Task-specific adaptation is crucial for underrepresented languages, contributing to inclusive language technologies.
Abstract: Machine Translation (MT) has advanced from rule-based and statistical methods to neural approaches based on the Transformer architecture. While these methods have achieved impressive results for high-resource languages, low-resource varieties such as Sylheti remain underexplored. In this work, we investigate Bengali-to-Sylheti translation by fine-tuning multilingual Transformer models and comparing them with zero-shot large language models (LLMs). Experimental results demonstrate that fine-tuned models significantly outperform LLMs, with mBART-50 achieving the highest translation adequacy and MarianMT showing the strongest character-level fidelity. These findings highlight the importance of task-specific adaptation for underrepresented languages and contribute to ongoing efforts toward inclusive language technologies.
[5] DuoLens: A Framework for Robust Detection of Machine-Generated Multilingual Text and Code
Shriyansh Agrawal, Aidan Lau, Sanyam Shah, Ahan M R, Kevin Zhu, Sunishchal Dev, Vasu Sharma
Main category: cs.CL
TL;DR: Fine-tuning small encoder models (RoBERTa, CodeBERTa) for machine-generated content detection outperforms large language models with 8-12x lower latency and 3-5x less VRAM usage while maintaining high accuracy.
Details
Motivation: Current machine-generated content detectors using zero-shot methods like Fast DetectGPT or GPTZero have high computational costs or insufficient accuracy, creating a need for more efficient and accurate solutions.
Method: Fine-tuned encoder-only Small Language Models (SLMs) including RoBERTa and CodeBERTa using specialized datasets for source code and natural language binary classification tasks.
Result: Achieved AUROC = 0.97-0.99 and macro-F1 0.89-0.94 with 8-12x latency reduction and 3-5x VRAM reduction. Performance remained ≥92% of clean AUROC under cross-generator shifts and adversarial transformations.
Conclusion: SLMs significantly outperform LLMs for machine-generated content detection while using much less computational resources, demonstrating superior efficiency and robustness.
Abstract: The prevalence of Large Language Models (LLMs) for generating multilingual text and source code has only increased the imperative for machine-generated content detectors to be accurate and efficient across domains. Current detectors, predominantly utilizing zero-shot methods such as Fast DetectGPT or GPTZero, either incur high computational cost or lack sufficient accuracy, often with a trade-off between the two, leaving room for further improvement. To address these gaps, we propose fine-tuning encoder-only Small Language Models (SLMs), in particular the pre-trained models RoBERTa and CodeBERTa, using specialized datasets on source code and other natural language to prove that for the task of binary classification, SLMs outperform LLMs by a huge margin whilst using a fraction of the compute. Our encoders achieve AUROC $= 0.97$ to $0.99$ and macro-F1 $0.89$ to $0.94$ while reducing latency by $8$-$12\times$ and peak VRAM by $3$-$5\times$ at $512$-token inputs. Under cross-generator shifts and adversarial transformations (paraphrase, back-translation; code formatting/renaming), performance retains $\geq 92\%$ of clean AUROC. We release training and evaluation scripts with seeds and configs; a reproducibility checklist is also included.
[6] Improving Topic Modeling of Social Media Short Texts with Rephrasing: A Case Study of COVID-19 Related Tweets
Wangjiaxuan Xin, Shuhua Yin, Shi Chen, Yaorong Ge
Main category: cs.CL
TL;DR: TM-Rephrase is a model-agnostic framework that uses LLMs to rephrase raw tweets into formal language before topic modeling, improving topic coherence, uniqueness, and diversity while reducing redundancy.
Details
Motivation: Traditional topic modeling struggles with social media short texts due to their brevity, informality, and noise, leading to incoherent or redundant topics that are difficult to interpret.
Method: Developed TM-Rephrase framework using two rephrasing strategies (general and colloquial-to-formal) with LLMs on 25,027 COVID-19 tweets, tested on multiple topic modeling methods including LDA.
Result: TM-Rephrase improved topic coherence, uniqueness, and diversity while reducing redundancy across most algorithms, with colloquial-to-formal strategy showing greatest gains, especially for LDA.
Conclusion: The study provides a model-agnostic approach to enhance topic modeling for public health social media analysis, with broader implications for understanding public discourse in health crises and other domains.
Abstract: Social media platforms such as Twitter (now X) provide rich data for analyzing public discourse, especially during crises such as the COVID-19 pandemic. However, the brevity, informality, and noise of social media short texts often hinder the effectiveness of traditional topic modeling, producing incoherent or redundant topics that are difficult to interpret. To address these challenges, we have developed TM-Rephrase, a model-agnostic framework that leverages large language models (LLMs) to rephrase raw tweets into more standardized and formal language prior to topic modeling. Using a dataset of 25,027 COVID-19-related Twitter posts, we investigate the effects of two rephrasing strategies, general rephrasing and colloquial-to-formal rephrasing, on multiple topic modeling methods. Results demonstrate that TM-Rephrase improves three metrics of topic modeling performance (topic coherence, topic uniqueness, and topic diversity) while reducing topic redundancy for most topic modeling algorithms, with the colloquial-to-formal strategy yielding the greatest gains, especially for the Latent Dirichlet Allocation (LDA) algorithm. This study contributes a model-agnostic approach to enhancing topic modeling in public health-related social media analysis, with broad implications for improved understanding of public discourse in health crises as well as other important domains.
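To make the rephrase-then-model idea concrete, here is a minimal sketch under simplified assumptions: the rephrasing step is a placeholder function standing in for an LLM call, and the topic model is scikit-learn's LDA, one of the algorithms the paper evaluates. The slang dictionary and toy tweets are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def rephrase_colloquial_to_formal(tweet: str) -> str:
    # Placeholder: in TM-Rephrase an LLM rewrites the tweet into formal language.
    slang = {"u": "you", "r": "are", "gonna": "going to", "ppl": "people", "smh": ""}
    return " ".join(filter(None, (slang.get(w, w) for w in tweet.split())))

raw_tweets = [
    "u gonna get the vaccine or nah",
    "masks r annoying but whatever keeps ppl safe",
    "hospitals near me are packed again smh",
]
formal_texts = [rephrase_colloquial_to_formal(t) for t in raw_tweets]

# Standard topic-modeling pass over the rephrased texts.
vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(formal_texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Inspect top words per topic, as one would when judging coherence.
terms = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top_words = [terms[i] for i in component.argsort()[-3:][::-1]]
    print(f"topic {k}: {top_words}")
```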
[7] Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
Yuu Jinnai
Main category: cs.CL
TL;DR: MBR decoding outperforms beam search for speech-to-text tasks like ASR and ST, showing promise for high-accuracy offline applications.
Details
Motivation: Since MBR decoding works well for text-to-text generation tasks, the authors investigate if it can also improve performance for speech-to-text tasks where beam search is currently standard practice.
Method: Evaluated MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models, comparing against beam search.
Result: MBR decoding outperformed beam search in most experimental settings for both ASR and ST tasks.
Conclusion: MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy.
Abstract: Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr
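As a rough illustration of sample-based MBR decoding, the sketch below picks, from a set of sampled transcripts, the one with the highest average utility against all other samples. The token-overlap F1 utility and the toy hypotheses are assumptions of this sketch; the paper's actual utility function and sampling setup may differ.

```python
from collections import Counter

def token_f1(hyp: str, ref: str) -> float:
    """Rough token-overlap F1 between two transcripts (stand-in utility)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

def mbr_decode(hypotheses: list[str]) -> str:
    """Return the sampled hypothesis with the highest average utility against the rest."""
    best_idx, best_score = 0, float("-inf")
    for i, cand in enumerate(hypotheses):
        score = sum(token_f1(cand, other) for j, other in enumerate(hypotheses) if j != i)
        if score > best_score:
            best_idx, best_score = i, score
    return hypotheses[best_idx]

# Hypotheses sampled from an ASR model (e.g., Whisper); MBR picks the consensus-like one.
samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "the cat sat on the bat",
]
print(mbr_decode(samples))
```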
[8] Learning from the Best, Differently: A Diversity-Driven Rethinking on Data Selection
Hongyi He, Xiao Liu, Zhenghao Lin, Mingni Tang, Yi Cheng, Jintao Wang, Wenjie Li, Peng Cheng, Yeyun Gong
Main category: cs.CL
TL;DR: ODiS is a novel data selection method that uses orthogonal decomposition of quality metrics to ensure both quality and diversity in pre-training data for LLMs, outperforming traditional score-based approaches.
Details
Motivation: Existing score-based data selection methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity, leading to performance degradation.
Method: ODiS evaluates data from multiple dimensions, decorrelates scores via PCA, trains RoBERTa-based scorers for each orthogonal dimension, and selects top-scored data from each dimension.
Result: ODiS-selected data shows less than 2% inter-dimension overlap, confirming orthogonality, and models trained with this data significantly outperform other baselines on downstream benchmarks.
Conclusion: Orthogonal, diversity-aware data selection is necessary for LLMs, and ODiS effectively preserves both quality and diversity during data selection.
Abstract: High-quality pre-training data is crucial for large language models, where quality captures factual reliability and semantic value, and diversity ensures broad coverage and distributional heterogeneity. Existing approaches typically rely on single- or multi-dimensional score-based selection. However, directly selecting top-scored data often degrades performance, and sampling from a broader range is required to recover results. This non-monotonicity between dataset scores and downstream benchmark results reveals a fundamental bias: score-based methods collapse correlated dimensions, causing top-scored data to appear high-quality while systematically overlooking diversity. We argue that ensuring diversity requires decomposing correlated metrics into orthogonal feature dimensions, from which the top-scored data can be directly selected. Therefore, we propose the Orthogonal Diversity-Aware Selection (ODiS) algorithm, which preserves both quality and diversity during data selection. First, ODiS evaluates data from multiple dimensions, covering language quality, knowledge quality, and comprehension difficulty. The multi-dimensional scores are then decorrelated via Principal Component Analysis (PCA), yielding orthogonal evaluation dimensions. For each dimension, a RoBERTa-based scorer is trained to regress the data onto PCA-projected scores, enabling scalable inference on large corpora. Finally, ODiS constructs the training dataset by selecting top-scored data within each orthogonal dimension, thereby ensuring both quality and diversity. Empirical results show that ODiS-selected data exhibit less than 2% inter-dimension overlap, confirming orthogonality between dimensions. More importantly, models trained with ODiS-selected data significantly outperform other baselines on downstream benchmarks, highlighting the necessity of orthogonal, diversity-aware data selection for LLMs.
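A minimal sketch of the selection mechanism, assuming synthetic correlated scores in place of the paper's trained scorers: decorrelate the multi-dimensional scores with PCA, then keep the top-scored documents along each orthogonal component. The dimension count, score distributions, and 5% threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_docs = 1000
# Synthetic correlated raw scores standing in for language quality,
# knowledge quality, and comprehension difficulty.
base = rng.normal(size=(n_docs, 1))
raw_scores = np.hstack([base + 0.3 * rng.normal(size=(n_docs, 1)) for _ in range(3)])

# Decorrelate the score dimensions into orthogonal evaluation axes.
pca = PCA(n_components=3)
orth_scores = pca.fit_transform(raw_scores)

# Select the top 5% of documents along each orthogonal dimension.
k = int(0.05 * n_docs)
selected = set()
for dim in range(orth_scores.shape[1]):
    top_idx = np.argsort(orth_scores[:, dim])[-k:]
    selected.update(top_idx.tolist())

print(f"selected {len(selected)} of {n_docs} documents")
```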
[9] Which Evaluation for Which Model? A Taxonomy for Speech Model Assessment
Maureen de Seyssel, Eeshan Gunesh Dhekane
Main category: cs.CL
TL;DR: This paper proposes a unified taxonomy for evaluating speech foundation models across different tasks and model types, defining three axes: evaluation aspect, model capabilities, and task requirements.
Details
Motivation: Current evaluation of speech foundation models is disjointed across tasks and model types, with different models excelling at different aspects of speech processing, requiring a systematic framework to determine appropriate evaluation methods.
Method: The authors develop a three-axis taxonomy: evaluation aspect being measured, model capabilities required, and task/protocol requirements needed. They classify existing evaluations and benchmarks along these axes.
Result: The taxonomy provides a principled framework for aligning models with suitable evaluation methods and reveals systematic gaps in current evaluations (e.g., limited coverage of prosody, interaction, reasoning).
Conclusion: This work offers both a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models, highlighting priorities for future benchmark design.
Abstract: Speech foundation models have recently achieved remarkable capabilities across a wide range of tasks. However, their evaluation remains disjointed across tasks and model types. Different models excel at distinct aspects of speech processing and thus require different evaluation protocols. This paper proposes a unified taxonomy that addresses the question: Which evaluation is appropriate for which model? The taxonomy defines three orthogonal axes: the evaluation aspect being measured, the model capabilities required to attempt the task, and the task or protocol requirements needed to perform it. We classify a broad set of existing evaluations and benchmarks along these axes, spanning areas such as representation learning, speech generation, and interactive dialogue. By mapping each evaluation to the capabilities a model exposes (e.g., speech generation, real-time processing) and to its methodological demands (e.g., fine-tuning data, human judgment), the taxonomy provides a principled framework for aligning models with suitable evaluation methods. It also reveals systematic gaps, such as limited coverage of prosody, interaction, or reasoning, that highlight priorities for future benchmark design. Overall, this work offers a conceptual foundation and practical guide for selecting, interpreting, and extending evaluations of speech models.
[10] Context-aware Fairness Evaluation and Mitigation in LLMs
Afrozah Nadeem, Mark Dras, Usman Naseem
Main category: cs.CL
TL;DR: Dynamic reversible pruning framework for bias mitigation in LLMs that adaptively masks context-aware neuron activations during inference, enabling flexible fairness control without permanent model changes.
Details
Motivation: Existing bias mitigation methods are computationally expensive, irreversible, and static, unable to adapt to changing conversational contexts. Pruning methods remove neurons permanently, limiting model adaptability.
Method: Dynamic reversible pruning framework that detects context-aware neuron activations and applies adaptive masking during inference time to modulate their influence on generation.
Result: Provides fine-grained, memory-aware bias mitigation while preserving knowledge and maintaining coherent behavior across multilingual single- and multi-turn dialogues.
Conclusion: The framework enables dynamic fairness control in real-world conversational AI through inference-time adaptation without permanent model modifications.
Abstract: Large language models often display undesirable behaviors embedded in their internal representations, undermining fairness and leading to inconsistency drift, amplification of harmful content, and the propagation of unwanted patterns during extended dialogues and conversations. Although training-time or data-centric methods attempt to reduce these effects, they are computationally expensive, irreversible once deployed, and slow to adapt to new conversational contexts. Pruning-based methods provide a flexible and transparent way to reduce bias by adjusting the neurons responsible for certain behaviors. However, most existing approaches are static; once a neuron is removed, the model loses the ability to adapt when the conversation or context changes. To address this, we propose a dynamic, reversible, pruning-based framework that detects context-aware neuron activations and applies adaptive masking to modulate their influence during generation. Our inference-time solution provides fine-grained, memory-aware mitigation with knowledge-preserving, more coherent behavior across multilingual single- and multi-turn dialogues, enabling dynamic fairness control in real-world conversational AI.
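The core mechanism, adaptive and reversible masking of selected activations at inference time, can be sketched with a PyTorch forward hook. The toy module, the choice of masked neurons, and the masking rule are assumptions for illustration; the paper's context-aware detection step is not reproduced here.

```python
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Stand-in for a transformer feed-forward block."""
    def __init__(self, d_in=16, d_hidden=32, d_out=16):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyMLP()

# Mask chosen for the current conversational context (here: hand-picked indices).
mask = torch.ones(32)
mask[[3, 7, 11]] = 0.0

def apply_mask(module, inputs, output):
    # Scale hidden activations; nothing in the weights is changed.
    return output * mask

handle = model.fc1.register_forward_hook(apply_mask)
masked_out = model(torch.randn(1, 16))   # generation with masked neurons
handle.remove()                          # fully reversible: original model restored
unmasked_out = model(torch.randn(1, 16))
```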
[11] MMAO-Bench: MultiModal All in One Benchmark Reveals Compositional Law between Uni-modal and Omni-modal in OmniModels
Chen Chen, ZeYang Hu, Fengjiao Chen, Liya Ma, Jiaxing Liu, Xiaoyu Li, Xuezhi Cao
Main category: cs.CL
TL;DR: MMAO-Bench is a novel multimodal benchmark with 1880 curated samples across 44 tasks, designed to evaluate both uni-modal and omni-modal understanding in multimodal large language models.
Details
Motivation: To address the unclear correlation between uni-modal and omni-modal capabilities in multimodal models and provide comprehensive evaluation to drive intelligence evolution.
Method: Created a benchmark with human-curated samples, multiple task types, and innovative multi-step open-ended questions to assess complex reasoning across visual, audio, and language modalities.
Result: Experimental results reveal a compositional law between cross-modal and uni-modal performance, with omni-modal capability showing bottleneck effects on weak models and synergistic promotion on strong models.
Conclusion: MMAO-Bench effectively evaluates multimodal understanding and reveals important relationships between different modal capabilities, providing insights for model development.
Abstract: Multimodal Large Language Models have been progressing from uni-modal understanding toward unifying visual, audio, and language modalities, collectively termed omni models. However, the correlation between uni-modal and omni-modal capabilities remains unclear, which requires comprehensive evaluation to drive the intelligence evolution of omni models. In this work, we propose a novel, high-quality, and diverse omni model benchmark, MultiModal All in One Benchmark (MMAO-Bench), which effectively assesses both uni-modal and omni-modal understanding capabilities. The benchmark consists of 1,880 human-curated samples across 44 task types and an innovative multi-step open-ended question type that better assesses complex reasoning tasks. Experimental results show a compositional law between cross-modal and uni-modal performance: omni-modal capability manifests as a bottleneck effect on weak models, while exhibiting synergistic promotion on strong models.
[12] Misinformation Detection using Large Language Models with Explainability
Jainee Patel, Chintan Bhatt, Himani Trivedi, Thanh Thi Nguyen
Main category: cs.CL
TL;DR: An explainable pipeline using transformer-based PLMs (RoBERTa and DistilBERT) for misinformation detection, achieving comparable accuracy with reduced computational cost through optimized fine-tuning and interpretability methods.
Details
Motivation: To address the problem of misinformation spread on online platforms that undermines trust and informed decision making, by developing an efficient and transparent detection system.
Method: Two-step fine-tuning strategy: first freeze backbone and train classification head, then progressively unfreeze layers with layer-wise learning rate decay. Uses RoBERTa and DistilBERT models with LIME for token-level explanations and SHAP for global feature attribution.
Result: DistilBERT achieves accuracy comparable to RoBERTa while requiring significantly less computational resources. Tested on COVID Fake News and FakeNewsNet GossipCop datasets with unified preprocessing and stratified splits.
Conclusion: Lightweight PLMs can maintain task performance while substantially reducing computational cost, and the explainable pipeline provides faithful local and global justifications without compromising performance, suggesting an effective framework for scalable, trustworthy misinformation detection.
Abstract: The rapid spread of misinformation on online platforms undermines trust among individuals and hinders informed decision making. This paper shows an explainable and computationally efficient pipeline to detect misinformation using transformer-based pretrained language models (PLMs). We optimize both RoBERTa and DistilBERT using a two-step strategy: first, we freeze the backbone and train only the classification head; then, we progressively unfreeze the backbone layers while applying layer-wise learning rate decay. On two real-world benchmark datasets, COVID Fake News and FakeNewsNet GossipCop, we test the proposed approach with a unified protocol of preprocessing and stratified splits. To ensure transparency, we integrate the Local Interpretable Model-Agnostic Explanations (LIME) at the token level to present token-level rationales and SHapley Additive exPlanations (SHAP) at the global feature attribution level. It demonstrates that DistilBERT achieves accuracy comparable to RoBERTa while requiring significantly less computational resources. This work makes two key contributions: (1) it quantitatively shows that a lightweight PLM can maintain task performance while substantially reducing computational cost, and (2) it presents an explainable pipeline that retrieves faithful local and global justifications without compromising performance. The results suggest that PLMs combined with principled fine-tuning and interpretability can be an effective framework for scalable, trustworthy misinformation detection.
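A sketch of the two-step fine-tuning recipe described in the abstract, using the Hugging Face transformers API for DistilBERT. Learning rates, the decay factor, and the grouping of parameters are illustrative assumptions rather than the paper's reported settings.

```python
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Step 1: freeze the backbone and train only the classification head.
for param in model.distilbert.parameters():
    param.requires_grad = False
head_params = list(model.pre_classifier.parameters()) + list(model.classifier.parameters())
head_optimizer = torch.optim.AdamW(head_params, lr=1e-3)
# ... train the head for a few epochs ...

# Step 2: unfreeze the backbone and apply layer-wise learning-rate decay,
# so layers closer to the input receive smaller learning rates.
for param in model.distilbert.parameters():
    param.requires_grad = True

base_lr, decay = 2e-5, 0.9
layers = list(model.distilbert.transformer.layer)
param_groups = [{"params": head_params, "lr": base_lr}]
for depth, layer in enumerate(reversed(layers)):
    param_groups.append({"params": layer.parameters(), "lr": base_lr * decay ** (depth + 1)})
full_optimizer = torch.optim.AdamW(param_groups)
```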
[13] Evaluating LLM Story Generation through Large-scale Network Analysis of Social Structures
Hiroshi Nonaka, K. E. Perry
Main category: cs.CL
TL;DR: A scalable method for evaluating LLM story generation by analyzing social structures as signed character networks, showing LLMs consistently produce stories with tightly-knit positive relationships.
Details
Motivation: Human assessments of LLM creative capabilities are difficult to scale, requiring automated methods to evaluate complex storytelling tasks.
Method: Analyze social structures in narratives as signed character networks, comparing network properties (density, clustering, signed edge weights) across 1,200+ stories from four LLMs and a human-written corpus.
Result: LLM-generated stories consistently show strong bias toward tightly-knit, positive relationships, aligning with prior human assessment findings.
Conclusion: The proposed network analysis approach provides a valuable scalable tool for evaluating limitations and tendencies in LLM creative storytelling.
Abstract: Evaluating the creative capabilities of large language models (LLMs) in complex tasks often requires human assessments that are difficult to scale. We introduce a novel, scalable methodology for evaluating LLM story generation by analyzing underlying social structures in narratives as signed character networks. To demonstrate its effectiveness, we conduct a large-scale comparative analysis using networks from over 1,200 stories, generated by four leading LLMs (GPT-4o, GPT-4o mini, Gemini 1.5 Pro, and Gemini 1.5 Flash) and a human-written corpus. Our findings, based on network properties like density, clustering, and signed edge weights, show that LLM-generated stories consistently exhibit a strong bias toward tightly-knit, positive relationships, which aligns with findings from prior research using human assessment. Our proposed approach provides a valuable tool for evaluating limitations and tendencies in the creative storytelling of current and future LLMs.
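To illustrate the network properties the comparison relies on, the sketch below builds one story's signed character graph with networkx and computes density, clustering, and the mean signed edge weight. The characters and relationship weights are invented for the example.

```python
import networkx as nx

G = nx.Graph()
# Positive weights = friendly/supportive ties, negative weights = antagonistic ties.
G.add_edge("Ada", "Ben", weight=+2)
G.add_edge("Ada", "Cleo", weight=+1)
G.add_edge("Ben", "Cleo", weight=+3)
G.add_edge("Cleo", "Dmitri", weight=-2)

density = nx.density(G)
clustering = nx.average_clustering(G)  # unsigned structural clustering
weights = [d["weight"] for _, _, d in G.edges(data=True)]
mean_signed_weight = sum(weights) / len(weights)

print(f"density={density:.2f}, clustering={clustering:.2f}, "
      f"mean signed weight={mean_signed_weight:.2f}")
```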
[14] Lost in the Maze: Overcoming Context Limitations in Long-Horizon Agentic Search
Howard Yen, Ashwin Paranjape, Mengzhou Xia, Thejas Venkatesh, Jack Hessel, Danqi Chen, Yuhao Zhang
Main category: cs.CL
TL;DR: SLIM is a lightweight framework for long-horizon agentic search that separates retrieval into search/browse tools and periodically summarizes trajectories, achieving better performance with fewer tool calls than existing methods.
Details
Motivation: Existing agentic search frameworks struggle with long trajectories due to context limitations: they accumulate noisy content, hit context/tool budgets, or stop early.
Method: Separates retrieval into distinct search and browse tools, and periodically summarizes the trajectory to keep context concise while enabling longer, more focused searches.
Result: SLIM achieves 56% on BrowseComp and 31% on HLE with o3 base model, outperforming open-source frameworks by 8 and 4 absolute points respectively, with 4-6x fewer tool calls and fewer hallucinations.
Conclusion: SLIM provides an effective solution for long-horizon agentic search with better performance at lower cost, and the analysis framework can inform future long-horizon agents.
Abstract: Long-horizon agentic search requires iteratively exploring the web over long trajectories and synthesizing information across many sources, and is the foundation for enabling powerful applications like deep research systems. In this work, we show that popular agentic search frameworks struggle to scale to long trajectories primarily due to context limitations-they accumulate long, noisy content, hit context window and tool budgets, or stop early. Then, we introduce SLIM (Simple Lightweight Information Management), a simple framework that separates retrieval into distinct search and browse tools, and periodically summarizes the trajectory, keeping context concise while enabling longer, more focused searches. On long-horizon tasks, SLIM achieves comparable performance at substantially lower cost and with far fewer tool calls than strong open-source baselines across multiple base models. Specifically, with o3 as the base model, SLIM achieves 56% on BrowseComp and 31% on HLE, outperforming all open-source frameworks by 8 and 4 absolute points, respectively, while incurring 4-6x fewer tool calls. Finally, we release an automated fine-grained trajectory analysis pipeline and error taxonomy for characterizing long-horizon agentic search frameworks; SLIM exhibits fewer hallucinations than prior systems. We hope our analysis framework and simple tool design inform future long-horizon agents.
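The control-flow pattern behind SLIM, separate search and browse tools plus periodic summarization of the running trajectory, can be sketched as below. All tool bodies and the summarization threshold are placeholders; the real system drives these steps with an LLM and live web access.

```python
def search(query: str) -> list[str]:
    return [f"url for: {query}"]                    # placeholder search tool

def browse(url: str) -> str:
    return f"page text of {url}"                    # placeholder browse tool

def summarize(trajectory: list[str]) -> list[str]:
    return [f"summary of {len(trajectory)} steps"]  # placeholder LLM summary

MAX_CONTEXT_STEPS = 6                               # assumed budget before compressing
trajectory: list[str] = []

for step in range(20):
    query = f"sub-question {step}"                  # placeholder for an LLM-chosen action
    for url in search(query):
        trajectory.append(browse(url))
    if len(trajectory) > MAX_CONTEXT_STEPS:
        trajectory = summarize(trajectory)          # keep the working context concise
```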
[15] ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge
Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
Main category: cs.CL
TL;DR: ProfBench is a comprehensive benchmark with over 7000 expert-evaluated response-criterion pairs across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA domains, designed to evaluate LLMs on professional document processing and report generation tasks.
Details
Motivation: Current LLM evaluation is limited to verifiable tasks like mathematics and programming, but real-world applications require assessing LLMs' ability to process professional documents, synthesize information, and generate comprehensive reports.
Method: Created ProfBench with human-expert evaluations across four professional domains, and developed affordable LLM-Judges that mitigate self-enhancement bias while reducing evaluation costs by 2-3 orders of magnitude.
Result: ProfBench poses significant challenges even for state-of-the-art LLMs, with top models like GPT-5-high achieving only 65.9% overall performance. Notable disparities exist between proprietary and open-weight models, and extended thinking plays a crucial role in complex professional tasks.
Conclusion: ProfBench provides a fair and accessible evaluation framework for professional-domain LLM capabilities, revealing current limitations and the importance of extended reasoning in complex tasks.
Abstract: Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench
[16] Dynamic Evaluation for Oversensitivity in LLMs
Sophia Xiao Pu, Sitao Cheng, Xin Eric Wang, William Yang Wang
Main category: cs.CL
TL;DR: OVERBENCH is a dynamic benchmark that generates model-specific challenging datasets to address LLM oversensitivity, capturing emerging defensive patterns across 25 models with 450,000 samples.
Details
Motivation: Existing benchmarks use static datasets that degrade over time due to model evolution, leading to data contamination and reduced evaluative power for detecting oversensitivity in language models.
Method: Developed a framework that dynamically generates model-specific challenging datasets to capture emerging defensive patterns and align with each model’s unique behavior.
Result: Constructed OVERBENCH benchmark aggregating datasets across diverse LLM families with 450,000 samples from 25 models, providing continuous monitoring of defensive triggers.
Conclusion: OVERBENCH offers a dynamic and evolving perspective on oversensitivity, highlighting vulnerabilities that static datasets overlook as models advance.
Abstract: Oversensitivity occurs when language models defensively reject prompts that are actually benign. This behavior not only disrupts user interactions but also obscures the boundary between harmful and harmless content. Existing benchmarks rely on static datasets that degrade over time as models evolve, leading to data contamination and diminished evaluative power. To address this, we develop a framework that dynamically generates model-specific challenging datasets, capturing emerging defensive patterns and aligning with each model’s unique behavior. Building on this approach, we construct OVERBENCH, a benchmark that aggregates these datasets across diverse LLM families, encompassing 450,000 samples from 25 models. OVERBENCH provides a dynamic and evolving perspective on oversensitivity, allowing for continuous monitoring of defensive triggers as models advance, highlighting vulnerabilities that static datasets overlook.
[17] Are they lovers or friends? Evaluating LLMs’ Social Reasoning in English and Korean Dialogues
Eunsu Kim, Junyeong Park, Juhyun Oh, Kiwoong Park, Seyoung Song, A. Seza Dogruoz, Najoung Kim, Alice Oh
Main category: cs.CL
TL;DR: SCRIPTS dataset evaluates LLMs’ social reasoning for interpersonal relationships in dialogues, showing performance gaps between English (75-80%) and Korean (58-69%), with models often selecting unlikely relationships and minimal benefits from reasoning techniques.
Details
Motivation: As LLMs are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts become critical for effective communication and understanding.
Method: Created SCRIPTS dataset with 1k dialogues from movie scripts in English and Korean, annotated with probabilistic relational labels by native speakers, then evaluated nine models on their ability to infer interpersonal relationships.
Result: Proprietary LLMs achieved 75-80% on English but dropped to 58-69% on Korean, with models selecting unlikely relationships 10-25% of the time. Thinking models and chain-of-thought provided minimal benefits and sometimes amplified social biases.
Conclusion: Current LLMs have significant limitations in social reasoning capabilities, highlighting the need for developing socially-aware language models that can better understand interpersonal relationships across different languages and cultural contexts.
Abstract: As large language models (LLMs) are increasingly used in human-AI interactions, their social reasoning capabilities in interpersonal contexts are critical. We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts. The task involves evaluating models’ social reasoning capability to infer the interpersonal relationships (e.g., friends, sisters, lovers) between speakers in each dialogue. Each dialogue is annotated with probabilistic relational labels (Highly Likely, Less Likely, Unlikely) by native (or equivalent) Korean and English speakers from Korea and the U.S. Evaluating nine models on our task, current proprietary LLMs achieve around 75-80% on the English dataset, whereas their performance on Korean drops to 58-69%. More strikingly, models select Unlikely relationships in 10-25% of their responses. Furthermore, we find that thinking models and chain-of-thought prompting, effective for general reasoning, provide minimal benefits for social reasoning and occasionally amplify social biases. Our findings reveal significant limitations in current LLMs’ social reasoning capabilities, highlighting the need for efforts to develop socially-aware language models.
[18] Re:Member: Emotional Question Generation from Personal Memories
Zackary Rackauckas, Nobuaki Minematsu, Julia Hirschberg
Main category: cs.CL
TL;DR: Re:Member is an emotionally expressive language learning system that uses personal videos and stylized spoken questions to enhance second language acquisition through affective recall and conversational engagement.
Details
Motivation: To explore how emotionally expressive, memory-grounded interaction can support more engaging second language learning by leveraging personal media and affective engagement.
Method: Combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline that aligns emotional tone with visual context.
Result: The system generates stylized spoken questions in the target language using expressive speech styles (whispers, late-night tones) to evoke specific moods and encourage affective recall.
Conclusion: Re:Member highlights the importance of affect and personal media in learner-centered educational technologies as a stylized interaction probe.
Abstract: We present Re:Member, a system that explores how emotionally expressive, memory-grounded interaction can support more engaging second language (L2) learning. By drawing on users’ personal videos and generating stylized spoken questions in the target language, Re:Member is designed to encourage affective recall and conversational engagement. The system aligns emotional tone with visual context, using expressive speech styles such as whispers or late-night tones to evoke specific moods. It combines WhisperX-based transcript alignment, 3-frame visual sampling, and Style-BERT-VITS2 for emotional synthesis within a modular generation pipeline. Designed as a stylized interaction probe, Re:Member highlights the role of affect and personal media in learner-centered educational technologies.
[19] When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar, Sheri Grach, Lindsay Bertrand, Lames Danok, Jimmy Huang, Frank Rudzicz, Elham Dolatabadi
Main category: cs.CL
TL;DR: The paper introduces MentalBench-100k and MentalAlign-70k benchmarks for evaluating LLMs in mental health support, addressing limitations of existing datasets and providing frameworks for assessing judge reliability.
Details
Motivation: Existing benchmarks for evaluating LLMs in mental health support are limited in scale and reliability, often rely on synthetic or social media data, and lack frameworks for assessing when automated judges can be trusted.
Method: Created two benchmarks: MentalBench-100k (100,000 response pairs from real scenarios) and MentalAlign-70k (70,000 ratings comparing LLM judges with human experts). Used the Affective Cognitive Agreement Framework with ICC to quantify agreement, consistency, and bias.
Result: LLM judges show systematic inflation, strong reliability for cognitive attributes (guidance, informativeness), reduced precision for empathy, and some unreliability in safety and relevance.
Conclusion: The benchmarks establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health support, addressing key gaps in existing evaluation frameworks.
Abstract: Evaluating Large Language Models (LLMs) for mental health support is challenging due to the emotionally and cognitively complex nature of therapeutic dialogue. Existing benchmarks are limited in scale, reliability, often relying on synthetic or social media data, and lack frameworks to assess when automated judges can be trusted. To address the need for large-scale dialogue datasets and judge reliability assessment, we introduce two benchmarks that provide a framework for generation and evaluation. MentalBench-100k consolidates 10,000 one-turn conversations from three real scenarios datasets, each paired with nine LLM-generated responses, yielding 100,000 response pairs. MentalAlign-70k reframes evaluation by comparing four high-performing LLM judges with human experts across 70,000 ratings on seven attributes, grouped into Cognitive Support Score (CSS) and Affective Resonance Score (ARS). We then employ the Affective Cognitive Agreement Framework, a statistical methodology using intraclass correlation coefficients (ICC) with confidence intervals to quantify agreement, consistency, and bias between LLM judges and human experts. Our analysis reveals systematic inflation by LLM judges, strong reliability for cognitive attributes such as guidance and informativeness, reduced precision for empathy, and some unreliability in safety and relevance. Our contributions establish new methodological and empirical foundations for reliable, large-scale evaluation of LLMs in mental health. We release the benchmarks and codes at: https://github.com/abeerbadawi/MentalBench/
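For readers unfamiliar with the agreement statistic, the sketch below computes a two-way random-effects intraclass correlation, ICC(2,1), between one human expert and one LLM judge from a small invented ratings matrix. The formula follows the standard Shrout–Fleiss definition; whether the paper uses exactly this ICC variant is an assumption.

```python
import numpy as np

def icc_2_1(Y: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater."""
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1, keepdims=True)   # per-response means
    col_means = Y.mean(axis=0, keepdims=True)   # per-rater means
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)
    mse = ((Y - row_means - col_means + grand) ** 2).sum() / ((n - 1) * (k - 1))
    return float((msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n))

# Toy ratings: 5 responses rated by [human expert, LLM judge];
# the judge systematically inflates by about one point.
ratings = np.array([
    [4, 5],
    [2, 3],
    [5, 5],
    [3, 4],
    [1, 2],
], dtype=float)
print(f"ICC(2,1) = {icc_2_1(ratings):.3f}")
```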
[20] From Memorization to Generalization: Fine-Tuning Large Language Models for Biomedical Term-to-Identifier Normalization
Suswitha Pericharla, Daniel B. Hier, Tayo Obafemi-Ajayi
Main category: cs.CL
TL;DR: Fine-tuning LLMs for biomedical term normalization shows varying success depending on terminology characteristics - gene mappings benefit from semantic generalization while GO and HPO mappings are limited to rote learning due to non-lexicalized identifiers.
Details
Motivation: To understand why large language models perform unevenly across different biomedical terminologies for automated term normalization, which is essential for semantic interoperability in biomedical data integration.
Method: Evaluated fine-tuning of Llama 3.1 8B across multiple biomedical ontologies (GO, HPO, GENE), analyzing both memorization (training-term performance) and generalization (validation-term performance), with embedding analyses to examine semantic alignment.
Result: GO mappings showed strong memorization gains (77% improvement) but HPO showed minimal improvement. Generalization occurred only for protein-gene mappings (13.9% gain). Success depended on identifier popularity and lexicalization - popular identifiers enhanced memorization while lexicalized identifiers enabled semantic generalization.
Conclusion: Fine-tuning success for biomedical term normalization depends on two key factors: identifier popularity (enhances memorization) and lexicalization (enables generalization), providing a predictive framework for when fine-tuning will be effective versus limited to rote learning.
Abstract: Effective biomedical data integration depends on automated term normalization, the mapping of natural language biomedical terms to standardized identifiers. This linking of terms to identifiers is essential for semantic interoperability. Large language models (LLMs) show promise for this task but perform unevenly across terminologies. We evaluated both memorization (training-term performance) and generalization (validation-term performance) across multiple biomedical ontologies. Fine-tuning Llama 3.1 8B revealed marked differences by terminology. GO mappings showed strong memorization gains (up to 77% improvement in term-to-identifier accuracy), whereas HPO showed minimal improvement. Generalization occurred only for protein-gene (GENE) mappings (13.9% gain), while fine-tuning for HPO and GO yielded negligible transfer. Baseline accuracy varied by model scale, with GPT-4o outperforming both Llama variants for all terminologies. Embedding analyses showed tight semantic alignment between gene symbols and protein names but weak alignment between terms and identifiers for GO or HPO, consistent with limited lexicalization. Fine-tuning success depended on two interacting factors: identifier popularity and lexicalization. Popular identifiers were more likely encountered during pretraining, enhancing memorization. Lexicalized identifiers, such as gene symbols, enabled semantic generalization. By contrast, arbitrary identifiers in GO and HPO constrained models to rote learning. These findings provide a predictive framework for when fine-tuning enhances factual recall versus when it fails due to sparse or non-lexicalized identifiers.
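The embedding-alignment observation can be illustrated with a quick cosine-similarity check between a term and its identifier, comparing a lexicalized gene symbol with an arbitrary ontology code. The sentence-transformers model and the example pairs are assumptions of this sketch, not the paper's setup.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("tumor protein p53", "TP53"),        # lexicalized identifier (gene symbol)
    ("short stature", "HP:0004322"),      # arbitrary, non-lexicalized HPO code
]
for term, identifier in pairs:
    emb = model.encode([term, identifier])
    sim = float(cos_sim(emb[0], emb[1]))
    print(f"{term!r} vs {identifier!r}: cosine similarity = {sim:.2f}")
```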
[21] That’s Deprecated! Understanding, Detecting, and Steering Knowledge Conflicts in Language Models for Code Generation
Jaesung Bae, Cameron Churchwell, Mitchell Hermon, Tsun-An Hsieh, Jocelyn Xu, Yekaterina Yegorova, Mark Hasegawa-Johnson, Heng Ji
Main category: cs.CL
TL;DR: This paper investigates how LLMs handle conflicts between their parametric knowledge and prompt information, extending knowledge conflict research to code generation with a new framework and evaluation method.
Details
Motivation: To understand LLM behavior when faced with knowledge conflicts in code generation scenarios, building on prior QA research about parametric knowledge vs. external information conflicts.
Method: Proposed a domain-agnostic framework for constructing and interpreting knowledge conflicts, developed a novel evaluation method and dataset specifically for code conflict scenarios, and used activation-level steering techniques.
Result: Large LLMs encode knowledge conflict awareness with up to 80.65% detection accuracy. Activation-level steering achieved up to 12.6% improvement over random baseline, but effectiveness depends on model size, task domain, and steering direction balance.
Conclusion: LLMs can detect knowledge conflicts effectively, and activation steering shows promise but requires careful consideration of model characteristics and task parameters for optimal performance.
Abstract: This paper investigates how large language models (LLMs) behave when faced with discrepancies between their parametric knowledge and conflicting information contained in a prompt. Building on prior question-answering (QA) research, we extend the investigation of knowledge conflicts to the realm of code generation. We propose a domain-agnostic framework for constructing and interpreting such conflicts, along with a novel evaluation method and dataset tailored to code conflict scenarios. Our experiments indicate that sufficiently large LLMs encode the notion of a knowledge conflict in their parameters, enabling us to detect knowledge conflicts with up to 80.65% accuracy. Building on these insights, we show that activation-level steering can achieve up to a 12.6% improvement in steering success over a random baseline. However, effectiveness depends critically on balancing model size, task domain, and steering direction. The experiment code and data will be made publicly available after acceptance.
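Activation-level steering, as named in the abstract, can be sketched as adding a fixed direction to one layer's hidden activations at inference time via a forward hook. The stand-in linear layer, the random steering vector, and the coefficient are assumptions; in practice the direction would be derived from conflict-aware versus conflict-unaware activations.

```python
import torch
import torch.nn as nn

hidden = 64
layer = nn.Linear(hidden, hidden)        # stand-in for one transformer sublayer

steer_direction = torch.randn(hidden)
steer_direction = steer_direction / steer_direction.norm()
alpha = 4.0                              # steering strength

def steer(module, inputs, output):
    # Shift the activations along the chosen direction.
    return output + alpha * steer_direction

handle = layer.register_forward_hook(steer)
steered = layer(torch.randn(2, hidden))  # forward pass with steering applied
handle.remove()                          # steering is easily switched off again
```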
[22] A Graph Signal Processing Framework for Hallucination Detection in Large Language Models
Valentin Noël
Main category: cs.CL
TL;DR: Proposes a spectral analysis framework using graph signal processing on transformer attention graphs to detect hallucinations in LLMs, achieving 88.75% accuracy vs 75% for perplexity baselines.
Details
Motivation: Distinguishing factual reasoning from hallucinations in large language models remains challenging despite their impressive results.
Method: Models transformer layers as dynamic graphs induced by attention, with token embeddings as signals, and applies graph signal processing diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratios.
Result: Experiments reveal universal spectral patterns: factual statements show consistent "energy mountain" behavior with low-frequency convergence, while different hallucination types have distinct signatures. A spectral signature detector achieves 88.75% accuracy.
Conclusion: Spectral geometry captures reasoning patterns and error behaviors, offering a promising framework for hallucination detection in large language models.
Abstract: Large language models achieve impressive results but distinguishing factual reasoning from hallucinations remains challenging. We propose a spectral analysis framework that models transformer layers as dynamic graphs induced by attention, with token embeddings as signals on these graphs. Through graph signal processing, we define diagnostics including Dirichlet energy, spectral entropy, and high-frequency energy ratios, with theoretical connections to computational stability. Experiments across GPT architectures suggest universal spectral patterns: factual statements exhibit consistent “energy mountain” behavior with low-frequency convergence, while different hallucination types show distinct signatures. Logical contradictions destabilize spectra with large effect sizes ($g>1.0$), semantic errors remain stable but show connectivity drift, and substitution hallucinations display intermediate perturbations. A simple detector using spectral signatures achieves 88.75% accuracy versus 75% for perplexity-based baselines, demonstrating practical utility. These findings indicate that spectral geometry may capture reasoning patterns and error behaviors, potentially offering a framework for hallucination detection in large language models.
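A minimal sketch of the diagnostics named above, computed for a single layer: treat the (symmetrized) attention matrix as a weighted adjacency, the token embeddings as signals on that graph, and read off Dirichlet energy and spectral entropy from the graph Laplacian. Shapes and the random inputs are placeholders for a real model's attention and embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 8, 16
attn = rng.random((n_tokens, n_tokens))     # stand-in for one head's attention weights
X = rng.normal(size=(n_tokens, d_model))    # stand-in token embeddings (graph signals)

A = 0.5 * (attn + attn.T)                   # symmetrize attention into an adjacency matrix
L = np.diag(A.sum(axis=1)) - A              # combinatorial graph Laplacian

# Dirichlet energy: how smoothly the embedding signal varies over the attention graph.
dirichlet_energy = float(np.trace(X.T @ L @ X))

# Spectral entropy of the normalized Laplacian eigenvalue distribution.
eigvals = np.linalg.eigvalsh(L)
p = eigvals / eigvals.sum()
p = p[p > 1e-12]
spectral_entropy = float(-(p * np.log(p)).sum())

print(f"Dirichlet energy = {dirichlet_energy:.2f}, spectral entropy = {spectral_entropy:.2f}")
```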
[23] Training-Free Spectral Fingerprints of Voice Processing in Transformers
Valentin Noël
Main category: cs.CL
TL;DR: Transformer architectures develop unique “computational fingerprints” detectable through spectral analysis of attention patterns, revealing how training emphasis creates measurable connectivity patterns during linguistic computations.
Details
Motivation: To understand how different transformer architectures implement identical linguistic computations through distinct connectivity patterns, and to develop a method for detecting these architectural signatures.
Method: Using graph signal processing on attention-induced token graphs to track changes in algebraic connectivity (Fiedler value) under voice alternation across 20 languages and three model families, focusing on early layers (2-5).
Result: Clear architectural signatures emerged: Phi-3-Mini showed dramatic English-specific early layer disruption, Qwen2.5-7B displayed small distributed shifts largest for morphologically rich languages, and LLaMA-3.2-1B exhibited systematic but muted responses. These signatures strongly correlate with behavioral differences and are modulated by attention head ablations.
Conclusion: Training emphasis leaves detectable computational imprints as specialized processing strategies that manifest as measurable connectivity patterns. The framework serves as a simple, training-free diagnostic for revealing architectural biases and supporting model reliability analysis.
Abstract: Different transformer architectures implement identical linguistic computations via distinct connectivity patterns, yielding model-imprinted “computational fingerprints” detectable through spectral analysis. Using graph signal processing on attention-induced token graphs, we track changes in algebraic connectivity (Fiedler value, $\Delta\lambda_2$) under voice alternation across 20 languages and three model families, with a prespecified early window (layers 2–5). Our analysis uncovers clear architectural signatures: Phi-3-Mini shows a dramatic English-specific early-layer disruption ($\overline{\Delta\lambda_2}_{[2,5]} \approx -0.446$) while effects in 19 other languages are minimal, consistent with public documentation that positions the model primarily for English use. Qwen2.5-7B displays small, distributed shifts that are largest for morphologically rich languages, and LLaMA-3.2-1B exhibits systematic but muted responses. These spectral signatures correlate strongly with behavioral differences (Phi-3: $r=-0.976$) and are modulated by targeted attention head ablations, linking the effect to early attention structure and confirming functional relevance. Taken together, the findings are consistent with the view that training emphasis can leave detectable computational imprints: specialized processing strategies that manifest as measurable connectivity patterns during syntactic transformations. Beyond voice alternation, the framework differentiates reasoning modes, indicating utility as a simple, training-free diagnostic for revealing architectural biases and supporting model reliability analysis.
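The tracked quantity, the change in algebraic connectivity under voice alternation, reduces to comparing the second-smallest Laplacian eigenvalue of two attention-induced token graphs. The random matrices below are stand-ins for attention weights from active- and passive-voice inputs.

```python
import numpy as np

def fiedler_value(attn: np.ndarray) -> float:
    """Second-smallest eigenvalue (lambda_2) of the attention graph's Laplacian."""
    A = 0.5 * (attn + attn.T)           # symmetrize attention into an adjacency matrix
    L = np.diag(A.sum(axis=1)) - A      # graph Laplacian
    return float(np.linalg.eigvalsh(L)[1])

rng = np.random.default_rng(1)
active_attn = rng.random((10, 10))      # placeholder: attention for an active-voice sentence
passive_attn = rng.random((10, 10))     # placeholder: attention for its passive-voice variant
delta_lambda2 = fiedler_value(passive_attn) - fiedler_value(active_attn)
print(f"Delta lambda_2 = {delta_lambda2:.4f}")
```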
[24] Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
Cheng Huang, Nyima Tashi, Fan Gao, Yutong Liu, Jiahao Li, Hao Tian, Siyang Jiang, Thupten Tsering, Ban Ma-bao, Renzeg Duojie, Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Jin Zhang, Xiao Feng, Hao Wang, Jie Tang, Guojie Tang, Xiangxiang Wang, Jia Zhang, Tsengdar Lee, Yongbin Yu
Main category: cs.CL
TL;DR: This paper provides a comprehensive survey of Tibetan AI research, covering data resources, NLP tasks, machine translation, speech recognition, and LLM developments for this low-resource language.
Details
Motivation: Tibetan is a major low-resource language with unique linguistic characteristics that has received limited AI research attention due to lack of accessible data resources, standardized benchmarks, and dedicated tools.Method: The authors systematically categorize existing Tibetan datasets and tools, evaluate methods used across different tasks, and compare performance where possible through a comprehensive survey approach.
Result: The survey identifies persistent bottlenecks including data sparsity, orthographic variation, and lack of unified evaluation metrics, while highlighting the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation.
Conclusion: This survey serves as a foundational reference for future Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.
Abstract: Tibetan, one of the major low-resource languages in Asia, presents unique linguistic and sociocultural characteristics that pose both challenges and opportunities for AI research. Despite increasing interest in developing AI systems for underrepresented languages, Tibetan has received limited attention due to a lack of accessible data resources, standardized benchmarks, and dedicated tools. This paper provides a comprehensive survey of the current state of Tibetan AI research, covering textual and speech data resources, NLP tasks, machine translation, speech recognition, and recent developments in LLMs. We systematically categorize existing datasets and tools, evaluate methods used across different tasks, and compare performance where possible. We also identify persistent bottlenecks such as data sparsity, orthographic variation, and the lack of unified evaluation metrics. Additionally, we discuss the potential of cross-lingual transfer, multi-modal learning, and community-driven resource creation. This survey aims to serve as a foundational reference for future Tibetan AI research and encourages collaborative efforts to build an inclusive and sustainable AI ecosystem for low-resource languages.
[25] “You Are Rejected!”: An Empirical Study of Large Language Models Taking Hiring Evaluations
Dingjie Fu, Dianxing Shi
Main category: cs.CL
TL;DR: LLMs fail to pass professional software engineering hiring evaluations despite their coding capabilities, showing significant inconsistency with company-referenced solutions.
Details
Motivation: To investigate whether LLMs can successfully pass standardized hiring evaluations used by tech companies to identify high-potential software and algorithm engineers.Method: Comprehensive examination of a widely used professional assessment questionnaire using state-of-the-art LLMs to generate responses and evaluate performance against company-referenced solutions.
Result: Significant inconsistency between model-generated answers and company-referenced solutions. All evaluated LLMs failed to pass the hiring evaluation.
Conclusion: Contrary to expectations, LLMs are not ideal engineers and cannot successfully pass professional hiring evaluations used by leading technology companies.
Abstract: With the proliferation of the internet and the rapid advancement of Artificial Intelligence, leading technology companies face an urgent annual demand for a considerable number of software and algorithm engineers. To efficiently and effectively identify high-potential candidates from thousands of applicants, these firms have established a multi-stage selection process, which crucially includes a standardized hiring evaluation designed to assess job-specific competencies. Motivated by the demonstrated prowess of Large Language Models (LLMs) in coding and reasoning tasks, this paper investigates a critical question: Can LLMs successfully pass these hiring evaluations? To this end, we conduct a comprehensive examination of a widely used professional assessment questionnaire. We employ state-of-the-art LLMs to generate responses and subsequently evaluate their performance. Contrary to any prior expectation of LLMs being ideal engineers, our analysis reveals a significant inconsistency between the model-generated answers and the company-referenced solutions. Our empirical findings lead to a striking conclusion: all evaluated LLMs fail to pass the hiring evaluation.
[26] Think Straight, Stop Smart: Structured Reasoning for Efficient Multi-Hop RAG
Jihwan Bang, Juntae Lee, Seunghan Yang, Sungha Choi
Main category: cs.CL
TL;DR: TSSS is an efficient multi-hop RAG framework that uses template-based reasoning with prefix caching and retriever-based termination to reduce token usage and ensure stable reasoning.
Details
Motivation: Existing iterative prompting approaches for multi-hop RAG are inefficient due to regenerating predictable tokens and relying on stochastic stopping, leading to excessive token usage and unstable termination.Method: TSSS introduces (i) template-based reasoning that caches recurring prefixes and anchors sub-queries to the main question, and (ii) a retriever-based terminator that deterministically halts reasoning when sub-queries become repetitive.
Result: On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT approaches.
Conclusion: TSSS enables both faster inference and more reliable answers, making it effective for efficiency-constrained scenarios like on-device inference.
Abstract: Multi-hop retrieval-augmented generation (RAG) is a promising strategy for complex reasoning, yet existing iterative prompting approaches remain inefficient. They often regenerate predictable token sequences at every step and rely on stochastic stopping, leading to excessive token usage and unstable termination. We propose TSSS (Think Straight, Stop Smart), a structured multi-hop RAG framework designed for efficiency. TSSS introduces (i) a template-based reasoning that caches recurring prefixes and anchors sub-queries to the main question, reducing token generation cost while promoting stable reasoning, and (ii) a retriever-based terminator, which deterministically halts reasoning once additional sub-queries collapse into repetition. This separation of structured reasoning and termination control enables both faster inference and more reliable answers. On HotpotQA, 2WikiMultiHop, and MuSiQue, TSSS achieves state-of-the-art accuracy and competitive efficiency among RAG-CoT approaches, highlighting its effectiveness in efficiency-constrained scenarios such as on-device inference.
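The retriever-based terminator can be pictured as a deterministic repetition check over what each sub-query retrieves. The sketch below, using Jaccard overlap of retrieved document ids with an illustrative threshold, is one plausible instantiation of that idea, not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code) of a retriever-based terminator:
# stop iterating once a new sub-query's retrieved set largely repeats an earlier
# step's retrieval, instead of letting the LLM decide stochastically.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 1.0

def should_stop(retrieved_ids_per_step: list[set], threshold: float = 0.8) -> bool:
    """retrieved_ids_per_step[i] holds the doc ids retrieved for sub-query i."""
    if len(retrieved_ids_per_step) < 2:
        return False
    latest = retrieved_ids_per_step[-1]
    return any(jaccard(latest, earlier) >= threshold
               for earlier in retrieved_ids_per_step[:-1])

# Example: the third sub-query retrieves the same documents as the first,
# so the loop halts deterministically.
history = [{1, 2, 3}, {4, 5, 6}, {1, 2, 3}]
print(should_stop(history))  # True
```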
[27] When Facts Change: Probing LLMs on Evolving Knowledge with evolveQA
Nishanth Sridhar Nakshatri, Shamik Roy, Manoj Ghuhan Arivazhagan, Hanhan Zhou, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah
Main category: cs.CL
TL;DR: evolveQA is a benchmark for evaluating LLMs on temporal knowledge conflicts using real-world time-stamped corpora, showing significant performance drops compared to static knowledge questions.
Details
Motivation: Existing benchmarks for temporal knowledge conflicts rely on structured knowledge bases with popular entities and lack dynamic structure to fairly evaluate LLMs with different knowledge cut-off dates.Method: Constructed from 3 real-world time-stamped corpora (AWS updates, Azure changes, WHO disease reports), identifies natural knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates.
Result: Evaluation of 12 LLMs across 3 knowledge probing formats shows significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
Conclusion: evolveQA effectively reveals LLMs’ limitations in handling temporally evolving knowledge and provides a fair evaluation framework for different knowledge cut-off dates.
Abstract: LLMs often fail to handle temporal knowledge conflicts: contradictions arising when facts evolve over time within their training data. Existing studies evaluate this phenomenon through benchmarks built on structured knowledge bases like Wikidata, but they focus on widely-covered, easily-memorized popular entities and lack the dynamic structure needed to fairly evaluate LLMs with different knowledge cut-off dates. We introduce evolveQA, a benchmark specifically designed to evaluate LLMs on temporally evolving knowledge, constructed from 3 real-world, time-stamped corpora: AWS updates, Azure changes, and WHO disease outbreak reports. Our framework identifies naturally occurring knowledge evolution and generates questions with gold answers tailored to different LLM knowledge cut-off dates. Through extensive evaluation of 12 open and closed-source LLMs across 3 knowledge probing formats, we demonstrate significant performance drops of up to 31% on evolveQA compared to static knowledge questions.
[28] Interpretable Question Answering with Knowledge Graphs
Kartikeya Aneja, Manasvi Srivastava, Subhayan Das, Nagender Aneja
Main category: cs.CL
TL;DR: A question answering system that uses knowledge graph retrieval with entity relationship paraphrasing instead of RAG with LLMs, achieving 71.9% accuracy on CRAG benchmark.
Details
Motivation: To create a question answering system that doesn't rely on retrieval augmented generation with large language models, instead using knowledge graph retrieval with small paraphraser models.Method: Two-stage pipeline: 1) Pre-processing documents to generate QA pairs, 2) Converting QAs into knowledge graph for graph-based retrieval using embeddings and fuzzy techniques, then querying, re-ranking, and paraphrasing entity relationships.
Result: Achieved 71.9% accuracy with LLAMA-3.2 and 54.4% with GPT-3.5-Turbo on CRAG benchmark using LLM-as-a-judge evaluation.
Conclusion: The proposed knowledge graph-based approach without RAG with LLMs is effective for question answering, achieving competitive accuracy with smaller models.
Abstract: This paper presents a question answering system that operates exclusively on knowledge graph retrieval, without relying on retrieval augmented generation (RAG) with large language models (LLMs). Instead, a small paraphraser model is used to paraphrase the entity relationship edges retrieved from querying the knowledge graph. The proposed pipeline is divided into two main stages. The first stage involves pre-processing a document to generate sets of question-answer (QA) pairs. The second stage converts these QAs into a knowledge graph from which graph-based retrieval is performed using embeddings and fuzzy techniques. The graph is queried, re-ranked, and paraphrased to generate a final answer. This work includes an evaluation using LLM-as-a-judge on the CRAG benchmark, which resulted in accuracies of 71.9% and 54.4% using LLAMA-3.2 and GPT-3.5-Turbo, respectively.
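To make the "embeddings and fuzzy techniques" retrieval step concrete, here is a hedged sketch that scores knowledge-graph edges with a weighted mix of embedding similarity and fuzzy string matching before paraphrasing the top edge. The stand-in embedder, the toy triples, and the 0.7/0.3 weights are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch (assumptions throughout: toy triples, a stand-in embedder) of
# scoring knowledge-graph edges with a mix of embedding similarity and fuzzy
# string matching, as in the two-signal graph retrieval described above.
from difflib import SequenceMatcher
import numpy as np

def embed(text: str) -> np.ndarray:
    # Deterministic toy embedder; swap in a real sentence-embedding model.
    rng = np.random.default_rng(sum(map(ord, text)))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

def score_edge(query: str, triple: tuple[str, str, str]) -> float:
    edge_text = " ".join(triple)
    emb_sim = float(embed(query) @ embed(edge_text))
    fuzzy = SequenceMatcher(None, query.lower(), edge_text.lower()).ratio()
    return 0.7 * emb_sim + 0.3 * fuzzy        # weights are illustrative only

triples = [("Marie Curie", "won", "Nobel Prize in Physics"),
           ("Marie Curie", "born in", "Warsaw")]
query = "Where was Marie Curie born?"
best = max(triples, key=lambda t: score_edge(query, t))
print(best)   # the top-ranked edge would then be paraphrased into an answer
```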
[29] Multi-Faceted Evaluation of Tool-Augmented Dialogue Systems
Zhaoyi Joey Hou, Tanya Shourya, Yingfan Wang, Shamik Roy, Vinayshekhar Bannihatti Kumar, Rashmi Gangadharaiah
Main category: cs.CL
TL;DR: TRACE benchmark and SCOPE framework introduced for evaluating tool-augmented conversational AI systems, addressing limitations of existing methods in detecting complex multi-turn errors.
Details
Motivation: Existing evaluation methods fail to capture critical errors in multi-turn tool-augmented dialogues, particularly when agents misinterpret tool results but still appear satisfactory to users.Method: TRACE benchmark with systematically synthesized tool-augmented conversations covering diverse error cases, and SCOPE framework that automatically discovers error patterns and evaluation rubrics.
Result: SCOPE significantly outperforms baseline methods, especially on challenging cases where user satisfaction signals are misleading.
Conclusion: The proposed TRACE benchmark and SCOPE framework effectively address evaluation challenges in tool-augmented conversational AI systems by detecting complex multi-turn errors that traditional methods miss.
Abstract: Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents' tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues, such as when agents misinterpret tool results yet still appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases, and SCOPE, an evaluation framework that automatically discovers diverse error patterns and evaluation rubrics in tool-augmented dialogues. Experiments show SCOPE significantly outperforms the baseline, particularly on challenging cases where user satisfaction signals are misleading.
[30] DiSRouter: Distributed Self-Routing for LLM Selections
Hang Zheng, Hongshen Xu, Yongkai Lin, Shuai Fan, Lu Chen, Kai Yu
Main category: cs.CL
TL;DR: DiSRouter is a distributed query routing system where LLM agents independently decide whether to answer or route queries based on their self-awareness, outperforming centralized routing methods.
Details
Motivation: Current centralized routing systems are inflexible and perform poorly because small routers cannot fully understand different LLMs' knowledge boundaries, necessitating a more effective approach to balance performance and costs.Method: A distributed routing paradigm where queries traverse LLM networks with agents making independent routing decisions, enabled by a two-stage Self-Awareness Training pipeline that enhances each LLM’s self-awareness and competence judgment.
Result: DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks.
Conclusion: Leveraging LLMs’ intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
Abstract: The proliferation of Large Language Models (LLMs) has created a diverse ecosystem of models with highly varying performance and costs, necessitating effective query routing to balance performance and expense. Current routing systems often rely on a centralized external router trained on a fixed set of LLMs, making them inflexible and prone to poor performance since the small router cannot fully understand the knowledge boundaries of different LLMs. We introduce DiSRouter (Distributed Self-Router), a novel paradigm that shifts from centralized control to distributed routing. In DiSRouter, a query traverses a network of LLM agents, each independently deciding whether to answer or route to other agents based on its own self-awareness, its ability to judge its competence. This distributed design offers superior flexibility, scalability, and generalizability. To enable this, we propose a two-stage Self-Awareness Training pipeline that enhances each LLM’s self-awareness. Extensive experiments demonstrate that DiSRouter significantly outperforms existing routing methods in utility across various scenarios, effectively distinguishes between easy and hard queries, and shows strong generalization to out-of-domain tasks. Our work validates that leveraging an LLM’s intrinsic self-awareness is more effective than external assessment, paving the way for more modular and efficient multi-agent systems.
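A distributed answer-or-route loop of this kind can be sketched in a few lines. The agents, confidence threshold, and cheapest-first ordering below are hypothetical choices for illustration, not the trained Self-Awareness pipeline described in the paper.

```python
# Minimal sketch (hypothetical agents, not the paper's training pipeline) of
# distributed self-routing: each agent either answers or hands the query on,
# based on its own confidence rather than a central router.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    name: str
    cost: float
    answer_fn: Callable[[str], tuple[str, float]]   # returns (answer, self-confidence)

def dis_route(query: str, agents: list[Agent], threshold: float = 0.7) -> str:
    ordered = sorted(agents, key=lambda a: a.cost)   # try cheaper agents first
    for agent in ordered:
        answer, confidence = agent.answer_fn(query)
        if confidence >= threshold:                  # the agent judges itself competent
            return f"{agent.name}: {answer}"
    answer, _ = ordered[-1].answer_fn(query)         # no one was confident: fall back
    return f"{ordered[-1].name} (fallback): {answer}"

small = Agent("small-llm", cost=1.0, answer_fn=lambda q: ("unsure", 0.4))
large = Agent("large-llm", cost=10.0, answer_fn=lambda q: ("42", 0.9))
print(dis_route("What is 6 x 7?", [small, large]))   # routed to large-llm
```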
[31] Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+
York Hay Ng, Aditya Khan, Xiang Lu, Matteo Salloum, Michael Zhou, Phuong H. Hoang, A. Seza Doğruöz, En-Shiun Annie Lee
Main category: cs.CL
TL;DR: The paper introduces a framework for type-matched language distances to address limitations in existing linguistic knowledge bases, proposing structure-aware representations for geography, genealogy, and typology, and unifying them into a composite distance for improved cross-lingual transfer.
Details
Motivation: Existing linguistic knowledge bases like URIEL+ have limitations: their one-size-fits-all vector representations are unsuitable for diverse linguistic data structures, and they lack principled methods for aggregating multiple signals into a comprehensive score.Method: Proposed novel structure-aware representations: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and latent variables model for typology. Unified these signals into a robust, task-agnostic composite distance.
Result: The representations and composite distances consistently improved performance in selecting transfer languages across a wide range of NLP tasks.
Conclusion: The framework provides a more principled and effective toolkit for multilingual research by addressing key limitations in existing linguistic distance measures.
Abstract: Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. One, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data, and two, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. In selecting transfer languages, our representations and composite distances consistently improve performance across a wide range of NLP tasks, providing a more principled and effective toolkit for multilingual research.
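As an illustration of type-matched distances, the sketch below combines a hyperbolic (Poincaré-ball) genealogy distance with precomputed geographic and typological distances into one composite score. The embeddings, the squashing normalization, and the equal weights are made-up stand-ins for the paper's learned representations and aggregation.

```python
# Minimal sketch (illustrative numbers, not URIEL+ data) of combining
# type-matched distance signals into one composite cross-lingual distance.
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Hyperbolic distance between two points inside the unit (Poincare) ball."""
    num = 2 * np.sum((u - v) ** 2)
    den = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return float(np.arccosh(1 + num / den))

def composite_distance(geo: float, gen: float, typ: float,
                       weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    return float(np.dot(weights, [geo, gen, typ]))

# Toy genealogy embeddings inside the unit ball for two languages.
u, v = np.array([0.1, 0.2]), np.array([0.4, -0.1])
gen = poincare_distance(u, v)
gen_normalized = gen / (1 + gen)                 # illustrative squash into [0, 1)
print(composite_distance(geo=0.35, gen=gen_normalized, typ=0.5))
```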
[32] SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets
Ziwei Wang, Jiayuan Su, Mengyu Zhou, Huaxing Zeng, Mengni Jia, Xiao Lv, Haoyu Dong, Xiaojun Ma, Shi Han, Dongmei Zhang
Main category: cs.CL
TL;DR: SheetBrain is a neuro-symbolic dual workflow agent framework that improves LLM performance on complex spreadsheet reasoning tasks through understanding, execution, and validation modules, achieving significant accuracy improvements on tabular QA and manipulation benchmarks.
Details
Motivation: LLMs struggle with accurately capturing complex table structures and ensuring reasoning correctness for spreadsheet tasks, requiring a more robust framework for tabular data reasoning.Method: Proposes SheetBrain with three core modules: understanding module for spreadsheet overview and problem insight, execution module with Python sandbox and Excel toolkit for multi-turn reasoning, and validation module for correctness verification and re-execution.
Result: SheetBrain significantly improves accuracy on existing benchmarks and the new SheetBench benchmark targeting large, multi-table, structurally complex spreadsheets.
Conclusion: The neuro-symbolic dual workflow approach effectively addresses LLM limitations in spreadsheet reasoning, demonstrating superior performance on complex tabular data tasks.
Abstract: Understanding and reasoning over complex spreadsheets remain fundamental challenges for large language models (LLMs), which often struggle with accurately capturing the complex structure of tables and ensuring reasoning correctness. In this work, we propose SheetBrain, a neuro-symbolic dual workflow agent framework designed for accurate reasoning over tabular data, supporting both spreadsheet question answering and manipulation tasks. SheetBrain comprises three core modules: an understanding module, which produces a comprehensive overview of the spreadsheet - including sheet summary and query-based problem insight to guide reasoning; an execution module, which integrates a Python sandbox with preloaded table-processing libraries and an Excel helper toolkit for effective multi-turn reasoning; and a validation module, which verifies the correctness of reasoning and answers, triggering re-execution when necessary. We evaluate SheetBrain on multiple public tabular QA and manipulation benchmarks, and introduce SheetBench, a new benchmark targeting large, multi-table, and structurally complex spreadsheets. Experimental results show that SheetBrain significantly improves accuracy on both existing benchmarks and the more challenging scenarios presented in SheetBench. Our code is publicly available at https://github.com/microsoft/SheetBrain.
[33] Difficulty-Controllable Multiple-Choice Question Generation Using Large Language Models and Direct Preference Optimization
Yuto Tomikawa, Masaki Uto
Main category: cs.CL
TL;DR: Proposes a difficulty-controllable multiple-choice question generation method using LLM with direct preference optimization for better difficulty control accuracy.
Details
Motivation: Existing neural question generation methods cannot directly generate multiple-choice questions and lack explicit training for optimizing difficulty control accuracy.Method: Leverages large language model trained using direct preference optimization technique to improve difficulty control accuracy in multiple-choice question generation.
Result: Not specified in abstract.
Conclusion: The proposed method addresses limitations of conventional approaches by enabling direct multiple-choice question generation and improving difficulty controllability.
Abstract: Difficulty-controllable question generation for reading comprehension has gained significant attention in the field of education as a fundamental tool for adaptive learning support. Although several neural question generation methods have recently succeeded in controlling difficulty, conventional approaches still face two major limitations. First, they cannot directly generate multiple-choice questions, which are the most widely used question type in educational contexts. Second, they are not explicitly trained to optimize the accuracy of difficulty control, leaving room for further improvement in difficulty controllability. To address these limitations, this study proposes a novel difficulty-controllable multiple-choice question generation method for reading comprehension which leverages a large language model trained using a direct preference optimization technique to improve the accuracy of difficulty control.
[34] TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue
Main category: cs.CL
TL;DR: TheMCPCompany is a benchmark for evaluating tool-calling agents using MCP servers with over 18,000 tools from real-world services. Experiments show advanced models like GPT-5 perform well with tool retrieval, but current models struggle with complex enterprise environments requiring navigation of thousands of tools.
Details
Motivation: Current general-purpose agents mainly use web browsers for environment interaction, but MCP enables easier development of task-specific tools. There's a need to evaluate tool-calling agents on real-world services and understand their practical limitations.Method: Created TheMCPCompany benchmark using REST APIs of real-world services to build MCP servers with 18,000+ tools. Provided manually annotated ground-truth tools and evaluated agents with both ground-truth tools and tool retrieval approaches.
Result: Models with tool retrieval perform similarly or better than browser-based agents. Smaller models can’t fully utilize available tools through retrieval, while GPT-5’s performance with retrieval is close to ground-truth tools. Advanced models struggle with complex enterprise environments requiring navigation of thousands of tools.
Conclusion: Current models face challenges in navigating tens of thousands of tools and combining them for complex problems. Better reasoning and retrieval models are needed for effective tool-based agents in enterprise environments.
Abstract: Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5’s performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
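Tool retrieval over a catalogue of thousands of tools can be approximated, for intuition, by ranking tool descriptions against the task text and keeping the top-k. The toy lexical scorer and three-tool catalogue below are only stand-ins for the embedding-based retrievers such a benchmark would actually exercise.

```python
# Minimal sketch (toy catalogue, simple lexical scoring) of the tool-retrieval
# setting described above: select the top-k most relevant tools from a large
# catalogue before handing them to a tool-calling agent.
def lexical_score(query: str, description: str) -> float:
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0

def retrieve_tools(query: str, catalogue: dict[str, str], k: int = 2) -> list[str]:
    ranked = sorted(catalogue,
                    key=lambda name: lexical_score(query, catalogue[name]),
                    reverse=True)
    return ranked[:k]

catalogue = {
    "create_vm": "create a new virtual machine in the cloud subscription",
    "list_buckets": "list all object storage buckets",
    "rotate_key": "rotate an access key for a storage account",
}
print(retrieve_tools("create a virtual machine", catalogue))
```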
[35] JointCQ: Improving Factual Hallucination Detection with Joint Claim and Query Generation
Fan Xu, Huixuan Zhang, Zhenliang Zhang, Jiahao Wang, Xiaojun Wan
Main category: cs.CL
TL;DR: JointCQ is a joint claim-and-query generation framework that improves hallucination detection in LLMs by addressing limitations in claim extraction and query generation stages.
Details
Motivation: Current hallucination detection methods suffer from context loss during claim extraction and low specificity in query generation, leading to degraded performance in detecting unreliable content generated by LLMs.Method: The framework uses evaluation criteria to filter synthesized training data and finetunes a language model for joint claim extraction and query generation, providing reliable inputs for downstream search and verification.
Result: Experimental results show that JointCQ outperforms previous methods on multiple open-domain QA hallucination detection benchmarks.
Conclusion: The method advances the goal of more trustworthy and transparent language model systems by improving hallucination detection capabilities.
Abstract: Current large language models (LLMs) often suffer from hallucination issues, i.e., generating content that appears factual but is actually unreliable. A typical hallucination detection pipeline involves response decomposition (i.e., claim extraction), query generation, evidence collection (i.e., search or retrieval), and claim verification. However, existing methods exhibit limitations in the first two stages, such as context loss during claim extraction and low specificity in query generation, resulting in degraded performance across the hallucination detection pipeline. In this work, we introduce JointCQ (https://github.com/pku0xff/JointCQ), a joint claim-and-query generation framework designed to construct an effective and efficient claim-query generator. Our framework leverages elaborately designed evaluation criteria to filter synthesized training data, and finetunes a language model for joint claim extraction and query generation, providing reliable and informative inputs for downstream search and verification. Experimental results demonstrate that our method outperforms previous methods on multiple open-domain QA hallucination detection benchmarks, advancing the goal of more trustworthy and transparent language model systems.
[36] KORE: Enhancing Knowledge Injection for Large Multimodal Models via Knowledge-Oriented Augmentations and Constraints
Kailin Jiang, Hongbo Jiang, Ning Jiang, Zhi Gao, Jinhe Bi, Yuchen Ren, Bin Li, Yuntao Du, Lei Liu, Qing Li
Main category: cs.CL
TL;DR: KORE is a method for injecting new knowledge into large multimodal models while preserving existing knowledge, using knowledge-oriented augmentations and constraints to prevent catastrophic forgetting.
Details
Motivation: Large multimodal models have static knowledge that can't keep up with real-world developments, and existing methods struggle with both learning new knowledge and preserving old knowledge.Method: KORE converts knowledge items into structured knowledge for accurate learning, and stores previous knowledge in covariance matrices to define fine-tuning directions that minimize interference with existing knowledge.
Result: Extensive experiments on LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B show superior new knowledge injection performance and effective mitigation of catastrophic forgetting.
Conclusion: KORE successfully addresses the dual challenge of knowledge adaptation and retention in large multimodal models through synergistic augmentations and constraints.
Abstract: Large Multimodal Models encode extensive factual knowledge in their pre-trained weights. However, their knowledge remains static and limited, unable to keep pace with real-world developments, which hinders continuous knowledge acquisition. Effective knowledge injection thus becomes critical, involving two goals: knowledge adaptation (injecting new knowledge) and knowledge retention (preserving old knowledge). Existing methods often struggle to learn new knowledge and suffer from catastrophic forgetting. To address this, we propose KORE, a synergistic method of KnOwledge-oRientEd augmentations and constraints for injecting new knowledge into large multimodal models while preserving old knowledge. Unlike general text or image data augmentation, KORE automatically converts individual knowledge items into structured and comprehensive knowledge to ensure that the model accurately learns new knowledge, enabling accurate adaptation. Meanwhile, KORE stores previous knowledge in the covariance matrix of the LMM’s linear layer activations and initializes the adapter by projecting the original weights into the matrix’s null space, defining a fine-tuning direction that minimizes interference with previous knowledge, enabling powerful retention. Extensive experiments on various LMMs, including LLaVA-v1.5-7B, LLaVA-v1.5-13B, and Qwen2.5-VL-7B, show that KORE achieves superior new knowledge injection performance and effectively mitigates catastrophic forgetting.
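The retention mechanism described above (constraining updates to the null space of an activation covariance) can be illustrated as follows. This is an assumed reconstruction from the abstract with toy activations and a plain weight update, not KORE's released code or its adapter parameterization.

```python
# Minimal sketch (assumed from the description above, not KORE's code):
# project a weight update into the null space of the covariance of stored
# activations, so the update barely perturbs outputs on previous knowledge.
import numpy as np

def null_space_projector(H: np.ndarray, tol: float = 1e-6) -> np.ndarray:
    """H: (num_samples, d) activations of a linear layer on old knowledge."""
    C = H.T @ H / len(H)                      # activation covariance (d x d)
    evals, evecs = np.linalg.eigh(C)
    null_basis = evecs[:, evals < tol]        # directions old inputs never excite
    return null_basis @ null_basis.T          # projector onto that null space

rng = np.random.default_rng(0)
H = rng.standard_normal((200, 8)) @ np.diag([1, 1, 1, 1, 0, 0, 0, 0])  # rank-4 usage
P = null_space_projector(H)
delta_W = rng.standard_normal((8, 8))         # raw fine-tuning update for W (y = x W)
delta_W_safe = P @ delta_W                    # constrained update
print(np.abs(H @ delta_W_safe).max())         # ~0: old activations are unaffected
```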
[37] HAD: HAllucination Detection Language Models Based on a Comprehensive Hallucination Taxonomy
Fan Xu, Xinyu Hu, Zhenghan Yu, Li Lin, Xu Zhang, Yang Zhang, Wei Zhou, Jinjie Gu, Xiaojun Wan
Main category: cs.CL
TL;DR: Introduces HAD models for hallucination detection in NLG, featuring a comprehensive taxonomy, span-level identification/correction, and achieving SOTA results on multiple benchmarks.
Details
Motivation: Address growing concerns about hallucination in large language models by developing robust detection methods to improve output reliability and accuracy.Method: Propose HAD models that integrate hallucination detection, span-level identification, and correction in single inference; trained on 90K synthetic samples with 11-category taxonomy.
Result: HAD models generally outperform existing baselines on in-domain and out-of-domain test sets, achieving state-of-the-art results on HaluEval, FactCHD, and FaithBench; a newly annotated 2,248-sample test set (HADTest) is also released.
Conclusion: HAD models demonstrate robustness and versatility across various NLG tasks, providing effective hallucination detection and correction capabilities.
Abstract: The increasing reliance on natural language generation (NLG) models, particularly large language models, has raised concerns about the reliability and accuracy of their outputs. A key challenge is hallucination, where models produce plausible but incorrect information. As a result, hallucination detection has become a critical task. In this work, we introduce a comprehensive hallucination taxonomy with 11 categories across various NLG tasks and propose the HAllucination Detection (HAD) models (https://github.com/pku0xff/HAD), which integrate hallucination detection, span-level identification, and correction into a single inference process. Trained on an elaborate synthetic dataset of about 90K samples, our HAD models are versatile and can be applied to various NLG tasks. We also carefully annotate a test set for hallucination detection, called HADTest, which contains 2,248 samples. Evaluations on in-domain and out-of-domain test sets show that our HAD models generally outperform the existing baselines, achieving state-of-the-art results on HaluEval, FactCHD, and FaithBench, confirming their robustness and versatility.
[38] Balancing Rewards in Text Summarization: Multi-Objective Reinforcement Learning via HyperVolume Optimization
Junjie Song, Yiwen Liu, Dapeng Li, Yin Sun, Shukun Fu, Siqi Chen, Yuji Cao
Main category: cs.CL
TL;DR: Hypervolume Optimization (HVO) is a novel RL method that dynamically adjusts reward scores using hypervolume method to optimize multiple objectives in text summarization, achieving balanced performance across consistency, coherence, relevance, and fluency.
Details
Motivation: Text summarization requires optimizing multiple objectives simultaneously, but few studies have focused on multi-objective optimization through RL based on LLMs. Existing approaches struggle with balanced performance across different quality dimensions.Method: Proposed Hypervolume Optimization (HVO) strategy that dynamically adjusts scores between groups during RL reward process using hypervolume method, guiding model optimization to progressively approximate pareto front for balanced multi-objective summarization.
Result: HVO outperforms GRPO in overall scores and shows more balanced performance across different dimensions. A 7B foundation model enhanced by HVO performs comparably to GPT-4 in summarization tasks while maintaining shorter generation length.
Conclusion: HVO effectively addresses the multi-objective optimization challenge in text summarization through dynamic reward adjustment, achieving state-of-the-art performance with balanced results across multiple quality dimensions.
Abstract: Text summarization is a crucial task that requires the simultaneous optimization of multiple objectives, including consistency, coherence, relevance, and fluency, which presents considerable challenges. Although large language models (LLMs) have demonstrated remarkable performance, enhanced by reinforcement learning (RL), few studies have focused on optimizing the multi-objective problem of summarization through RL based on LLMs. In this paper, we introduce hypervolume optimization (HVO), a novel optimization strategy that dynamically adjusts the scores between groups during the reward process in RL by using the hypervolume method. This method guides the model’s optimization to progressively approximate the Pareto front, thereby generating balanced summaries across multiple objectives. Experimental results on several representative summarization datasets demonstrate that our method outperforms group relative policy optimization (GRPO) in overall scores and shows more balanced performance across different dimensions. Moreover, a 7B foundation model enhanced by HVO performs comparably to GPT-4 in the summarization task, while maintaining a shorter generation length. Our code is publicly available at https://github.com/ai4business-LiAuto/HVO.git
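For the two-objective case, the hypervolume that HVO relies on is simply the area dominated by a group's reward vectors above a reference point. The sketch below computes it for two illustrative candidate groups; the reference point and reward values are made up, and the full method works over more objectives and within the RL reward loop.

```python
# Minimal sketch (two objectives only, illustrative reference point) of the
# hypervolume quantity used above to score a group of candidate summaries
# against multiple reward dimensions at once.
def hypervolume_2d(points: list[tuple[float, float]], ref=(0.0, 0.0)) -> float:
    """Area dominated by `points` (higher is better on both axes) above `ref`."""
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                        # only non-dominated points add area
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return hv

# Two candidate groups of (consistency, fluency) rewards; the more balanced
# group dominates a larger region and would receive the higher group score.
balanced = [(0.8, 0.7), (0.6, 0.9)]
lopsided = [(0.95, 0.2), (0.9, 0.3)]
print(hypervolume_2d(balanced), hypervolume_2d(lopsided))  # 0.68 vs 0.28
```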
[39] Slot Filling as a Reasoning Task for SpeechLLMs
Kadri Hacioglu, Manjunath K E, Andreas Stolcke
Main category: cs.CL
TL;DR: Integration of chain-of-thought reasoning into speech large language models improves slot-filling performance, with hybrid models combining direct and reasoning modes outperforming single-mode approaches.
Details
Motivation: To enhance speech large language models (speechLLMs) by incorporating reasoning capabilities inspired by recent developments in reasoning LLMs, specifically for the end-to-end slot-filling task.Method: Used chain-of-thought framework to decompose slot-filling into multiple reasoning steps, created a reasoning dataset, applied supervised fine-tuning to speechLLMs, experimented with different LLM types/sizes as text foundation models, and developed hybrid models preserving both direct and reasoning modes.
Result: Performance improvements achieved by introducing reasoning steps, but reasoning textual LLMs developed for math/logic/coding domains were inferior as foundation models. Hybrid speechLLMs built on hybrid text foundation LLMs and fine-tuned for both modes outperformed single-mode models.
Conclusion: Reasoning integration enhances speechLLM performance for slot-filling, with hybrid models combining direct and reasoning modes being most effective, while domain-specific reasoning LLMs may not be optimal foundation models for speech tasks.
Abstract: We propose integration of reasoning into speech large language models (speechLLMs) for the end-to-end slot-filling task. Inspired by the recent development of reasoning LLMs, we use a chain-of-thought framework to decompose the slot-filling task into multiple reasoning steps, create a reasoning dataset and apply the supervised fine-tuning strategy to a speechLLM. We distinguish between regular and reasoning speechLLMs and experiment with different types and sizes of LLMs as their text foundation models. We demonstrate performance improvements by introducing reasoning (intermediate) steps. However, we show that a reasoning textual LLM developed mainly for math, logic and coding domains might be inferior as a foundation model for a reasoning speechLLM. We further show that hybrid speechLLMs, built on a hybrid text foundation LLM and fine-tuned to preserve both direct and reasoning modes of operation, have better performance than those fine-tuned employing only one mode of operation.
[40] Algorithmic Fairness in NLP: Persona-Infused LLMs for Human-Centric Hate Speech Detection
Ewelina Gajewska, Arda Derbent, Jaroslaw A Chudziak, Katarzyna Budzynska
Main category: cs.CL
TL;DR: This paper investigates how personalizing LLMs with annotator personas affects their sensitivity to hate speech detection, particularly regarding identity-based biases between annotators and targets.
Details
Motivation: To understand how incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection, bridging psychological insights on group identity with NLP techniques.Method: Used Google’s Gemini and OpenAI’s GPT-4.1-mini models with two persona-prompting methods: shallow persona prompting and deep contextualized persona development using Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles.
Result: Analysis showed the impact of using in-group and out-group annotator personas on models’ detection performance and fairness across diverse social groups, highlighting both potential and limitations of persona-based approaches.
Conclusion: Persona-based approaches can help reduce bias in hate speech detection but have limitations, offering valuable insights for developing more equitable detection systems.
Abstract: In this paper, we investigate how personalising Large Language Models (Persona-LLMs) with annotator personas affects their sensitivity to hate speech, particularly regarding biases linked to shared or differing identities between annotators and targets. To this end, we employ Google’s Gemini and OpenAI’s GPT-4.1-mini models and two persona-prompting methods: shallow persona prompting and a deeply contextualised persona development based on Retrieval-Augmented Generation (RAG) to incorporate richer persona profiles. We analyse the impact of using in-group and out-group annotator personas on the models’ detection performance and fairness across diverse social groups. This work bridges psychological insights on group identity with advanced NLP techniques, demonstrating that incorporating socio-demographic attributes into LLMs can address bias in automated hate speech detection. Our results highlight both the potential and limitations of persona-based approaches in reducing bias, offering valuable insights for developing more equitable hate speech detection systems.
[41] Local Obfuscation by GLINER for Impartial Context Aware Lineage: Development and evaluation of PII Removal system
Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
Main category: cs.CL
TL;DR: LOGICAL is an efficient, locally deployable PII removal system using fine-tuned GLiNER model that outperforms commercial solutions and large language models for clinical note de-identification.
Details
Motivation: Need for efficient PII removal from clinical EHRs for research and AI development, addressing limitations of LLMs (high computational costs, data privacy risks) especially in low-resource settings.Method: Fine-tuned GLiNER model on 2849 clinical text instances, defined 9 PII categories, compared against Azure NER, Presidio, Gemini-Pro-2.5, and Llama-3.3-70B-Instruct using character-level precision, recall, and F1-score.
Result: Fine-tuned GLiNER achieved micro-average F1-score of 0.980, outperforming Gemini-Pro-2.5 (0.845), correctly sanitized 95% of documents completely vs 64% for next-best solution, operates efficiently on standard laptop without GPU.
Conclusion: Fine-tuned specialized transformer models like GLiNER offer accurate, computationally efficient, and secure PII removal solution, enabling “sanitisation at the source” as practical alternative to resource-intensive LLMs.
Abstract: Removing Personally Identifiable Information (PII) from clinical notes in Electronic Health Records (EHRs) is essential for research and AI development. While Large Language Models (LLMs) are powerful, their high computational costs and the data privacy risks of API-based services limit their use, especially in low-resource settings. To address this, we developed LOGICAL (Local Obfuscation by GLINER for Impartial Context-Aware Lineage), an efficient, locally deployable PII removal system built on a fine-tuned Generalist and Lightweight Named Entity Recognition (GLiNER) model. We used 1515 clinical documents from a psychiatric hospital’s EHR system. We defined nine PII categories for removal. A modern-gliner-bi-large-v1.0 model was fine-tuned on 2849 text instances and evaluated on a test set of 376 instances using character-level precision, recall, and F1-score. We compared its performance against Microsoft Azure NER, Microsoft Presidio, and zero-shot prompting with Gemini-Pro-2.5 and Llama-3.3-70B-Instruct. The fine-tuned GLiNER model achieved superior performance, with an overall micro-average F1-score of 0.980, significantly outperforming Gemini-Pro-2.5 (F1-score: 0.845). LOGICAL correctly sanitised 95% of documents completely, compared to 64% for the next-best solution. The model operated efficiently on a standard laptop without a dedicated GPU. However, a 2% entity-level false negative rate underscores the need for human-in-the-loop validation across all tested systems. Fine-tuned, specialised transformer models like GLiNER offer an accurate, computationally efficient, and secure solution for PII removal from clinical notes. This “sanitisation at the source” approach is a practical alternative to resource-intensive LLMs, enabling the creation of de-identified datasets for research and AI development while preserving data privacy, particularly in resource-constrained environments.
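The character-level precision, recall, and F1 used in this evaluation can be computed by expanding each PII span into a set of character offsets and comparing predictions against the gold annotation. The spans in the example below are purely illustrative.

```python
# Minimal sketch (illustrative spans) of character-level precision/recall/F1
# for PII removal: each span is expanded to character offsets and the
# predicted set is compared against the gold set.
def char_set(spans: list[tuple[int, int]]) -> set[int]:
    return {i for start, end in spans for i in range(start, end)}

def char_prf(gold: list[tuple[int, int]], pred: list[tuple[int, int]]):
    g, p = char_set(gold), char_set(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Gold PII at characters [5, 15); the system flagged [5, 12) plus a spurious [20, 25).
print(char_prf(gold=[(5, 15)], pred=[(5, 12), (20, 25)]))
```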
[42] Modeling Turn-Taking with Semantically Informed Gestures
Varsha Suresh, M. Hamza Mughal, Christian Theobalt, Vera Demberg
Main category: cs.CL
TL;DR: The paper introduces DnD Gesture++, an extended multimodal corpus with semantic gesture annotations, and shows that incorporating semantically guided gestures improves turn-taking prediction in conversations.
Details
Motivation: To study how multimodal cues, especially gestures, complement linguistic and acoustic features in managing conversational turn-taking.Method: Extend the DnD Gesture corpus with 2,663 semantic gesture annotations (iconic, metaphoric, deictic, discourse types) and use a Mixture-of-Experts framework integrating text, audio, and gestures for turn-taking prediction.
Result: Incorporating semantically guided gestures yields consistent performance gains over baselines in turn-taking prediction.
Conclusion: Gestures play a complementary role in multimodal turn-taking, enhancing prediction performance when integrated with text and audio features.
Abstract: In conversation, humans use multimodal cues, such as speech, gestures, and gaze, to manage turn-taking. While linguistic and acoustic features are informative, gestures provide complementary cues for modeling these transitions. To study this, we introduce DnD Gesture++, an extension of the multi-party DnD Gesture corpus enriched with 2,663 semantic gesture annotations spanning iconic, metaphoric, deictic, and discourse types. Using this dataset, we model turn-taking prediction through a Mixture-of-Experts framework integrating text, audio, and gestures. Experiments show that incorporating semantically guided gestures yields consistent performance gains over baselines, demonstrating their complementary role in multimodal turn-taking.
[43] M3-SLU: Evaluating Speaker-Attributed Reasoning in Multimodal Large Language Models
Yejin Kwon, Taewoo Kang, Hyunsoo Yoon, Changouk Kim
Main category: cs.CL
TL;DR: M3-SLU is a new multimodal benchmark for evaluating speaker-attributed reasoning in multi-speaker conversations, revealing that current models struggle with identifying who said what despite understanding what was said.
Details
Motivation: Current multimodal models show strong performance in speech and text comprehension but struggle with speaker-attributed reasoning - understanding who said what and when in natural conversations.Method: Built M3-SLU benchmark from four open corpora (CHiME-6, MELD, MultiDialog, AMI) with over 12,000 validated instances containing audio, transcripts, and metadata. Includes two tasks: Speaker-Attributed Question Answering and Speaker Attribution via Utterance Matching.
Result: Baseline results show models can capture what was said but often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding.
Conclusion: M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding, addressing the critical limitation of current models in speaker attribution.
Abstract: We present M3-SLU, a new multimodal large language model (MLLM) benchmark for evaluating multi-speaker, multi-turn spoken language understanding. While recent models show strong performance in speech and text comprehension, they still struggle with speaker-attributed reasoning, the ability to understand who said what and when in natural conversations. M3-SLU is built from four open corpora (CHiME-6, MELD, MultiDialog, and AMI) and comprises over 12,000 validated instances with paired audio, transcripts, and metadata. It includes two tasks: (1) Speaker-Attributed Question Answering and (2) Speaker Attribution via Utterance Matching. We provide baseline results for both cascaded pipelines and end-to-end MLLMs, evaluated using an LLM-as-Judge and accuracy metrics. Results show that while models can capture what was said, they often fail to identify who said it, revealing a key gap in speaker-aware dialogue understanding. M3-SLU serves as a challenging benchmark to advance research in speaker-aware multimodal understanding.
[44] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation
Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jiaheng Wei
Main category: cs.CL
TL;DR: AgenticMath is a novel agentic pipeline that generates high-quality mathematical question-answer pairs for LLM fine-tuning, achieving competitive performance with much smaller datasets than traditional methods.
Details
Motivation: Current methods for creating reasoning datasets for LLMs often produce low-quality/incorrect answers and have limited information richness from available data sources.Method: Four-stage pipeline: (1) Seed Question Filter for high-quality questions, (2) Agentic Question Rephrase using multi-agent system for diverse paraphrases, (3) Answer Augment with chain-of-thought reasoning for correctness, (4) Question and Answer Evaluation to retain superior pairs.
Result: Fine-tuning 3B-8B parameter LLMs on AgenticMath datasets (30-60K samples) achieves competitive or superior performance on mathematical reasoning benchmarks compared to baselines trained on much larger datasets (400K-2.3M samples).
Conclusion: Targeted, high-quality data generation is more efficient for improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
Abstract: The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic pipeline for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) a Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step that rewrites answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the most superior pairs. Extensive experiments demonstrate that fine-tuning 3B-8B parameter LLMs on AgenticMath-generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in-domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.
[45] LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts
Siyuan Wang, Gaokai Zhang, Li Lyna Zhang, Ning Shang, Fan Yang, Dongyao Chen, Mao Yang
Main category: cs.CL
TL;DR: LoongRL is a data-driven RL method that uses KeyChain to transform short multi-hop QA into high-difficulty long-context tasks, enabling models to develop advanced reasoning patterns that generalize to much longer contexts than trained on.
Details
Motivation: While RL enhances short-context reasoning, advanced thinking patterns for long-context reasoning remain unexplored, and high-difficulty RL data are scarce.Method: KeyChain synthesizes long-context tasks by inserting UUID chains that hide true questions among distracting documents, requiring step-by-step tracing, question identification, fact retrieval, and reasoning. RL training on this data induces plan-retrieve-reason-recheck patterns.
Result: Models trained at 16K effectively solve 128K tasks without full-length RL costs. Qwen2.5-7B and 14B show +23.5% and +21.1% gains in long-context multi-hop QA. LoongRL-14B scores 74.2, rivaling larger frontier models, passes all 128K needle-in-a-haystack tests, and preserves short-context reasoning.
Conclusion: LoongRL enables efficient development of advanced long-context reasoning capabilities that generalize well beyond training length constraints, achieving competitive performance with much larger models while maintaining short-context reasoning abilities.
Abstract: Reasoning over long contexts is essential for large language models. While reinforcement learning (RL) enhances short-context reasoning by inducing “Aha” moments in chain-of-thought, the advanced thinking patterns required for long-context reasoning remain largely unexplored, and high-difficulty RL data are scarce. In this paper, we introduce LoongRL, a data-driven RL method for advanced long-context reasoning. Central to LoongRL is KeyChain, a synthesis approach that transforms short multi-hop QA into high-difficulty long-context tasks by inserting UUID chains that hide the true question among large collections of distracting documents. Solving these tasks requires the model to trace the correct chain step-by-step, identify the true question, retrieve relevant facts and reason over them to answer correctly. RL training on KeyChain data induces an emergent plan-retrieve-reason-recheck reasoning pattern that generalizes far beyond training length. Models trained at 16K effectively solve 128K tasks without prohibitive full-length RL rollout costs. On Qwen2.5-7B and 14B, LoongRL substantially improves long-context multi-hop QA accuracy by +23.5% and +21.1% absolute gains. The resulting LoongRL-14B reaches a score of 74.2, rivaling much larger frontier models such as o3-mini (74.5) and DeepSeek-R1 (74.9). It also improves long-context retrieval, passes all 128K needle-in-a-haystack stress tests, and preserves short-context reasoning capabilities.
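A KeyChain-style task can be pictured as a UUID chain scattered across distractor documents, where only tracing the chain hop by hop reveals the true question. The sketch below is a simplified, assumed reconstruction of that synthesis idea; the document templates, hop count, and instruction wording are invented for illustration.

```python
# Minimal sketch (simplified, assumed from the description above) of KeyChain-style
# synthesis: a chain of UUIDs is scattered across distractor documents, and the
# model must follow the chain to find the true question before answering it.
import random
import uuid

def build_keychain_task(true_question: str, distractors: list[str], hops: int = 3):
    keys = [str(uuid.uuid4()) for _ in range(hops)]
    docs = list(distractors)
    for i in range(hops - 1):                      # each key points to the next one
        docs.append(f"Note: key {keys[i]} points to key {keys[i + 1]}.")
    docs.append(f"The question for key {keys[-1]} is: {true_question}")
    random.shuffle(docs)
    context = "\n".join(docs)
    instruction = f"Start from key {keys[0]}, trace the chain, and answer its question."
    return instruction, context

instr, ctx = build_keychain_task(
    "Which city hosted the 2012 Olympics?",
    distractors=["Filler document A.", "Filler document B."],
)
print(instr)
```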
[46] The Massive Legal Embedding Benchmark (MLEB)
Umar Butler, Abdur-Rahman Butler, Adrian Lucas Malec
Main category: cs.CL
TL;DR: MLEB is the largest open-source benchmark for legal information retrieval, featuring 10 expert-annotated datasets across multiple jurisdictions, document types, and task types.
Details
Motivation: To address the lack of comprehensive, diverse, and open-source benchmarks for legal information retrieval across different jurisdictions and document types.Method: Constructed 10 expert-annotated datasets spanning multiple jurisdictions (US, UK, EU, Australia, Ireland, Singapore), document types (cases, legislation, regulatory guidance, contracts, literature), and task types (search, zero-shot classification, QA). Seven datasets were newly created to fill domain gaps.
Result: Created the most comprehensive legal information retrieval benchmark to date, with code, results, and data released openly for reproducible evaluations.
Conclusion: MLEB provides a foundation for advancing legal information retrieval research through its scale, diversity, and open accessibility.
Abstract: We present the Massive Legal Embedding Benchmark (MLEB), the largest, most diverse, and most comprehensive open-source benchmark for legal information retrieval to date. MLEB consists of ten expert-annotated datasets spanning multiple jurisdictions (the US, UK, EU, Australia, Ireland, and Singapore), document types (cases, legislation, regulatory guidance, contracts, and literature), and task types (search, zero-shot classification, and question answering). Seven of the datasets in MLEB were newly constructed in order to fill domain and jurisdictional gaps in the open-source legal information retrieval landscape. We document our methodology in building MLEB and creating the new constituent datasets, and release our code, results, and data openly to assist with reproducible evaluations.
[47] MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs
Xinfeng Xia, Jiacheng Liu, Xiaofeng Hou, Peng Tang, Mingxuan Zhang, Wenfeng Wang, Chao Li
Main category: cs.CL
TL;DR: MoE-Prism transforms rigid Mixture-of-Experts models into elastic services by decomposing monolithic experts into fine-grained sub-experts and implementing QoS-aware scheduling, providing 4x more operating points and significant performance improvements.
Details
Motivation: Current MoE models suffer from a 'quality cliff' with only a few coarse-grained operating points, forcing difficult trade-offs between cost and quality and preventing adaptation to diverse Service Level Objectives.Method: Two-phase approach: 1) Offline Refactoring Engine that deconstructs monolithic experts into fine-grained sub-experts using partitioning optimization with metaheuristic-based neuron grouping, 2) Online Scheduling Engine with QoS-aware scheduling policies for throughput maximization and latency-optimized offloading.
Result: Provides over 4 times more distinct, stable operating points than baseline, improves throughput by up to 19.9% under strict latency budget, reduces latency by up to 10.36% under limited resources.
Conclusion: MoE-Prism bridges the model-system gap with critical control knobs, enabling adaptive, efficient, and QoS-aware AI services.
Abstract: Mixture-of-Experts (MoE) models, the state-of-the-art in large-scale AI, achieve high quality by sparsely activating parameters. However, their reliance on routing between a few monolithic experts via a top-k mechanism creates a “quality cliff”, offering only a few coarse-grained operating points. This inflexibility forces a difficult trade-off between cost and quality, preventing adaptation to diverse Service Level Objectives (SLOs) and leading to significant resource over-provisioning. This paper introduces MoE-Prism, a model-system co-design that transforms rigid MoE models into elastic services. Our methodology is divided into two phases. First, an Offline Refactoring Engine systematically deconstructs monolithic experts into fine-grained “sub-experts.” This engine employs a partitioning optimization solver that uses a metaheuristic-based approach to group neurons, preserving functional locality without requiring retraining. Second, an Online Scheduling Engine leverages this new elasticity through QoS-aware scheduling. It implements specialized policies to solve complex system problems, including maximizing throughput in cloud deployments and managing latency-optimized offloading for memory-constrained devices. Our evaluation across three different MoE models shows that MoE-Prism provides over 4 times more distinct, stable operating points than the baseline. This allows an AI service to dynamically improve throughput by up to 19.9% under a strict latency budget or reduce latency by up to 10.36% under limited resources. MoE-Prism provides the critical “control knob” to bridge the model-system gap, enabling the next generation of adaptive, efficient, and QoS-aware AI services.
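The refactoring step is the concrete mechanism here: each monolithic expert's FFN neurons are grouped into sub-experts. The paper uses a metaheuristic partitioning solver that the abstract does not spell out; the sketch below substitutes plain k-means over the expert's input weight rows purely to illustrate the decomposition and the exact-recomposition property, with arbitrary shapes.

```python
# Illustrative only: split one MoE expert's FFN into k sub-experts by grouping
# its hidden neurons. MoE-Prism uses a metaheuristic partitioning solver; plain
# k-means is a stand-in here, and all shapes are arbitrary.
import numpy as np
from sklearn.cluster import KMeans

d_model, d_ff, k = 512, 2048, 4
rng = np.random.default_rng(0)
W_in = rng.standard_normal((d_ff, d_model))   # expert up-projection
W_out = rng.standard_normal((d_model, d_ff))  # expert down-projection

# Group neurons by the similarity of their input weight vectors.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(W_in)

sub_experts = []
for g in range(k):
    idx = np.where(labels == g)[0]
    sub_experts.append({"W_in": W_in[idx], "W_out": W_out[:, idx]})

# Running all sub-experts reproduces the original expert exactly (ReLU FFN);
# dropping some of them yields coarser, cheaper operating points.
x = rng.standard_normal(d_model)
full = W_out @ np.maximum(W_in @ x, 0.0)
recomposed = sum(se["W_out"] @ np.maximum(se["W_in"] @ x, 0.0) for se in sub_experts)
print(np.allclose(full, recomposed))  # True
```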
[48] Sign Language Translation with Sentence Embedding Supervision
Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet
Main category: cs.CL
TL;DR: A novel gloss-free sign language translation approach using sentence embeddings of target sentences instead of manual gloss annotations, achieving state-of-the-art performance on German and American sign language datasets.
Details
Motivation: Gloss annotations for sign language translation are scarce, inconsistent across datasets, and require manual labeling, limiting scalability and performance of SLT systems.
Method: Uses sentence embeddings of target sentences as supervision during training, eliminating need for manual gloss annotations. Evaluated with mono- and multilingual embeddings on German (PHOENIX-2014T) and American (How2Sign) sign language datasets.
Result: Significantly outperforms other gloss-free approaches, setting new state-of-the-art for datasets without glosses and without pretraining on additional SLT datasets. Reduces gap between gloss-free and gloss-dependent systems.
Conclusion: Sentence embeddings effectively replace gloss annotations in SLT, enabling scalable training without manual labeling while achieving competitive performance across multiple sign languages.
Abstract: State-of-the-art sign language translation (SLT) systems facilitate the learning process through gloss annotations, either in an end2end manner or by involving an intermediate step. Unfortunately, gloss labelled sign language data is usually not available at scale and, when available, gloss annotations widely differ from dataset to dataset. We present a novel approach using sentence embeddings of the target sentences at training time that take the role of glosses. The new kind of supervision does not need any manual annotation but it is learned on raw textual data. As our approach easily facilitates multilinguality, we evaluate it on datasets covering German (PHOENIX-2014T) and American (How2Sign) sign languages and experiment with mono- and multilingual sentence embeddings and translation systems. Our approach significantly outperforms other gloss-free approaches, setting the new state-of-the-art for data sets where glosses are not available and when no additional SLT datasets are used for pretraining, diminishing the gap between gloss-free and gloss-dependent systems.
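The abstract specifies what supervises training (sentence embeddings of the target sentence) but not the exact objective. A minimal sketch, assuming a toy video encoder regressed onto frozen target embeddings with a cosine loss; the encoder, dimensions, and loss choice are assumptions rather than the paper's setup.

```python
# Illustrative only: supervise a sign-video encoder with a frozen sentence
# embedding of the target translation instead of gloss labels. The encoder is a
# toy stand-in and the cosine objective is an assumption, not the paper's loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVideoEncoder(nn.Module):
    def __init__(self, feat_dim=1024, emb_dim=768):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 512, batch_first=True)
        self.proj = nn.Linear(512, emb_dim)

    def forward(self, frame_feats):          # (B, T, feat_dim) frame features
        _, h = self.rnn(frame_feats)         # h: (1, B, 512)
        return self.proj(h.squeeze(0))       # (B, emb_dim)

encoder = ToyVideoEncoder()
optimizer = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

frame_feats = torch.randn(8, 64, 1024)       # batch of sign-video frame features
target_emb = torch.randn(8, 768)             # frozen sentence embeddings of target texts

pred = encoder(frame_feats)
loss = 1.0 - F.cosine_similarity(pred, target_emb, dim=-1).mean()
loss.backward()
optimizer.step()
print(float(loss))
```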
[49] ToMMeR – Efficient Entity Mention Detection from Large Language Models
Victor Morand, Nadi Tomeh, Josiane Mothe, Benjamin Piwowarski
Main category: cs.CL
TL;DR: ToMMeR is a lightweight model (<300K parameters) that probes early LLM layers for mention detection, achieving 93% recall zero-shot and near SOTA NER performance when extended with classification heads.
Details
Motivation: Mention detection is foundational for information extraction but is a known performance bottleneck. The paper aims to show that structured entity representations exist in early transformer layers and can be efficiently recovered.
Method: Introduces ToMMeR, a lightweight model that probes mention detection capabilities from early LLM layers across 13 NER benchmarks, using cross-model analysis with diverse architectures (14M-15B parameters).
Result: Achieves 93% recall zero-shot with over 90% precision using an LLM as judge, indicating that spurious predictions are rare despite high recall. Cross-model analysis reveals convergence on similar mention boundaries (DICE >75%). When extended with span classification, achieves near SOTA NER performance (80-87% F1).
Conclusion: Structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters, confirming that mention detection emerges naturally from language modeling.
Abstract: Identifying which text spans refer to entities – mention detection – is both foundational for information extraction and a known performance bottleneck. We introduce ToMMeR, a lightweight model (<300K parameters) probing mention detection capabilities from early LLM layers. Across 13 NER benchmarks, ToMMeR achieves 93% recall zero-shot, with over 90% precision using an LLM as a judge showing that ToMMeR rarely produces spurious predictions despite high recall. Cross-model analysis reveals that diverse architectures (14M-15B parameters) converge on similar mention boundaries (DICE >75%), confirming that mention detection emerges naturally from language modeling. When extended with span classification heads, ToMMeR achieves near SOTA NER performance (80-87% F1 on standard benchmarks). Our work provides evidence that structured entity representations exist in early transformer layers and can be efficiently recovered with minimal parameters.
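ToMMeR is described only as a sub-300K-parameter probe over early-layer hidden states. One common way to realize such a lightweight mention probe is start/end scoring heads over frozen hidden states, sketched below; the parameterization and sizes are assumptions, not the released architecture.

```python
# Illustrative only: a tiny span-scoring probe over frozen hidden states from an
# early transformer layer, in the spirit of ToMMeR. The start/end parameterization
# and sizes are assumptions; the real model may differ.
import torch
import torch.nn as nn

class MentionProbe(nn.Module):
    def __init__(self, hidden_dim=768, probe_dim=128):
        super().__init__()
        self.down = nn.Linear(hidden_dim, probe_dim)   # shared projection
        self.start = nn.Linear(probe_dim, 1)
        self.end = nn.Linear(probe_dim, 1)

    def forward(self, hidden_states):                  # (B, T, hidden_dim), frozen
        h = torch.tanh(self.down(hidden_states))
        s = self.start(h).squeeze(-1)                  # (B, T) start logits
        e = self.end(h).squeeze(-1)                    # (B, T) end logits
        # Score every span (i, j) as start[i] + end[j].
        return s.unsqueeze(2) + e.unsqueeze(1)         # (B, T, T)

probe = MentionProbe()
print(sum(p.numel() for p in probe.parameters()))      # well under 300K parameters
hidden = torch.randn(2, 16, 768)                       # early-layer states (stand-in)
scores = probe(hidden)
mentions = (scores.triu() > 2.0).nonzero()             # thresholded candidate spans
print(scores.shape, mentions.shape)
```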
[50] SONAR-SLT: Multilingual Sign Language Translation via Language-Agnostic Sentence Embedding Supervision
Yasser Hamidullah, Shakib Yazdani, Cennet Oguz, Josef van Genabith, Cristina España-Bonet
Main category: cs.CL
TL;DR: The paper proposes using language-agnostic multimodal embeddings trained on text and speech from multiple languages to supervise sign language translation, enabling direct multilingual translation and addressing data scarcity through coupled augmentation.
Details
Motivation: Traditional sign language translation is limited by single-language text supervision, which restricts scalability and cross-language generalization. Earlier approaches using text-based sentence embeddings remain tied to specific languages and modalities.
Method: Employ language-agnostic multimodal embeddings trained on text and speech from multiple languages for SLT supervision. Propose coupled augmentation combining multilingual target augmentations (translations into many languages) with video-level perturbations to improve model robustness.
Result: Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings.
Conclusion: Language-agnostic embedding supervision combined with coupled augmentation provides a scalable and semantically robust alternative to traditional SLT training.
Abstract: Sign language translation (SLT) is typically trained with text in a single spoken language, which limits scalability and cross-language generalization. Earlier approaches have replaced gloss supervision with text-based sentence embeddings, but up to now, these remain tied to a specific language and modality. In contrast, here we employ language-agnostic, multimodal embeddings trained on text and speech from multiple languages to supervise SLT, enabling direct multilingual translation. To address data scarcity, we propose a coupled augmentation method that combines multilingual target augmentations (i.e. translations into many languages) with video-level perturbations, improving model robustness. Experiments show consistent BLEURT gains over text-only sentence embedding supervision, with larger improvements in low-resource settings. Our results demonstrate that language-agnostic embedding supervision, combined with coupled augmentation, provides a scalable and semantically robust alternative to traditional SLT training.
[51] Spatio-temporal Sign Language Representation and Translation
Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet
Main category: cs.CL
TL;DR: The paper presents an end-to-end sign language translation system that learns spatio-temporal features directly from video frames for Swiss German Sign Language to German text translation, achieving 5 BLEU on development but only 0.11 BLEU on test data.
Details
Motivation: Standard SLT approaches using generic seq2seq architectures with video frame features often fail to leverage temporal information effectively, limiting generalization to new datasets.
Method: Proposed a single model that learns spatio-temporal feature representations and translation simultaneously, creating a true end-to-end architecture for sign language translation.
Result: Best system achieved 5±1 BLEU points on development set but performance dropped significantly to 0.11±0.06 BLEU points on test data, indicating poor generalization.
Conclusion: While the end-to-end approach shows promise for learning spatio-temporal features, the significant performance drop on test data highlights challenges in generalization for sign language translation systems.
Abstract: This paper describes the DFKI-MLT submission to the WMT-SLT 2022 sign language translation (SLT) task from Swiss German Sign Language (video) into German (text). State-of-the-art techniques for SLT use a generic seq2seq architecture with customized input embeddings. Instead of word embeddings as used in textual machine translation, SLT systems use features extracted from video frames. Standard approaches often do not benefit from temporal features. In our participation, we present a system that learns spatio-temporal feature representations and translation in a single model, resulting in a real end-to-end architecture expected to better generalize to new data sets. Our best system achieved 5±1 BLEU points on the development set, but the performance on the test dropped to 0.11±0.06 BLEU points.
[52] BLiSS 1.0: Evaluating Bilingual Learner Competence in Second Language Small Language Models
Yuan Gao, Suchir Salhan, Andrew Caines, Paula Buttery, Weiwei Sun
Main category: cs.CL
TL;DR: BLiSS 1.0 is a benchmark that tests selective tolerance - whether models find naturalistic learner errors more plausible than artificial errors - using 136,867 controlled triplets from 2.8M learner sentences.
Details
Motivation: To bridge the gap between performance-oriented benchmarks and evaluation of cognitively inspired models by testing selective tolerance as a distinct capability from standard grammaticality.
Method: Constructed benchmark from 2.8M naturalistic learner sentences, providing 136,867 controlled triplets (corrected, learner, artificial) to test the selective tolerance paradigm.
Result: Experiments show selective tolerance is distinct from standard grammaticality, with performance clustering strongly by training paradigm across diverse models.
Conclusion: BLiSS validates as a robust tool for measuring how different training objectives impact model alignment with human language acquisition patterns.
Abstract: To bridge the gap between performance-oriented benchmarks and the evaluation of cognitively inspired models, we introduce BLiSS 1.0, a Benchmark of Learner Interlingual Syntactic Structure. Our benchmark operationalizes a new paradigm of selective tolerance, testing whether a model finds a naturalistic learner error more plausible than a matched, artificial error within the same sentence. Constructed from over 2.8 million naturalistic learner sentences, BLiSS provides 136,867 controlled triplets (corrected, learner, artificial) for this purpose. Experiments on a diverse suite of models demonstrate that selective tolerance is a distinct capability from standard grammaticality, with performance clustering strongly by training paradigm. This validates BLiSS as a robust tool for measuring how different training objectives impact a model’s alignment with the systematic patterns of human language acquisition.
[53] MINED: Probing and Updating with Multimodal Time-Sensitive Knowledge for Large Multimodal Models
Kailin Jiang, Ning Jiang, Yuchen Ren, Yuchen Li, Yifan Gao, Jinhe Bi, Yunpu Ma, Qingqing Liu, Xianhao Wang, Yifan Jia, Hongbo Jiang, Yaocong Hu, Bin Li, Lei Liu, Yuntao Du
Main category: cs.CL
TL;DR: MINED is a comprehensive benchmark for evaluating temporal awareness in Large Multimodal Models (LMMs) across 6 dimensions and 11 tasks, revealing that most LMMs struggle with time-sensitive knowledge and showing that knowledge editing methods can effectively update such knowledge.
Details
Motivation: Existing LMM benchmarks are static and inadequate for evaluating time-sensitive knowledge understanding, as LMMs' static representations struggle with maintaining accurate temporal factual knowledge.
Method: Constructed MINED benchmark from Wikipedia with professional annotators, containing 2,104 time-sensitive knowledge samples across six knowledge types. Evaluated 15 LMMs and investigated knowledge editing methods for updating temporal knowledge.
Result: Gemini-2.5-Pro achieved the highest average CEM score (63.07), while most open-source LMMs lack time understanding. LMMs perform best on organization knowledge and worst on sports knowledge. Knowledge editing methods effectively update time-sensitive knowledge in single editing scenarios.
Conclusion: Current LMMs have limited temporal awareness capabilities, but knowledge editing provides a viable approach for updating time-sensitive knowledge in these models.
Abstract: Large Multimodal Models (LMMs) encode rich factual knowledge via cross-modal pre-training, yet their static representations struggle to maintain an accurate understanding of time-sensitive factual knowledge. Existing benchmarks remain constrained by static designs, inadequately evaluating LMMs’ ability to understand time-sensitive knowledge. To address this gap, we propose MINED, a comprehensive benchmark that evaluates temporal awareness along 6 key dimensions and 11 challenging tasks: cognition, awareness, trustworthiness, understanding, reasoning, and robustness. MINED is constructed from Wikipedia by two professional annotators, containing 2,104 time-sensitive knowledge samples spanning six knowledge types. Evaluating 15 widely used LMMs on MINED shows that Gemini-2.5-Pro achieves the highest average CEM score of 63.07, while most open-source LMMs still lack time understanding ability. Meanwhile, LMMs perform best on organization knowledge, whereas their performance is weakest on sport. To address these challenges, we investigate the feasibility of updating time-sensitive knowledge in LMMs through knowledge editing methods and observe that LMMs can effectively update knowledge via knowledge editing methods in single editing scenarios.
[54] VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos
Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, Tao Yu
Main category: cs.CL
TL;DR: VideoAgentTrek automatically mines training data from screen-recorded videos to train computer-use agents, eliminating manual annotation through an inverse dynamics module that extracts GUI actions and parameters.
Details
Motivation: Manual annotation of GUI interaction data for training computer-use agents is prohibitively expensive at scale, creating a need for automated data generation from existing video resources.
Method: Developed Video2Action inverse dynamics module with video grounding for temporal action detection and action-content recognizer for extracting structured parameters like click coordinates and typed text. Applied to 39,000 YouTube tutorial videos.
Result: Generated 1.52 million interaction steps automatically. On OSWorld-Verified, improved task success rates from 9.3% to 15.8% (70% relative improvement). On AgentNetBench, step accuracy increased from 64.1% to 69.3%.
Conclusion: Passive internet videos can be effectively transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation pipelines.
Abstract: Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction steps automatically. We leverage this data through continued pretraining followed by supervised fine-tuning. On OSWorld-Verified, our approach improves task success rates from 9.3% (SFT-only baseline) to 15.8%, a 70% relative improvement. On AgentNetBench, step accuracy increases from 64.1% to 69.3%. Our results demonstrate that passive internet videos can be transformed into high-quality supervision for computer-use agents, providing a scalable alternative to expensive manual annotation.
[55] Machine Text Detectors are Membership Inference Attacks
Ryuto Koike, Liam Dugan, Masahiro Kaneko, Chris Callison-Burch, Naoaki Okazaki
Main category: cs.CL
TL;DR: This paper investigates the transferability between membership inference attacks (MIAs) and machine-generated text detection, showing they share similar methodological foundations and that methods developed for one task can perform well on the other.
Details
Motivation: MIAs and machine text detection have been studied independently despite sharing similar methodological foundations based on language model probability distributions, potentially missing valuable insights and stronger methods from the other field.
Method: The authors theoretically prove that the same metric achieves asymptotically highest performance for both tasks, and conduct large-scale empirical experiments with 7 MIA methods and 5 machine text detectors across 13 domains and 10 generators.
Result: Strong rank correlation (rho > 0.6) in cross-task performance was found, with Binoculars (originally for machine text detection) achieving state-of-the-art performance on MIA benchmarks. The authors also introduce MINT, a unified evaluation suite.
Conclusion: There is significant transferability between MIAs and machine text detection, highlighting the need for greater cross-task awareness and collaboration between research communities.
Abstract: Although membership inference attacks (MIAs) and machine-generated text detection target different goals, identifying training samples and synthetic texts, their methods often exploit similar signals based on a language model’s probability distribution. Despite this shared methodological foundation, the two tasks have been independently studied, which may lead to conclusions that overlook stronger methods and valuable insights developed in the other task. In this work, we theoretically and empirically investigate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. For our theoretical contribution, we prove that the metric that achieves the asymptotically highest performance on both tasks is the same. We unify a large proportion of the existing literature in the context of this optimal metric and hypothesize that the accuracy with which a given method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments, including 7 state-of-the-art MIA methods and 5 state-of-the-art machine text detectors across 13 domains and 10 generators, demonstrate very strong rank correlation (rho > 0.6) in cross-task performance. We notably find that Binoculars, originally designed for machine text detection, achieves state-of-the-art performance on MIA benchmarks as well, demonstrating the practical impact of the transferability. Our findings highlight the need for greater cross-task awareness and collaboration between the two research communities. To facilitate cross-task developments and fair evaluations, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, with implementation of 15 recent methods from both tasks.
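The shared signal both communities exploit is the model's probability of the text. As a rough illustration of that common foundation (deliberately not Binoculars, MINT, or any specific published method), the snippet below thresholds an average token log-likelihood under a small causal LM, a score that can back either a naive membership test or a naive machine-text detector; the model choice and threshold are placeholders.

```python
# Illustrative only: one likelihood-based score that can back either a naive MIA
# or a naive machine-text detector. GPT-2 and the threshold are placeholders;
# this is not Binoculars or any specific published method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_log_likelihood(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss        # mean negative log-likelihood per token
    return -loss.item()

text = "The committee approved the proposal after a lengthy debate."
score = avg_log_likelihood(text)

threshold = -3.5  # placeholder boundary; in practice tuned on held-out data
print(score, "member / machine-like" if score > threshold else "non-member / human-like")
```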
[56] What is the Best Sequence Length for BABYLM?
Suchir Salhan, Richard Diehl Martinez, Zébulon Goriely, Paula Buttery
Main category: cs.CL
TL;DR: The paper examines optimal sequence lengths for BabyLM pretraining, finding that longer sequences are generally better but the optimal length depends on task and architecture.
Details
Motivation: Many BabyLM Challenge submissions use shorter sequence lengths than typical transformer models, so the authors want to determine the optimal sequence length for training Baby LMs.
Method: Used 100M-word training data with fixed compute budgets to compare 125M-parameter Mamba and OPT models across different sequence lengths.
Result: Longer sequences are often better, but optimal length depends on task and architecture. Shorter sequences suffice for grammatical generalization, while longer contexts benefit morphological analogical reasoning.
Conclusion: Sequence length optimization should consider both the specific task requirements and model architecture, rather than using a one-size-fits-all approach.
Abstract: Transformer language models typically operate with a fixed-length context window, which has grown in step with large-scale pretraining datasets. In the BabyLM Challenge, however, many past submissions have defaulted to using much shorter sequence lengths. We examine the impact of sequence length on BabyLM pretraining, to answer the simple question: what sequence length should we be using when training Baby LMs? Using 100M-word training data and fixed compute budgets, we compare 125M-parameter Mamba and OPT models, finding that although longer is often better, the optimal length depends on both task and architecture. Shorter sequences are sufficient for grammatical generalization tasks whereas longer contexts benefit morphological analogical reasoning tasks.
[57] Lookahead Routing for Large Language Models
Canbin Huang, Tianyuan Shi, Yuhua Zhu, Ruijun Chen, Xiaojun Quan
Main category: cs.CL
TL;DR: Lookahead is a routing framework that predicts potential model outputs to guide model selection in multi-LLM systems, outperforming existing methods by 7.7% on average across seven benchmarks.
Details
Motivation: Existing LLM routing approaches treat routing as classification based only on input queries, missing valuable information from potential outputs and failing to capture implicit intent that emerges during response generation, leading to suboptimal routing decisions.
Method: Proposes Lookahead framework that “foresees” potential model outputs by predicting their latent representations to guide model selection without full inference. Implements two approaches based on causal and masked language models.
Result: Empirical evaluations across seven benchmarks (instruction following, mathematical reasoning, code generation) show Lookahead consistently outperforms existing routing baselines, achieving 7.7% average performance gain over state-of-the-art.
Conclusion: Lookahead enables more informed routing decisions by leveraging predicted output representations, demonstrating significant improvements over query-only routing approaches in multi-LLM systems.
Abstract: Large language model (LLM) routers improve the efficiency of multi-model systems by directing each query to the most appropriate model while leveraging the diverse strengths of heterogeneous LLMs. Most existing approaches frame routing as a classification problem based solely on the input query. While this reduces overhead by avoiding inference across all models, it overlooks valuable information that could be gleaned from potential outputs and fails to capture implicit intent or contextual nuances that often emerge only during response generation. These limitations can result in suboptimal routing decisions, particularly for complex or ambiguous queries that require deeper semantic understanding. To address this challenge, we propose Lookahead, a routing framework that “foresees” potential model outputs by predicting their latent representations and uses these predictions to guide model selection, thus enabling more informed routing without full inference. Within this framework, we implement two approaches based on causal and masked language models. Empirical evaluations across seven public benchmarks - spanning instruction following, mathematical reasoning, and code generation - show that Lookahead consistently outperforms existing routing baselines, achieving an average performance gain of 7.7% over the state-of-the-art. Our code is available at https://github.com/huangcb01/lookahead-routing.
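The abstract describes predicting a latent representation of each candidate model's output and routing on those predictions, without giving the architecture. A minimal sketch under assumed choices (a shared query embedding, one small predictor per candidate, a scalar scoring head):

```python
# Illustrative only: a Lookahead-style router that predicts a latent "output
# representation" for each candidate LLM from the query embedding and scores it.
# The architecture, sizes, and scoring head are assumptions.
import torch
import torch.nn as nn

class LookaheadRouter(nn.Module):
    def __init__(self, query_dim=768, latent_dim=256, num_models=3):
        super().__init__()
        # One predictor of the (latent) response per candidate model.
        self.predictors = nn.ModuleList(
            [nn.Sequential(nn.Linear(query_dim, latent_dim), nn.GELU(),
                           nn.Linear(latent_dim, latent_dim))
             for _ in range(num_models)]
        )
        self.scorer = nn.Linear(latent_dim, 1)   # quality score per predicted latent

    def forward(self, query_emb):                # (B, query_dim)
        latents = torch.stack([p(query_emb) for p in self.predictors], dim=1)  # (B, M, latent_dim)
        scores = self.scorer(latents).squeeze(-1)                              # (B, M)
        return scores.argmax(dim=-1), scores

router = LookaheadRouter()
query_emb = torch.randn(4, 768)                  # stand-in query embeddings
choice, scores = router(query_emb)
print(choice.tolist())                           # index of the model chosen for each query
```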
[58] Conditions for Catastrophic Forgetting in Multilingual Translation
Danni Liu, Jan Niehues
Main category: cs.CL
TL;DR: Systematic study reveals that relative scale between model and data size is the primary determinant of catastrophic forgetting in multilingual fine-tuning, with instruction-following ability being more critical than architecture for retaining multilingual knowledge.
Details
Motivation: Fine-tuning multilingual foundation models on specific languages often causes catastrophic forgetting of unseen languages, but the literature presents fragmented results about when this occurs, creating ambiguity.
Method: Conducted controlled experiments using machine translation as a testbed across different model architectures, data scales, and fine-tuning approaches to identify conditions triggering catastrophic forgetting.
Result: Parameter-efficient fine-tuning offers no clear advantage over full fine-tuning in mitigating forgetting, while cross-lingual alignment can mitigate forgetting and facilitate positive transfer to unseen target languages.
Conclusion: The relative scale between model and data size is a primary determinant of catastrophic forgetting, and instruction-following ability is more critical than architecture for retaining multilingual knowledge during fine-tuning.
Abstract: Fine-tuning multilingual foundation models on specific languages often induces catastrophic forgetting, degrading performance on languages unseen in fine-tuning. While this phenomenon is widely-documented, the literature presents fragmented results about when forgetting occurs. To address this ambiguity, we conduct a systematic empirical study using machine translation as a testbed to identify the conditions that trigger catastrophic forgetting in multilingual fine-tuning. Through controlled experiments across different model architectures, data scales, and fine-tuning approaches, we reveal that the relative scale between model and data size is a primary determinant of forgetting. Moreover, we demonstrate that a model’s instruction-following ability is more critical for retaining multilingual knowledge than its architecture. Contrary to assumptions, parameter-efficient fine-tuning offers no clear advantage over full fine-tuning in mitigating forgetting. Lastly, we show that cross-lingual alignment can mitigate forgetting while also facilitating positive transfer to unseen target languages.
[59] Detecting Latin in Historical Books with Large Language Models: A Multimodal Benchmark
Yu Wu, Ke Shu, Jonas Fischer, Lidia Pivovarova, David Rosson, Eetu Mäkelä, Mikko Tolonen
Main category: cs.CL
TL;DR: This paper introduces a new task of extracting Latin fragments from mixed-language historical documents with varied layouts, evaluating large foundation models on a multimodal dataset of 724 annotated pages.
Details
Motivation: To address the challenge of extracting Latin fragments from complex historical documents that contain multiple languages and varied layouts, which is important for historical document analysis and digital humanities research.
Method: Benchmarked and evaluated the performance of large foundation models using a multimodal dataset containing 724 annotated pages of historical documents with mixed languages and varied layouts.
Result: The study demonstrates that reliable Latin detection is achievable with contemporary models, showing promising performance for this task.
Conclusion: This work provides the first comprehensive analysis of large foundation models’ capabilities and limitations for Latin fragment extraction from mixed-language historical documents.
Abstract: This paper presents a novel task of extracting Latin fragments from mixed-language historical documents with varied layouts. We benchmark and evaluate the performance of large foundation models against a multimodal dataset of 724 annotated pages. The results demonstrate that reliable Latin detection with contemporary models is achievable. Our study provides the first comprehensive analysis of these models’ capabilities and limits for this task.
[60] PBBQ: A Persian Bias Benchmark Dataset Curated with Human-AI Collaboration for Large Language Models
Farhan Farsi, Shayan Bali, Fatemeh Valeh, Parsa Ghofrani, Alireza Pakniat, Kian Kashfipour, Amir H. Payberah
Main category: cs.CL
TL;DR: PBBQ is a comprehensive benchmark dataset for evaluating social biases in Persian LLMs, covering 16 cultural categories with over 37,000 questions developed through expert collaboration and human surveys.
Details
Motivation: There's a significant gap in resources addressing social biases within Persian cultural contexts, despite increasing adoption of LLMs and the critical need for alignment with social norms.
Method: Developed the PBBQ benchmark through questionnaires completed by 250 diverse individuals across multiple demographics, in collaboration with social science experts. Benchmarked open-source LLMs, a closed-source model, and Persian-specific fine-tuned models.
Result: Current LLMs exhibit significant social biases across Persian culture. LLMs often replicate human bias patterns, highlighting the complex interplay between learned representations and cultural stereotypes.
Conclusion: The PBBQ dataset provides a foundation for evaluating and mitigating bias in Persian language models, and will be publicly available for future research.
Abstract: With the increasing adoption of large language models (LLMs), ensuring their alignment with social norms has become a critical concern. While prior research has examined bias detection in various languages, there remains a significant gap in resources addressing social biases within Persian cultural contexts. In this work, we introduce PBBQ, a comprehensive benchmark dataset designed to evaluate social biases in Persian LLMs. Our benchmark, which encompasses 16 cultural categories, was developed through questionnaires completed by 250 diverse individuals across multiple demographics, in close collaboration with social science experts to ensure its validity. The resulting PBBQ dataset contains over 37,000 carefully curated questions, providing a foundation for the evaluation and mitigation of bias in Persian language models. We benchmark several open-source LLMs, a closed-source model, and Persian-specific fine-tuned models on PBBQ. Our findings reveal that current LLMs exhibit significant social biases across Persian culture. Additionally, by comparing model outputs to human responses, we observe that LLMs often replicate human bias patterns, highlighting the complex interplay between learned representations and cultural stereotypes. Upon acceptance of the paper, our PBBQ dataset will be publicly available for use in future work. Content warning: This paper contains unsafe content.
[61] CrossNews-UA: A Cross-lingual News Semantic Similarity Benchmark for Ukrainian, Polish, Russian, and English
Daryna Dementieva, Evgeniya Sukhodolskaya, Alexander Fraser
Main category: cs.CL
TL;DR: This paper introduces CrossNews-UA, a scalable crowdsourcing pipeline for cross-lingual news similarity assessment, creating a dataset in Ukrainian with related languages (Polish, Russian, English) and testing various models.
Details
Motivation: Address the limitations of manually curated cross-lingual news datasets by creating a scalable approach for fake news detection across multiple languages, particularly beyond English.
Method: Developed a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment and collected CrossNews-UA dataset with news pairs in Ukrainian, Polish, Russian, and English, annotated using 4W criteria (Who, What, Where, When).
Result: Tested various models including bag-of-words, Transformer-based architectures, and large language models (LLMs), revealing challenges in multilingual news analysis and providing insights into model performance.
Conclusion: The work provides a scalable solution for cross-lingual news comparison and highlights the difficulties in multilingual news analysis while offering performance benchmarks for different model types.
Abstract: In the era of social networks and rapid misinformation spread, news analysis remains a critical task. Detecting fake news across multiple languages, particularly beyond English, poses significant challenges. Cross-lingual news comparison offers a promising approach to verify information by leveraging external sources in different languages (Chen and Shu, 2024). However, existing datasets for cross-lingual news analysis (Chen et al., 2022a) were manually curated by journalists and experts, limiting their scalability and adaptability to new languages. In this work, we address this gap by introducing a scalable, explainable crowdsourcing pipeline for cross-lingual news similarity assessment. Using this pipeline, we collected a novel dataset CrossNews-UA of news pairs in Ukrainian as a central language with linguistically and contextually relevant languages-Polish, Russian, and English. Each news pair is annotated for semantic similarity with detailed justifications based on the 4W criteria (Who, What, Where, When). We further tested a range of models, from traditional bag-of-words, Transformer-based architectures to large language models (LLMs). Our results highlight the challenges in multilingual news analysis and offer insights into models performance.
[62] Style Attack Disguise: When Fonts Become a Camouflage for Adversarial Intent
Yangshijie Zhang, Xinda Wang, Jialin Liu, Wenqiang Wang, Zhicong Ma, Xingxing Jia
Main category: cs.CL
TL;DR: Style Attack Disguise (SAD) exploits the human-model perception gap in processing stylistic fonts and emoji, creating effective adversarial attacks on NLP models while remaining human-readable.
Details
Motivation: Social media users increasingly use stylistic fonts and emoji for personal expression, creating visually appealing text that humans can read but NLP models process as distinct tokens, revealing a vulnerability in model perception.
Method: Proposed Style Attack Disguise (SAD) with two variants: light version for query efficiency and strong version for superior attack performance, using stylistic fonts and font-like emoji to create adversarial examples.
Result: Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services show SAD achieves strong attack performance. Also demonstrates threats to multimodal tasks like text-to-image and text-to-speech generation.
Conclusion: The human-model perception gap in processing stylistic text represents a significant vulnerability in NLP systems, and SAD effectively exploits this gap for adversarial attacks across various model types and tasks.
Abstract: With social media growth, users employ stylistic fonts and font-like emoji to express individuality, creating visually appealing text that remains human-readable. However, these fonts introduce hidden vulnerabilities in NLP models: while humans easily read stylistic text, models process these characters as distinct tokens, causing interference. We identify this human-model perception gap and propose a style-based attack, Style Attack Disguise (SAD). We design two sizes: light for query efficiency and strong for superior attack performance. Experiments on sentiment classification and machine translation across traditional models, LLMs, and commercial services demonstrate SAD’s strong attack performance. We also show SAD’s potential threats to multimodal tasks including text-to-image and text-to-speech generation.
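The vulnerability rests on Unicode "styled" letters that read normally to humans but map to entirely different code points, and therefore different tokens, for a model. The toy function below demonstrates that perception gap with the Mathematical Bold range; it is not SAD's actual character inventory or selection strategy.

```python
# Illustrative only: map ASCII letters to Unicode "Mathematical Bold" letters,
# which read normally to humans but are entirely different characters to a
# tokenizer. A toy demonstration of the perception gap, not the SAD attack itself.
def to_math_bold(text: str) -> str:
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))   # MATHEMATICAL BOLD CAPITAL A..Z
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))   # MATHEMATICAL BOLD SMALL A..Z
        else:
            out.append(ch)
    return "".join(out)

original = "This movie was absolutely wonderful"
styled = to_math_bold(original)
print(styled)                 # looks like the same sentence to a human reader
print(original == styled)     # False: every letter is a different code point
```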
[63] LLavaCode: Compressed Code Representations for Retrieval-Augmented Code Generation
Daria Cherniuk, Nikita Sukhorukov, Nikita Sushko, Daniil Gusak, Danil Sivtsov, Elena Tutubalina, Evgeny Frolov
Main category: cs.CL
TL;DR: LlavaCode is a framework that compresses code context into compact single-token vectors for faster retrieval-augmented generation in code completion, reducing latency while maintaining quality.
Details
Motivation: Retrieval-augmented generation for code completion suffers from slow inference due to extended sequence lengths from context, which is problematic for interactive IDE settings.
Method: Uses a small projector module to compress code into compact, semantically rich single-token vectors that are interpretable by code LLMs, reducing retrieved context to just a few compressed vectors.
Result: Achieves 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines, with significant improvements in EM and ES metrics and negligible latency increase.
Conclusion: Compressed context representation enables faster code completion while maintaining generation quality, making retrieval-augmented generation more practical for interactive development environments.
Abstract: Retrieval-augmented generation has emerged as one of the most effective approaches for code completion, particularly when context from a surrounding repository is essential. However, incorporating context significantly extends sequence length, leading to slower inference - a critical limitation for interactive settings such as IDEs. In this work, we introduce LlavaCode, a framework that compresses code into compact, semantically rich representations interpretable by code LLM, enhancing generation quality while reducing the retrieved context to only a few compressed single-token vectors. Using a small projector module we can significantly increase the EM and ES metrics of coding model with negligible latency increase. Our experiments demonstrate that compressed context enables 20-38% reduction in Time-to-First-Token (TTFT) on line completion tasks compared to full-RAG pipelines.
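The interface the abstract describes is a small projector that turns retrieved code into a handful of single-token vectors consumed by the code LLM. A minimal sketch of that interface, assuming a pooled retrieval-encoder embedding as input and soft tokens prepended to the prompt embeddings; the shapes, pooling, and number of tokens are assumptions.

```python
# Illustrative only: compress a retrieved code snippet's encoder representation
# into k soft "context tokens" prepended to the code LLM's input embeddings.
# Dimensions, pooling, and k are assumptions about LlavaCode's interface.
import torch
import torch.nn as nn

class CodeContextProjector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=2048, num_tokens=4):
        super().__init__()
        self.num_tokens, self.llm_dim = num_tokens, llm_dim
        self.proj = nn.Linear(enc_dim, num_tokens * llm_dim)

    def forward(self, snippet_emb):                      # (B, enc_dim) pooled snippet embedding
        soft = self.proj(snippet_emb)                    # (B, num_tokens * llm_dim)
        return soft.view(-1, self.num_tokens, self.llm_dim)

projector = CodeContextProjector()
snippet_emb = torch.randn(2, 1024)                       # retrieval-encoder output (stand-in)
prompt_emb = torch.randn(2, 50, 2048)                    # token embeddings of the local prefix

soft_tokens = projector(snippet_emb)                     # (2, 4, 2048)
llm_input = torch.cat([soft_tokens, prompt_emb], dim=1)  # only 4 extra positions vs. full RAG context
print(llm_input.shape)
```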
[64] Unraveling Emotions with Pre-Trained Models
Alejandro Pajón-Sanmartín, Francisco De Arriba-Pérez, Silvia García-Méndez, Fátima Leal, Benedita Malheiro, Juan Carlos Burguillo-Rial
Main category: cs.CL
TL;DR: This paper compares fine-tuning vs prompt engineering for emotion detection in LLMs, finding that fine-tuned models achieve >70% accuracy while LLMs need structured prompts and emotion grouping.
Details
Motivation: Transformer models have advanced emotion recognition but struggle with open-ended queries due to contextual ambiguity, linguistic variability, and complex emotional expressions, making direct application of generalist models difficult.
Method: Compares fine-tuning pre-trained models vs prompt engineering in three scenarios: (i) fine-tuned models vs LLMs with simple prompts, (ii) different emotion prompt designs with LLMs, and (iii) impact of emotion grouping techniques.
Result: Fine-tuned pre-trained models achieved metrics above 70% for emotion recognition. LLMs require structured prompt engineering and emotion grouping to enhance their performance.
Conclusion: The advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.
Abstract: Transformer models have significantly advanced the field of emotion recognition. However, there are still open challenges when exploring open-ended queries for Large Language Models (LLMs). Although current models offer good results, automatic emotion analysis in open texts presents significant challenges, such as contextual ambiguity, linguistic variability, and difficulty interpreting complex emotional expressions. These limitations make the direct application of generalist models difficult. Accordingly, this work compares the effectiveness of fine-tuning and prompt engineering in emotion detection in three distinct scenarios: (i) performance of fine-tuned pre-trained models and general-purpose LLMs using simple prompts; (ii) effectiveness of different emotion prompt designs with LLMs; and (iii) impact of emotion grouping techniques on these models. Experimental tests attain metrics above 70% with a fine-tuned pre-trained model for emotion recognition. Moreover, the findings highlight that LLMs require structured prompt engineering and emotion grouping to enhance their performance. These advancements improve sentiment analysis, human-computer interaction, and understanding of user behavior across various domains.
[65] DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference
Xiang Liu, Xuming Hu, Xiaowen Chu, Eunsol Choi
Main category: cs.CL
TL;DR: DiffAdapt is a lightweight framework that improves LLM reasoning efficiency by selecting appropriate inference strategies (Easy/Normal/Hard) based on question difficulty and reasoning trace entropy, reducing token usage by up to 22.4% without fine-tuning base models.
Details
Motivation: LLMs often generate unnecessarily long thinking traces with unclear utility, leading to inefficient overthinking on easy problems while maintaining high accuracy.
Method: Analyze entropy patterns in reasoning traces, then use DiffAdapt - a small probe that classifies LLM’s final hidden state to select optimal inference strategies (prompt, temperature, max tokens) per question based on difficulty.
Result: Achieves comparable or improved accuracy while reducing token usage by up to 22.4% across five models and eight benchmarks, with observed 22-25% entropy reduction from easy to medium difficulty problems.
Conclusion: Provides a practical path toward compute-efficient reasoning by addressing overthinking through adaptive inference strategies without requiring base model fine-tuning.
Abstract: Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. First, we analyze the entropy of token probabilities in reasoning traces. Across three models, we observe a consistent U-shaped entropy pattern: high entropy on easy problems despite high accuracy, low entropy on problems with medium difficulty, and high entropy on hard problems reflecting uncertainty. Specifically, we notice 22–25% entropy reduction from easy to medium difficulty regions, suggesting an “overthinking” phenomenon on easy instances. Building on these insights, we introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy. Each inference strategy consists of a fixed prompt, temperature and maximum token length. In contrast to existing efficiency optimization methods, our approach does not fine-tune the base LLM but a small probe that classifies the LLM’s final hidden state, allowing inexpensive adaptation. We comprehensively evaluate our method on five models and eight benchmarks. Our method achieves comparable or improved accuracy while reducing token usage by up to 22.4%, establishing a practical path toward compute-efficient reasoning.
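Two ingredients are concrete enough to sketch: the mean token entropy of a reasoning trace and a small probe over the final hidden state that selects an Easy/Normal/Hard configuration. The configurations, dimensions, and stand-in tensors below are placeholders, not the paper's settings.

```python
# Illustrative only: mean token entropy of a trace plus a tiny probe that maps the
# final hidden state to an Easy/Normal/Hard inference strategy. All configurations
# and dimensions are placeholders, not DiffAdapt's actual settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_token_entropy(logits):                   # (T, vocab) logits of a reasoning trace
    logp = F.log_softmax(logits, dim=-1)
    return float(-(logp.exp() * logp).sum(-1).mean())

STRATEGIES = {                                    # placeholder inference configs
    0: {"name": "easy", "temperature": 0.0, "max_tokens": 512},
    1: {"name": "normal", "temperature": 0.6, "max_tokens": 2048},
    2: {"name": "hard", "temperature": 0.8, "max_tokens": 8192},
}

probe = nn.Linear(4096, 3)                        # classifies the LLM's final hidden state

logits = torch.randn(300, 32000)                  # stand-in trace logits
final_hidden = torch.randn(1, 4096)               # stand-in final hidden state

entropy = mean_token_entropy(logits)
strategy = STRATEGIES[int(probe(final_hidden).argmax(dim=-1))]
print(entropy, strategy)
```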
[66] CoSense-LLM: Semantics at the Edge with Cost- and Uncertainty-Aware Cloud-Edge Cooperation
Hasan Akgul, Mari Eplik, Javier Rojas, Aina Binti Abdullah, Pieter van der Merwe
Main category: cs.CL
TL;DR: CoSense-LLM is an edge-first framework that processes multimodal sensor data into semantic tokens for LLM coordination while meeting latency, energy, bandwidth, and privacy constraints through four key components: SenseFusion, Edge-RAG, PromptRouter, and Secure Execution.
Details
Motivation: To enable large language model deployments in interference-prone environments by addressing the challenges of latency, energy consumption, bandwidth limitations, and privacy concerns when processing continuous multimodal sensor streams.
Method: The framework consists of four main components: (1) SenseFusion - lightweight encoder for sensor embedding alignment and compression, (2) Edge-RAG - local hybrid retrieval for site-specific grounding, (3) PromptRouter - cost-aware policy for edge/cloud routing, and (4) Secure Execution - auditable redaction for data minimization. It integrates modern serving optimizations like KV caches, FlashAttention, speculative decoding, and quantized LoRA adapters.
Result: CoSense-LLM achieves sub-second (p95) end-to-end latency on edge paths, reduces inter-tier token and bandwidth costs through local retrieval, and preserves privacy by transmitting only discrete codes and redacted metadata. Edge-RAG improves factual consistency, uncertainty calibration enables selective abstention, and accelerators lower energy per decision.
Conclusion: The results validate an edge-first design approach that treats semantics, privacy, and predictable latency as equally important goals for deploying large models in challenging environments, demonstrating effective coordination between sensor processing and language models under real-world constraints.
Abstract: We present CoSense-LLM, an edge-first framework that turns continuous multimodal sensor streams (for example Wi-Fi CSI, IMU, audio, RFID, and lightweight vision) into compact, verifiable semantic tokens and coordinates with large language models under explicit latency, energy, bandwidth, and privacy constraints. CoSense-LLM has four parts: (i) SenseFusion, a lightweight encoder that aligns sensor embeddings with language and compresses them into short discrete code sequences; (ii) Edge-RAG, a local hybrid retrieval layer that grounds generation in site specific policies and notes; (iii) PromptRouter, a cost and uncertainty aware policy that selects edge only generation, edge plus retrieval, or compact cloud escalation; and (iv) Secure Execution, an auditable redaction path that enforces data minimization so raw waveforms never leave the device. The system works with modern serving optimizations, including paged or streaming KV caches, FlashAttention style kernels, speculative decoding, and quantized LoRA adapters, and supports on device personalization and federated updates under non IID drift. Across home, office, and clinic deployments, CoSense-LLM delivers grounded explanations while meeting tight service level objectives: it sustains sub second (p95) end to end latency on edge dominant paths, reduces inter tier token and bandwidth costs by preferring local retrieval grounded responses, and preserves privacy by transmitting only discrete codes and redacted metadata. Ablations show that Edge-RAG improves factual consistency and reduces contradictions, calibrated uncertainty enables selective abstention and controlled escalations, and KV plus decoding accelerators lower energy per decision. The results support an edge first design that treats semantics, privacy, and predictable latency as co equal goals for large model deployments in interference prone environments.
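Of the four components, PromptRouter is the decision point: it picks edge-only generation, edge plus retrieval, or cloud escalation from cost and uncertainty signals. The abstract gives no concrete policy, so the sketch below is a rule-based stand-in with hypothetical signal names and thresholds.

```python
# Illustrative only: a rule-based stand-in for a cost- and uncertainty-aware router
# choosing between edge-only, edge+retrieval, and cloud escalation. Thresholds and
# field names are hypothetical; the real policy is likely learned or calibrated.
from dataclasses import dataclass

@dataclass
class RoutingSignals:
    uncertainty: float        # calibrated uncertainty of the edge model's draft answer
    latency_budget_ms: float  # remaining service-level latency budget
    uplink_cost: float        # relative cost of sending tokens to the cloud
    privacy_sensitive: bool   # raw data must not leave the device

def route(s: RoutingSignals) -> str:
    if s.privacy_sensitive or s.latency_budget_ms < 300:
        return "edge_only"                      # stay local under tight SLO or privacy rules
    if s.uncertainty < 0.2:
        return "edge_only"
    if s.uncertainty < 0.5 or s.uplink_cost > 1.0:
        return "edge_plus_retrieval"            # ground the answer locally before responding
    return "cloud_escalation"                   # compact escalation for the hard residue

print(route(RoutingSignals(uncertainty=0.35, latency_budget_ms=800,
                           uplink_cost=0.4, privacy_sensitive=False)))
```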
[67] Are Large Language Models Sensitive to the Motives Behind Communication?
Addison J. Wu, Ryan Liu, Kerem Oktar, Theodore R. Sumers, Thomas L. Griffiths
Main category: cs.CL
TL;DR: LLMs can detect and discount biased information in controlled settings but struggle with real-world scenarios like online ads. Simple interventions that highlight motivations improve their performance.
Details
Motivation: Human communication is motivated by intent, and for LLMs to be effective in real-world applications, they need to critically evaluate content by considering the motivations of the source.
Method: Used controlled experiments from cognitive science to test LLMs' ability to discount biased information, then evaluated on sponsored online ads. Applied a steering intervention to boost awareness of intentions.
Result: LLMs successfully discount biased information in controlled settings but perform poorly in naturalistic online ad scenarios. The steering intervention significantly improved their performance.
Conclusion: LLMs have basic sensitivity to motivations but need improvements to generalize effectively to real-world settings.
Abstract: Human communication is motivated: people speak, write, and create content with a particular communicative intent in mind. As a result, information that large language models (LLMs) and AI agents process is inherently framed by humans’ intentions and incentives. People are adept at navigating such nuanced information: we routinely identify benevolent or self-serving motives in order to decide what statements to trust. For LLMs to be effective in the real world, they too must critically evaluate content by factoring in the motivations of the source – for instance, weighing the credibility of claims made in a sales pitch. In this paper, we undertake a comprehensive study of whether LLMs have this capacity for motivational vigilance. We first employ controlled experiments from cognitive science to verify that LLMs’ behavior is consistent with rational models of learning from motivated testimony, and find they successfully discount information from biased sources in a human-like manner. We then extend our evaluation to sponsored online adverts, a more naturalistic reflection of LLM agents’ information ecosystems. In these settings, we find that LLMs’ inferences do not track the rational models’ predictions nearly as closely – partly due to additional information that distracts them from vigilance-relevant considerations. However, a simple steering intervention that boosts the salience of intentions and incentives substantially increases the correspondence between LLMs and the rational model. These results suggest that LLMs possess a basic sensitivity to the motivations of others, but generalizing to novel real-world settings will require further improvements to these models.
[68] Do Prompts Reshape Representations? An Empirical Study of Prompting Effects on Embeddings
Cesar Gonzalez-Gutierrez, Dirk Hovy
Main category: cs.CL
TL;DR: Study examines how prompting affects LM internal representations in zero-shot settings, finding that prompt relevance doesn’t consistently correlate with representation quality.
Details
Motivation: To understand the mechanisms behind LMs' ability to perform diverse tasks without supervision by studying the relationship between prompting and internal representation quality.
Method: Conducted probing experiments on prompt embeddings using various prompt template combinations for zero-shot classification tasks.
Result: Prompting affects representation quality, but these changes don’t consistently correlate with prompt relevance to target tasks, challenging assumptions about relevant prompts.
Conclusion: More relevant prompts don’t necessarily lead to better representations, suggesting complex factors influence how prompting affects LM internal representations.
Abstract: Prompting is a common approach for leveraging LMs in zero-shot settings. However, the underlying mechanisms that enable LMs to perform diverse tasks without task-specific supervision remain poorly understood. Studying the relationship between prompting and the quality of internal representations can shed light on how pre-trained embeddings may support in-context task solving. In this empirical study, we conduct a series of probing experiments on prompt embeddings, analyzing various combinations of prompt templates for zero-shot classification. Our findings show that while prompting affects the quality of representations, these changes do not consistently correlate with the relevance of the prompts to the target task. This result challenges the assumption that more relevant prompts necessarily lead to better representations. We further analyze potential factors that may contribute to this unexpected behavior.
[69] From Answers to Guidance: A Proactive Dialogue System for Legal Documents
Ashish Chouhan, Michael Gertz
Main category: cs.CL
TL;DR: EUDial is a multi-turn dialogue dataset from EU parliamentary blogs, and LexGuide is a framework using retrieval-augmented generation with hierarchical topics to improve legal information accessibility for citizens.
Details
Motivation: Legal information is hard for laypeople to understand despite EU's open access to legislation. There's a gap between available legal resources and citizen comprehension.
Method: Created EUDial dataset from 204 parliamentary blogs (880 dialogue turns), and developed LexGuide framework using retrieval-augmented generation with hierarchical topic organization for structured dialogue progression.
Result: Proactive, structured navigation effectively bridges the gap between legal information availability and citizen understanding.
Conclusion: EUDial and LexGuide provide practical resources for advancing proactive legal dialogue systems that make legal information more accessible to citizens.
Abstract: The accessibility of legal information remains a constant challenge, particularly for laypersons seeking to understand and apply complex institutional texts. While the European Union provides open access to legislation, parliamentary responses, and regulatory documents, these resources can be challenging for laypeople to explore. In this paper, we introduce EUDial, a proactive multi-turn dialogue dataset constructed from 204 blogs curated by the Citizens’ Enquiries Unit (AskEP) of the European Parliamentary Research Service. EUDial contains 880 dialogue turns (averaging 4.3 turns per dialogue), where each dialogue includes initial questions, structured answers, and follow-up questions. Beyond dataset construction, we propose the LexGuide framework that leverages retrieval-augmented generation with hierarchical topic organization to structure dialogue progression, ensuring both comprehensive coverage of legal aspects and coherence across conversational turns. The results demonstrate that proactive, structured navigation closes the gap between the availability of legal information and citizen comprehension, establishing EUDial and LexGuide as practical resources for advancing proactive legal dialogue systems.
[70] Zhyper: Factorized Hypernetworks for Conditioned LLM Fine-Tuning
M. H. I. Abdalla, Zhipin Wang, Christian Frey, Steffen Eger, Josif Grabocka
Main category: cs.CL
TL;DR: Zhyper is a parameter-efficient hypernetwork framework that generates context-aware LoRA adapters from text descriptions for LLM conditioning, achieving competitive performance with 26x fewer parameters than SOTA methods.
Details
Motivation: Prompt engineering fails to ensure LLMs behave according to specific cultural or semantic conditioning due to pre-training biases, and existing fine-tuning methods require too many parameters.
Method: Proposed Zhyper - a factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions for parameter-efficient LLM conditioning.
Result: Zhyper achieves competitive performance on multiple benchmarks with up to 26x fewer parameters than state-of-the-art baselines, and shows improved generalization to out-of-domain settings in cultural alignment.
Conclusion: Zhyper provides an effective parameter-efficient solution for LLM conditioning that captures fine-grained contextual values while significantly reducing parameter requirements.
Abstract: Large Language Model (LLM) conditioning refers to instructing an LLM to generate content in accordance with the norms and values of a specific culture, beliefs of a particular political orientation, or any desired text-specified semantic conditioning. Unfortunately, prompt engineering does not ensure that LLMs behave in accordance with a desired conditioning due to the inductive bias of the pre-training and alignment datasets. Prior works have focused on fine-tuning LLMs by directly conditioning the LoRA weights; however, such methods introduce a large number of parameters. As a remedy, we propose Zhyper, a parameter-efficient factorized hypernetwork framework that generates context-aware LoRA adapters from textual descriptions. Experiments on multiple benchmarks show that Zhyper achieves competitive performance with up to 26x fewer parameters than the state-of-the-art baselines. Furthermore, we extend Zhyper to cultural alignment, demonstrating improved generalization to out-of-domain settings and a better capturing of fine-grained contextual values.
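As a rough illustration of the idea of generating LoRA adapters from a conditioning description, here is a minimal sketch of a factorized hypernetwork; all module names, dimensions, and the scaling-based factorization are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch: condition a LoRA update on a text-description embedding.
# Names and dimensions are hypothetical, not Zhyper's released code.
import torch
import torch.nn as nn

class FactorizedLoRAHypernet(nn.Module):
    def __init__(self, ctx_dim=768, hidden=256, d_model=1024, rank=8):
        super().__init__()
        # Shared low-rank LoRA factors, as in standard LoRA.
        self.lora_A = nn.Parameter(torch.randn(rank, d_model) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_model, rank))
        # Hypernetwork: maps a context embedding to per-rank scaling factors,
        # so only a tiny vector is generated per conditioning description.
        self.hyper = nn.Sequential(
            nn.Linear(ctx_dim, hidden), nn.ReLU(), nn.Linear(hidden, rank)
        )

    def delta_weight(self, ctx_emb):
        # ctx_emb: embedding of the textual conditioning description, shape (ctx_dim,)
        scales = self.hyper(ctx_emb)                      # (rank,)
        # Context-aware LoRA update: B @ diag(scales) @ A -> (d_model, d_model)
        return self.lora_B @ torch.diag(scales) @ self.lora_A

hyper = FactorizedLoRAHypernet()
ctx = torch.randn(768)             # e.g., an embedding of a cultural-norms description
delta_W = hyper.delta_weight(ctx)  # added to a frozen base weight at inference
print(delta_W.shape)               # torch.Size([1024, 1024])
```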
[71] SmartSwitch: Advancing LLM Reasoning by Overcoming Underthinking via Promoting Deeper Thought Exploration
Xichen Zhang, Sitong Wu, Haoru Tan, Shaozuo Yu, Yinghao Zhu, Ziyi He, Jiaya Jia
Main category: cs.CL
TL;DR: SmartSwitch is a plug-and-play inference framework that detects and addresses underthinking in LLMs by monitoring thought switches and guiding models to explore promising but abandoned reasoning paths.
Details
Motivation: Address the underthinking problem in LLMs where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limiting performance and token efficiency.
Method: Uses a perception module to detect thought switches and evaluate potential of preceding thoughts using a process reward model. When high-potential thoughts are prematurely abandoned, an intervention module backtracks and inserts deepening prompts to encourage further exploration.
Result: Extensive experiments on challenging mathematical reasoning benchmarks show significant performance improvements across various LLMs of different sizes.
Conclusion: SmartSwitch effectively enhances LLM reasoning by preventing premature thought abandonment and promoting deeper exploration of promising reasoning paths.
Abstract: The long chain-of-thought (LongCoT) capability is central to the recent breakthroughs achieved by large language models in complex reasoning tasks. However, the accompanying issue of ‘‘underthinking’’, where models exhibit shallow reasoning by frequently switching thoughts without sufficient exploration, limits both performance and token efficiency. To address this problem, we propose a simple yet effective reasoning strategy: the SmartSwitch inference framework. This framework can be easily integrated into any large language model as a plug-and-play solution, continuously monitoring the model’s reasoning process to detect underthinking and guide it toward deeper exploration of promising but overlooked thoughts. Specifically, the perception module identifies points where thoughts switch and evaluates the potential of the preceding thought using an off-the-shelf process reward model (PRM). If a high-potential thought is found to be prematurely abandoned, the intervention module interrupts the ongoing inference, backtracks to the point before the switch, and inserts a “deepening prompt” to encourage further exploration along that promising path. Extensive experiments on challenging mathematical reasoning benchmarks demonstrate that our method significantly enhances the performance of various large language models of different sizes.
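A schematic sketch of this kind of control loop is below; `llm.generate_until_switch` (assumed to return text up to, but not including, a detected switch phrase) and `prm_score` are placeholder interfaces, and the threshold and prompt wording are assumptions rather than the paper's settings.

```python
# Schematic SmartSwitch-style inference loop (illustrative only).
DEEPENING_PROMPT = "\nWait, let's explore this line of reasoning further before moving on.\n"

def smartswitch_decode(llm, prm_score, question, max_segments=16, tau=0.7):
    trace = question
    for _ in range(max_segments):
        # Generate until the model either finishes or is about to switch thoughts
        # (e.g., emits "Alternatively, ..."); the switch phrase itself is withheld.
        segment, switched = llm.generate_until_switch(trace)
        trace += segment
        if not switched:
            break                      # reached a final answer
        if prm_score(question, trace) >= tau:
            # The preceding thought looks promising: suppress the switch and push deeper.
            trace += DEEPENING_PROMPT
        # Otherwise the model is allowed to switch to a new thought on the next pass.
    return trace
```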
[72] AdaSPEC: Selective Knowledge Distillation for Efficient Speculative Decoders
Yuezhou Hu, Jiaxin Guo, Xinyu Feng, Tuo Zhao
Main category: cs.CL
TL;DR: AdaSPEC improves speculative decoding by using selective token filtering in knowledge distillation to enhance draft-target model alignment and increase token acceptance rates.
Details
Motivation: Conventional knowledge distillation methods misalign with speculative decoding's goal of maximizing token acceptance, as they minimize KL divergence across all tokens, causing draft models to struggle with target model knowledge assimilation due to capacity constraints.
Method: AdaSPEC incorporates selective token filtering into knowledge distillation using a reference model to identify and filter out difficult-to-fit tokens, allowing the draft model to better align with the target model on simpler tokens.
Result: AdaSPEC consistently outperforms DistillSpec across diverse tasks (arithmetic reasoning, instruction-following, coding, summarization) with model configurations of 31M/1.4B and 350M/2.7B parameters, achieving up to 15% higher acceptance rates.
Conclusion: AdaSPEC improves speculative decoding performance by aligning knowledge distillation with the true objective of maximizing token acceptance rate, enhancing draft model effectiveness without compromising generation quality.
Abstract: Speculative Decoding (SD) accelerates large language model inference by employing a small draft model to generate predictions, which are then verified by a larger target model. The effectiveness of SD hinges on the alignment between these models, which is typically enhanced by Knowledge Distillation (KD). However, conventional KD methods aim to minimize the KL divergence between the draft and target models across all tokens, a goal that is misaligned with the true objective of SD, which is to maximize token acceptance rate. Therefore, draft models often struggle to fully assimilate the target model’s knowledge due to capacity constraints, leading to suboptimal performance. To address this challenge, we propose AdaSPEC, a novel method that incorporates selective token filtering into the KD process. AdaSPEC utilizes a reference model to identify and filter out difficult-to-fit tokens, enabling the distillation of a draft model that better aligns with the target model on simpler tokens. This approach improves the overall token acceptance rate without compromising generation quality. We evaluate AdaSPEC across diverse tasks, including arithmetic reasoning, instruction-following, coding, and summarization, using model configurations of 31M/1.4B and 350M/2.7B parameters. Our results demonstrate that AdaSPEC consistently outperforms the state-of-the-art DistillSpec method, achieving higher acceptance rates across all tasks (up to 15%). The code is publicly available at https://github.com/yuezhouhu/adaspec.
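To make the selective-filtering idea concrete, here is a toy sketch of a distillation loss that masks out the tokens a reference model finds hardest; the keep ratio, shapes, and the exact filtering rule are assumptions for illustration, not AdaSPEC's implementation.

```python
# Toy sketch: selective token filtering for draft-model distillation.
import torch
import torch.nn.functional as F

def selective_kd_loss(draft_logits, target_logits, ref_logits, labels, keep_ratio=0.7):
    # Per-token difficulty under the reference model (cross-entropy on the gold token).
    ref_ce = F.cross_entropy(
        ref_logits.view(-1, ref_logits.size(-1)), labels.view(-1), reduction="none"
    )
    k = max(1, int(keep_ratio * ref_ce.numel()))
    keep = torch.zeros_like(ref_ce, dtype=torch.bool)
    keep[ref_ce.topk(k, largest=False).indices] = True   # keep the easier tokens

    # Standard KD (KL between target and draft distributions), only on kept tokens.
    kl = F.kl_div(
        F.log_softmax(draft_logits.view(-1, draft_logits.size(-1)), dim=-1),
        F.softmax(target_logits.view(-1, target_logits.size(-1)), dim=-1),
        reduction="none",
    ).sum(-1)
    return (kl * keep).sum() / keep.sum()

# Toy shapes: 6 token positions, vocabulary of 11.
logits = lambda: torch.randn(6, 11)
print(selective_kd_loss(logits(), logits(), logits(), torch.randint(0, 11, (6,))))
```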
[73] Adapting Multilingual Models to Code-Mixed Tasks via Model Merging
Prashant Kodali, Vaishnavi Shivkumar, Swarang Joshi, Monojit Choudhary, Ponnurangam Kumaraguru, Manish Shrivastava
Main category: cs.CL
TL;DR: Model merging outperforms conventional adaptation methods for code-mixed NLP, achieving 2-5 F1 point gains over full fine-tuning and better cross-language transfer than monolingual baselines.
Details
Motivation: To develop more effective adaptation strategies for code-mixed NLP that better leverage unlabeled data and improve performance on low-resource language pairs.
Method: Three-step approach: (1) continued pre-training on unlabeled code-mixed text, (2) merging adapted checkpoint with base multilingual model, (3) fine-tuning on downstream task data. Evaluated on English-Hindi and English-Spanish sentence classification.
Result: Merged models consistently outperform full fine-tuning (2-5 F1 points gain) and CPT->FT (~1-2 points gain). Better cross-pair transfer: 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning. Zero/few-shot LLM prompting lags behind fine-tuned models.
Conclusion: Model merging effectively leverages unlabeled code-mixed data and provides reliable adaptation recipes for different data regimes. Code-mixed knowledge serves as better substrate for low-resource language pairs than monolingual approaches.
Abstract: We study model merging as a practical alternative to conventional adaptation strategies for code-mixed NLP. Starting from a multilingual base model, we: (i) perform continued pre-training (CPT) on unlabeled code-mixed text to obtain an adapted checkpoint, (ii) merge the adapted checkpoint with the base model, and (iii) fine-tune (FT) on the downstream task data. We evaluate our approach for sentence classification (sentiment and hate speech) tasks in English-Hindi (En-Hi) and English-Spanish (En-Es) using XLM-R and Llama-3.2-1B models. Our results show that merged models consistently outperform full fine-tuning and CPT->FT. We observe gains of 2–5 points in F1 over full fine-tuning and ~1-2 points over CPT->FT, indicating that unlabeled data is leveraged more effectively via merging than via CPT alone. Zero-/few-shot prompting with larger LLMs (e.g., Llama-3.3-70B) lags behind fine-tuned and merged checkpoints, underscoring the limits of in-context learning for code-mixed inputs. We further test cross-pair transfer by training on En-Hi and evaluating on En-Ta and En-Ml: merged checkpoints transfer more strongly than monolingual-English baselines (e.g., TV/TIES variants reaching 0.65-0.68 F1 vs 0.61-0.63 for full fine-tuning), suggesting that code-mixed knowledge is a more reliable substrate for low-resource pairs. We conclude with adaptation recipes matched to common data regimes (labeled only; labeled+unlabeled; transfer-only) and discuss limitations and scaling considerations for broader tasks and larger models.
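Step (ii) can be pictured as a task-vector style interpolation between the base and CPT checkpoints; the sketch below uses a single merge coefficient on plain state dicts, which is a simplifying assumption rather than the paper's exact TV/TIES recipe.

```python
# Minimal sketch of merging a continued-pretrained (CPT) checkpoint into the base model.
import torch

def merge_task_vector(base_state, cpt_state, alpha=0.5):
    """new = base + alpha * (cpt - base); alpha=1 recovers the CPT checkpoint."""
    return {name: w + alpha * (cpt_state[name] - w) for name, w in base_state.items()}

# Toy example with two-tensor "models".
base = {"layer.weight": torch.zeros(2, 2), "layer.bias": torch.zeros(2)}
cpt  = {"layer.weight": torch.ones(2, 2),  "layer.bias": torch.ones(2)}
print(merge_task_vector(base, cpt, alpha=0.5)["layer.weight"])  # tensor of 0.5s
```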
[74] ToolDreamer: Instilling LLM Reasoning Into Tool Retrievers
Saptarshi Sengupta, Zhengyu Zhou, Jun Araki, Xingbo Wang, Bingqing Wang, Suhang Wang, Zhe Feng
Main category: cs.CL
TL;DR: ToolDreamer improves tool retrieval for LLMs by generating synthetic tool descriptions that better align with user queries, enabling more effective handling of large tool sets without exceeding context window limits.
Details
Motivation: Existing tool retrieval methods fail because user queries often don't match the language of tool descriptions, leading to suboptimal retrieval when dealing with large tool sets that exceed LLM context windows.
Method: Proposes ToolDreamer framework that conditions retriever models to fetch tools based on hypothetical (synthetic) tool descriptions generated by an LLM, creating better alignment between queries and tools.
Result: Applied on ToolRet dataset, ToolDreamer improves performance of both sparse and dense retrievers with and without training, demonstrating flexibility and effectiveness.
Conclusion: The framework successfully offloads reasoning burden to the retriever, allowing LLMs to handle large tool collections effectively without overwhelming their context windows.
Abstract: Tool calling has become increasingly popular for Large Language Models (LLMs). However, for large tool sets, the resulting tokens would exceed the LLM’s context window limit, making it impossible to include every tool. Hence, an external retriever is used to provide LLMs with the most relevant tools for a query. Existing retrieval models rank tools based on the similarity between a user query and a tool description (TD). This leads to suboptimal retrieval as user requests are often poorly aligned with the language of TD. To remedy the issue, we propose ToolDreamer, a framework to condition retriever models to fetch tools based on hypothetical (synthetic) TD generated using an LLM, i.e., description of tools that the LLM feels will be potentially useful for the query. The framework enables a more natural alignment between queries and tools within the language space of TD’s. We apply ToolDreamer on the ToolRet dataset and show that our method improves the performance of sparse and dense retrievers with and without training, thus showcasing its flexibility. Through our proposed framework, our aim is to offload a portion of the reasoning burden to the retriever so that the LLM may effectively handle a large collection of tools without inundating its context window.
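The core retrieval step can be sketched as follows: an LLM first imagines what a useful tool's description would look like for the query, and ranking then happens in description space rather than query space. The `llm` and `embed` callables and the prompt wording are assumed interfaces, not a specific library's API.

```python
# Schematic sketch of retrieval via hypothetical tool descriptions.
import numpy as np

def retrieve_tools(query, tool_descriptions, llm, embed, top_k=3):
    # 1) Ask the LLM to imagine a tool description that would help answer the query.
    hypothetical_td = llm(
        f"Describe a tool that would be useful to answer: {query}\nTool description:"
    )
    # 2) Score real tool descriptions against the hypothetical one, not the raw query.
    q = embed(hypothetical_td)
    scores = [float(np.dot(q, embed(td))) for td in tool_descriptions]
    ranked = np.argsort(scores)[::-1][:top_k]
    return [tool_descriptions[i] for i in ranked]
```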
[75] The Art of Asking: Multilingual Prompt Optimization for Synthetic Data
David Mora, Viraat Aryabumi, Wei-Yin Ko, Sara Hooker, Julia Kreutzer, Marzieh Fadaee
Main category: cs.CL
TL;DR: A lightweight framework for optimizing multilingual prompts through naturalness, cultural adaptation, and difficulty enhancement, achieving significant performance improvements over translation-only baselines.
Details
Motivation: Current synthetic data for multilingual LLMs relies on translation-based prompts that inherit English-centric framing and neglect cultural dimensions, limiting model generalization.
Method: Systematic transformation of translated prompts for Naturalness, Cultural Adaptation, and Difficulty Enhancement across 12 languages spanning 7 families using an off-the-shelf multilingual LLM.
Result: Substantial improvements over translation-only baseline: +4.7% on Global-MMLU accuracy, +2.4% on Flores XCometXL, and +35.3% wins in preferences on mArenaHard.
Conclusion: Prompt-space optimization is a simple yet powerful paradigm for building more robust, culturally grounded, and globally capable multilingual LLMs.
Abstract: Synthetic data has become a cornerstone for scaling large language models, yet its multilingual use remains bottlenecked by translation-based prompts. This strategy inherits English-centric framing and style and neglects cultural dimensions, ultimately constraining model generalization. We argue that the overlooked prompt space, the very inputs that define training distributions, offers a more powerful lever for improving multilingual performance. We introduce a lightweight framework for prompt-space optimization, where translated prompts are systematically transformed for Naturalness, Cultural Adaptation, and Difficulty Enhancement. Using an off-the-shelf multilingual LLM, we apply these transformations to prompts for 12 languages spanning 7 families. Under identical data conditions, our approaches achieve substantial and consistent downstream improvements over the translation-only baseline: +4.7% on Global-MMLU accuracy, +2.4% on Flores XCometXL and +35.3% wins in preferences on mArenaHard. We establish prompt-space optimization as a simple yet powerful paradigm for building multilingual LLMs that are more robust, culturally grounded, and globally capable.
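A rough sketch of how the three prompt-space transformations could be chained with an off-the-shelf multilingual LLM is shown below; the meta-prompt wording and the `llm` callable are illustrative assumptions, and the paper's actual instructions may differ.

```python
# Illustrative prompt-space transformation pipeline.
TRANSFORMS = {
    "naturalness": "Rewrite this prompt so it sounds like a native {lang} speaker wrote it:\n{prompt}",
    "cultural_adaptation": "Adapt this prompt to reference entities and situations familiar in {lang}-speaking cultures:\n{prompt}",
    "difficulty": "Rewrite this prompt so it requires deeper reasoning to answer well, without changing the topic:\n{prompt}",
}

def optimize_prompt(prompt, lang, llm,
                    order=("naturalness", "cultural_adaptation", "difficulty")):
    for name in order:
        prompt = llm(TRANSFORMS[name].format(lang=lang, prompt=prompt))
    return prompt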
[76] Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning
Xichen Zhang, Sitong Wu, Yinghao Zhu, Haoru Tan, Shaozuo Yu, Ziyi He, Jiaya Jia
Main category: cs.CL
TL;DR: Scaf-GRPO addresses the ’learning cliff’ problem in RL for LLMs by providing tiered hints when learning plateaus, enabling models to solve problems beyond their current capabilities.
Details
Motivation: To overcome the 'learning cliff' phenomenon where models fail on problems far beyond their current capabilities, resulting in zero-reward signals that stall learning progress in policy optimization algorithms like GRPO.
Method: A progressive training framework that diagnoses learning stagnation and intervenes by injecting tiered in-prompt hints (from abstract concepts to concrete steps) to help models construct valid solutions independently.
Result: Boosted the pass@1 score of Qwen2.5-Math-7B model on AIME24 benchmark by 44.3% relative to vanilla GRPO baseline, demonstrating effectiveness in solving previously unsolvable problems.
Conclusion: Scaf-GRPO provides a robust methodology for extending LLMs’ autonomous reasoning capabilities by enabling them to tackle problems beyond their current reach through strategic scaffolding.
Abstract: Reinforcement learning from verifiable rewards has emerged as a powerful technique for enhancing the complex reasoning abilities of Large Language Models (LLMs). However, these methods are fundamentally constrained by the “learning cliff” phenomenon: when faced with problems far beyond their current capabilities, models consistently fail, yielding a persistent zero-reward signal. In policy optimization algorithms like GRPO, this collapses the advantage calculation to zero, rendering these difficult problems invisible to the learning gradient and stalling progress. To overcome this, we introduce Scaf-GRPO (Scaffolded Group Relative Policy Optimization), a progressive training framework that strategically provides minimal guidance only when a model’s independent learning has plateaued. The framework first diagnoses learning stagnation and then intervenes by injecting tiered in-prompt hints, ranging from abstract concepts to concrete steps, enabling the model to construct a valid solution by itself. Extensive experiments on challenging mathematics benchmarks demonstrate Scaf-GRPO’s effectiveness, boosting the pass@1 score of the Qwen2.5-Math-7B model on the AIME24 benchmark by a relative 44.3% over a vanilla GRPO baseline. This result demonstrates that our framework provides a robust and effective methodology for unlocking a model’s ability to solve problems previously beyond its reach, a critical step towards extending the frontier of autonomous reasoning in LLMs.
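The scaffolding trigger can be pictured as follows: if every rollout in a GRPO group earns zero reward, re-sample with progressively more concrete hints until some rollout succeeds. The hint tiers and the `sample_group`/`reward_fn` callables are illustrative assumptions, not the paper's exact procedure.

```python
# Schematic sketch of tiered hint injection on the "learning cliff".
HINT_TIERS = [
    "Hint: think about which theorem applies here.",               # abstract concept
    "Hint: start by rewriting the expression in a simpler form.",  # strategy
    "Hint: the first step is to substitute x = 2t and simplify.",  # concrete step
]

def scaffolded_rollouts(problem, sample_group, reward_fn):
    rollouts = sample_group(problem)                 # e.g., G completions per problem
    if any(reward_fn(problem, r) > 0 for r in rollouts):
        return problem, rollouts                     # normal GRPO update, no scaffolding
    for hint in HINT_TIERS:                          # escalate only as far as needed
        prompted = f"{problem}\n{hint}"
        rollouts = sample_group(prompted)
        if any(reward_fn(problem, r) > 0 for r in rollouts):
            return prompted, rollouts
    return problem, rollouts                         # still unsolved; skip or keep as-is
```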
[77] Hubble: a Model Suite to Advance the Study of LLM Memorization
Johnny Tian-Zheng Wei, Ameya Godbole, Mohammad Aflah Khan, Ryan Wang, Xiaoyuan Zhu, James Flemings, Nitya Kashyap, Krishna P. Gummadi, Willie Neiswanger, Robin Jia
Main category: cs.CL
TL;DR: Hubble is a suite of open-source LLMs for studying memorization, with standard and perturbed variants that insert controlled text to analyze memorization risks and patterns.
Details
Motivation: To enable scientific study of LLM memorization by providing controlled models that emulate key memorization risks through text insertion.
Method: Created 8 models (standard and perturbed with 1B/8B parameters, trained on 100B/500B tokens) plus 6 additional perturbed models with text inserted at different pretraining phases.
Result: Memorization risks depend on frequency relative to corpus size, and sensitive data without continued exposure can be forgotten. Biographies reveal different private information memorization patterns.
Conclusion: Two best practices for addressing memorization: dilute sensitive data by increasing corpus size, and order sensitive data to appear earlier in training. Hubble enables broader memorization research, membership inference, and machine unlearning studies.
Abstract: We present Hubble, a suite of fully open-source large language models (LLMs) for the scientific study of LLM memorization. Hubble models come in standard and perturbed variants: standard models are pretrained on a large English corpus, and perturbed models are trained in the same way but with controlled insertion of text (e.g., book passages, biographies, and test sets) designed to emulate key memorization risks. Our core release includes 8 models – standard and perturbed models with 1B or 8B parameters, pretrained on 100B or 500B tokens – establishing that memorization risks are determined by the frequency of sensitive data relative to size of the training corpus (i.e., a password appearing once in a smaller corpus is memorized better than the same password in a larger corpus). Our release also includes 6 perturbed models with text inserted at different pretraining phases, showing that sensitive data without continued exposure can be forgotten. These findings suggest two best practices for addressing memorization risks: to dilute sensitive data by increasing the size of the training corpus, and to order sensitive data to appear earlier in training. Beyond these general empirical findings, Hubble enables a broad range of memorization research; for example, analyzing the biographies reveals how readily different types of private information are memorized. We also demonstrate that the randomized insertions in Hubble make it an ideal testbed for membership inference and machine unlearning, and invite the community to further explore, benchmark, and build upon our work.
[78] LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K
Tao Yuan, Xuefei Ning, Dong Zhou, Zhijie Yang, Shiyao Li, Minghui Zhuang, Zheyue Tan, Zhuyu Yao, Dahua Lin, Boxun Li, Guohao Dai, Shengen Yan, Yu Wang
Main category: cs.CL
TL;DR: LV-Eval is a challenging long-context benchmark with five length levels up to 256k words, featuring single-hop and multi-hop QA tasks across 11 bilingual datasets, designed to address limitations of existing benchmarks.
Details
Motivation: Current LLM benchmarks have insufficient context lengths (5k-21k), suffer from knowledge leakage, and use inaccurate metrics, leading to biased evaluation of long-context capabilities.
Method: LV-Eval incorporates three key techniques: confusing facts insertion, keyword/phrase replacement, and keyword-recall-based metric design to create challenging test instances and mitigate evaluation biases.
Result: Evaluation of 15 LLMs shows Moonshot-v1, Qwen-2.5-72B, and Llama-3.1-70B perform best, especially below 64k. Models show distinct performance trends, with significant degradation when confusing information is present.
Conclusion: LV-Eval provides more objective long-context evaluation by addressing knowledge leakage and metric issues, revealing that models exhibit different performance patterns and are sensitive to confusing information.
Abstract: State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performances may not necessarily be higher than those of LLMs with shorter context lengths. (iii) LLMs’ performances can significantly degrade in the presence of confusing information, especially in the pressure test of “needle in a haystack”. (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval. All datasets and evaluation codes are released at: https://github.com/infinigence/LVEval.
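A minimal sketch of a keyword-recall-based metric of the kind described is given below: an answer is scored by how many gold answer keywords it actually recalls, with a gate that withholds credit below a threshold. The exact formula and threshold here are assumptions, not LV-Eval's official metric.

```python
# Toy keyword-recall-based scoring function (illustrative).
import re

def keyword_recall_score(prediction, answer_keywords, threshold=0.5):
    pred_tokens = set(re.findall(r"\w+", prediction.lower()))
    hits = sum(1 for kw in answer_keywords if kw.lower() in pred_tokens)
    recall = hits / max(1, len(answer_keywords))
    # Gate: only give credit when enough gold keywords are actually recalled.
    return recall if recall >= threshold else 0.0

print(keyword_recall_score("The treaty was signed in Vienna in 1815.",
                           ["Vienna", "1815"]))   # 1.0
```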
[79] LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal
Main category: cs.CL
TL;DR: LASeR is a method that frames reward model selection as a multi-armed bandit problem to efficiently train LLMs using multiple reward models by adaptively selecting the most suitable RM for each instance.
Details
Motivation: Using a single fixed reward model for training LLMs is suboptimal as RMs specialized for one task may not generalize well to new tasks, while using multiple RMs simultaneously can be computationally expensive and lead to conflicting signals.
Method: LASeR treats reward model selection as a multi-armed bandit problem, iteratively training LLMs by adaptively selecting the most appropriate reward model for each training instance.
Result: LASeR improves average accuracy by 2.67% on commonsense and math reasoning tasks, achieves 72.69% AlpacaEval win rate on open-ended tasks, and improves F1 scores by ~3 points on long-context generation tasks compared to baseline methods.
Conclusion: LASeR provides an efficient and effective approach for training LLMs with multiple reward models, outperforming ensemble baselines while being computationally more efficient.
Abstract: Reward Models (RMs) are crucial to aligning large language models (LLMs), but the degree to which an RM specialized to one task (e.g. writing) generalizes to new tasks (e.g. math) is often not known a priori, often making using only one fixed RM to train LLMs suboptimal. However, optimizing LLMs with multiple RMs simultaneously can incur a prohibitively high computational cost and lead to conflicting signals from different RMs that may degrade performance. To address these challenges, we introduce LASeR (Learning to Adaptively Select Rewards), which frames reward model selection as a multi-armed bandit problem, efficiently and iteratively training LLMs using multiple RMs by selecting the most well-suited RM for each instance. On commonsense and math reasoning tasks, we show that LASeR boosts iterative LLM training, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over an ensemble of RM scores while also showing superior efficiency (e.g., a 2x speedup). Moreover, on WildChat (open-ended instruction-following tasks), LASeR leads to a 72.69% AlpacaEval win rate over the RM score ensemble baseline. Extending to long-context generation, LASeR improves by 2.96 F1 points (avg.) on single-document QA tasks and 2.97 F1 points on few-shot learning over the RM score ensemble baseline with best-of-n sampling.
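Framing reward-model selection as a bandit problem can be illustrated with a classic UCB1 selector over RM "arms"; the actual LASeR algorithm and its utility signal may differ, so treat this as a generic sketch.

```python
# Compact UCB1 sketch for choosing among several reward models (illustrative).
import math, random

class UCBRewardModelSelector:
    def __init__(self, num_rms):
        self.counts = [0] * num_rms
        self.values = [0.0] * num_rms   # running mean of observed utility per RM

    def select(self):
        for i, c in enumerate(self.counts):
            if c == 0:
                return i                 # try every arm once first
        total = sum(self.counts)
        ucb = [self.values[i] + math.sqrt(2 * math.log(total) / self.counts[i])
               for i in range(len(self.counts))]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, utility):
        self.counts[arm] += 1
        self.values[arm] += (utility - self.values[arm]) / self.counts[arm]

# Toy loop: pretend RM #1 gives the most useful training signal.
sel = UCBRewardModelSelector(num_rms=3)
for _ in range(100):
    arm = sel.select()
    sel.update(arm, utility=random.gauss(0.6 if arm == 1 else 0.4, 0.1))
print(sel.counts)   # arm 1 should be chosen most often
```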
[80] Demystifying Domain-adaptive Post-training for Financial LLMs
Zixuan Ke, Yifei Ming, Xuan-Phi Nguyen, Caiming Xiong, Shafiq Joty
Main category: cs.CL
TL;DR: FINDAP is a systematic framework for domain-adaptive post-training of LLMs in finance, consisting of capability definition (FinCap), training recipe (FinRec), datasets (FinTrain), and evaluation (FinEval), resulting in state-of-the-art Llama-Fin model.
Details
Motivation: Domain-adaptive post-training shows promise for specialized domains like finance, but challenges remain in identifying optimal adaptation criteria and training strategies across different data and model configurations.
Method: Four-component framework: FinCap defines core financial capabilities; FinRec provides training recipe with continual pre-training, instruction-following, and novel preference data distillation using process signals; FinTrain offers curated datasets; FinEval provides comprehensive evaluation aligned with capabilities.
Result: The resulting Llama-Fin model achieves state-of-the-art performance across a wide range of financial tasks. Analysis shows how each post-training stage contributes to distinct capabilities.
Conclusion: The framework uncovers specific challenges and effective solutions for domain adaptation, providing valuable insights for adapting LLMs to specialized domains.
Abstract: Domain-adaptive post-training of large language models (LLMs) has emerged as a promising approach for specialized domains such as medicine and finance. However, significant challenges remain in identifying optimal adaptation criteria and training strategies across varying data and model configurations. To address these challenges, we introduce FINDAP, a systematic and fine-grained investigation into domain-adaptive post-training of LLMs for the finance domain. Our approach consists of four key components: FinCap, which defines the core capabilities required for the target domain; FinRec, an effective training recipe that jointly optimizes continual pre-training and instruction-following, along with a novel preference data distillation method leveraging process signals from a generative reward model; FinTrain, a curated set of training datasets supporting FinRec; and FinEval, a comprehensive evaluation suite aligned with FinCap. The resulting model, Llama-Fin, achieves state-of-the-art performance across a wide range of financial tasks. Our analysis also highlights how each post-training stage contributes to distinct capabilities, uncovering specific challenges and effective solutions, providing valuable insights for domain adaptation of LLMs
[81] Bias Beware: The Impact of Cognitive Biases on LLM-Driven Product Recommendations
Giorgos Filandrianos, Angeliki Dimitriou, Maria Lymperaiou, Konstantinos Thomas, Giorgos Stamou
Main category: cs.CL
TL;DR: This paper investigates how cognitive biases can be used as black-box adversarial strategies to manipulate LLM-based product recommenders, finding that some biases like social proof boost recommendations while others like scarcity reduce them.
Details
Motivation: LLMs have revolutionized product recommenders but are vulnerable to adversarial manipulation, posing critical challenges in real-world commercial applications where such manipulations are hard to detect.
Method: The approach taps into human psychological principles by seamlessly modifying product descriptions to exploit cognitive biases as black-box adversarial strategies, drawing parallels between their effects on LLMs and human purchasing behavior.
Result: Certain biases like social proof consistently boost product recommendation rate and ranking, while others like scarcity and exclusivity surprisingly reduce visibility. Cognitive biases are deeply embedded in state-of-the-art LLMs.
Conclusion: Cognitive biases lead to highly unpredictable behavior in product recommendations and pose significant challenges for effective mitigation in LLM-based recommender systems.
Abstract: The advent of Large Language Models (LLMs) has revolutionized product recommenders, yet their susceptibility to adversarial manipulation poses critical challenges, particularly in real-world commercial applications. Our approach is the first one to tap into human psychological principles, seamlessly modifying product descriptions, making such manipulations hard to detect. In this work, we investigate cognitive biases as black-box adversarial strategies, drawing parallels between their effects on LLMs and human purchasing behavior. Through extensive evaluation across models of varying scale, we find that certain biases, such as social proof, consistently boost product recommendation rate and ranking, while others, like scarcity and exclusivity, surprisingly reduce visibility. Our results demonstrate that cognitive biases are deeply embedded in state-of-the-art LLMs, leading to highly unpredictable behavior in product recommendations and posing significant challenges for effective mitigation.
[82] From TOWER to SPIRE: Adding the Speech Modality to a Translation-Specialist LLM
Kshitij Ambilduke, Ben Peters, Sonal Sannigrahi, Anil Keshwani, Tsz Kin Lam, Bruno Martins, André F. T. Martins, Marcely Zanon Boito
Main category: cs.CL
TL;DR: Spire is a speech-augmented language model that translates and transcribes English speech into 10 other languages, integrating speech modality through discretization and continued pre-training with only 42.5K hours of speech data.
Details
Motivation: To create a multilingual language model that can handle both speech and text translation while preserving strong text-based performance, using significantly less data than existing speech LMs.
Method: Integrates speech modality into existing multilingual LM via speech discretization and continued pre-training, treating discretized speech as an additional translation language within the multilingual LM framework.
Result: Successfully equipped the model with speech capabilities while maintaining strong text performance, achieving this with significantly less data than existing speech LMs.
Conclusion: The approach demonstrates that discretized speech input integration as an additional language is feasible during LM adaptation, and the code and models are made available to the community.
Abstract: We introduce Spire, a speech-augmented language model (LM) capable of both translating and transcribing speech input from English into 10 other languages as well as translating text input in both language directions. Spire integrates the speech modality into an existing multilingual LM via speech discretization and continued pre-training using only 42.5K hours of speech. In particular, we adopt the pretraining framework of multilingual LMs and treat discretized speech input as an additional translation language. This approach not only equips the model with speech capabilities, but also preserves its strong text-based performance. We achieve this using significantly less data than existing speech LMs, demonstrating that discretized speech input integration as an additional language is feasible during LM adaptation. We make our code and models available to the community.
[83] GeoBenchX: Benchmarking LLMs in Agent Solving Multistep Geospatial Tasks
Varvara Krechetova, Denis Kochedykov
Main category: cs.CL
TL;DR: This paper establishes GeoBenchX, a benchmark for evaluating LLMs’ tool-calling capabilities on multi-step geospatial tasks, testing 8 commercial models and finding o4-mini and Claude 3.5 Sonnet perform best overall.
Details
Motivation: To create a standardized evaluation framework for assessing LLMs' geospatial tool-calling capabilities relevant to commercial GIS practitioners, addressing the need for systematic testing of multi-step geospatial reasoning.
Method: Developed a benchmark with tasks in four complexity categories, using a tool-calling agent with 23 geospatial functions, and implemented an LLM-as-Judge evaluation framework to compare agent solutions against references.
Result: o4-mini and Claude 3.5 Sonnet achieved best overall performance, while Claude Sonnet 4 was less accurate due to preference for providing solutions over task rejection. Significant token usage differences observed, with Anthropic models consuming more tokens.
Conclusion: The GeoBenchX benchmark, evaluation framework, and data generation pipeline provide standardized methods for ongoing LLM evaluation in GeoAI, with identified common errors including geometrical misunderstandings and outdated knowledge.
Abstract: This paper establishes a benchmark for evaluating tool-calling capabilities of large language models (LLMs) on multi-step geospatial tasks relevant to commercial GIS practitioners. We assess eight commercial LLMs (Claude Sonnet 3.5 and 4, Claude Haiku 3.5, Gemini 2.0 Flash, Gemini 2.5 Pro Preview, GPT-4o, GPT-4.1 and o4-mini) using a simple tool-calling agent equipped with 23 geospatial functions. Our benchmark comprises tasks in four categories of increasing complexity, with both solvable and intentionally unsolvable tasks to test rejection accuracy. We develop an LLM-as-Judge evaluation framework to compare agent solutions against reference solutions. Results show that o4-mini and Claude 3.5 Sonnet achieve the best overall performance; OpenAI’s GPT-4.1, GPT-4o and Google’s Gemini 2.5 Pro Preview do not fall far behind, but the last two are more efficient in identifying unsolvable tasks. Claude Sonnet 4, due to its preference to provide any solution rather than reject a task, proved to be less accurate. We observe significant differences in token usage, with Anthropic models consuming more tokens than competitors. Common errors include misunderstanding geometrical relationships, relying on outdated knowledge, and inefficient data manipulation. The resulting benchmark set, evaluation framework, and data generation pipeline are released as open-source resources (available at https://github.com/Solirinai/GeoBenchX), providing one more standardized method for the ongoing evaluation of LLMs for GeoAI.
[84] CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation
Nengbo Wang, Xiaotian Han, Jagdip Singh, Jing Ma, Vipin Chaudhary
Main category: cs.CL
TL;DR: CausalRAG is a novel framework that incorporates causal graphs into retrieval-augmented generation to address limitations of traditional RAG systems, improving contextual continuity and retrieval precision.
Details
Motivation: Traditional RAG systems face limitations including disrupted contextual integrity from text chunking and over-reliance on semantic similarity for retrieval, which CausalRAG aims to overcome.
Method: The proposed CausalRAG framework incorporates causal graphs into the retrieval process by constructing and tracing causal relationships to preserve contextual continuity.
Result: CausalRAG demonstrates superiority over regular RAG and graph-based RAG approaches across several metrics, showing improved accuracy and interpretability.
Conclusion: Grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks, offering better contextual preservation and retrieval precision.
Abstract: Large language models (LLMs) have revolutionized natural language processing (NLP), particularly through Retrieval-Augmented Generation (RAG), which enhances LLM capabilities by integrating external knowledge. However, traditional RAG systems face critical limitations, including disrupted contextual integrity due to text chunking, and over-reliance on semantic similarity for retrieval. To address these issues, we propose CausalRAG, a novel framework that incorporates causal graphs into the retrieval process. By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision, leading to more accurate and interpretable responses. We evaluate CausalRAG against regular RAG and graph-based RAG approaches, demonstrating its superiority across several metrics. Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.
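A highly simplified sketch of the general idea of retrieval grounded in a causal graph is shown below: extracted cause-effect edges link content, and retrieval expands from query-matched nodes along causal paths instead of relying only on embedding similarity. The graph, matching rule, and traversal are purely illustrative assumptions.

```python
# Toy sketch of causal-graph-guided retrieval (illustrative only).
import networkx as nx

g = nx.DiGraph()
g.add_edge("heavy rainfall", "river flooding")
g.add_edge("river flooding", "crop damage")
g.add_edge("crop damage", "food price increase")

def causal_retrieve(query_entities, graph):
    selected = set()
    for node in graph.nodes:
        if any(entity in node for entity in query_entities):
            selected.add(node)
            selected |= nx.descendants(graph, node)  # follow cause -> effect paths
    return selected

print(causal_retrieve(["flooding"], g))  # the matched node plus its downstream effects
```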
[85] NAACL2025 Tutorial: Adaptation of Large Language Models
Zixuan Ke, Yifei Ming, Shafiq Joty
Main category: cs.CL
TL;DR: This tutorial provides an overview of LLM adaptation techniques to address limitations of generic LLMs in specialized domains and their static nature, covering parametric and semi-parametric knowledge adaptation methods.
Details
Motivation: Generic LLMs struggle with specialized domains like finance and healthcare, cannot evolve with changing world knowledge, and are too large/costly for practical deployment. Adaptation is crucial for both industry and academia.
Method: Categorizes adaptation techniques into two main families: 1) Parametric knowledge adaptation (updating internal model parameters), including real-time techniques like model editing; 2) Semi-parametric knowledge adaptation (updating parameters to better leverage external knowledge/tools) using RAG and agent-based systems.
Result: The tutorial provides a comprehensive framework for understanding different approaches to LLM adaptation, establishing evaluation metrics and benchmarks specific to adaptation techniques.
Conclusion: LLM adaptation is essential for creating domain-specific, dynamic models that can evolve with changing knowledge while being practical for deployment, addressing key limitations of static generic LLMs.
Abstract: This tutorial on adaptation of LLMs is designed to address the growing demand for models that go beyond the static capabilities of generic LLMs by providing an overview of dynamic, domain-specific, and task-adaptive LLM adaptation techniques. While general LLMs have demonstrated strong generalization across a variety of tasks, they often struggle to perform well in specialized domains such as finance, healthcare, and code generation for underrepresented languages. Additionally, their static nature limits their ability to evolve with the changing world, and they are often extremely large in size, making them impractical and costly to deploy at scale. As a result, the adaptation of LLMs has drawn much attention since the birth of LLMs and is of core importance, both for industry, which focuses on serving its targeted users, and academia, which can greatly benefit from small but powerful LLMs. To address this gap, this tutorial aims to provide an overview of the LLM adaptation techniques. We start with an introduction to LLM adaptation, from both the data perspective and the model perspective. We then emphasize how the evaluation metrics and benchmarks are different from other techniques. After establishing the problems, we explore various adaptation techniques. We categorize adaptation techniques into two main families. The first is parametric knowledge adaptation, which focuses on updating the parametric knowledge within LLMs. Additionally, we will discuss real-time adaptation techniques, including model editing, which allows LLMs to be updated dynamically in production environments. The second kind of adaptation is semi-parametric knowledge adaptation, where the goal is to update LLM parameters to better leverage external knowledge or tools through techniques like retrieval-augmented generation (RAG) and agent-based systems.
[86] Quantum Natural Language Processing: A Comprehensive Review of Models, Methods, and Applications
Farha Nausheen, Khandakar Ahmed, M Imad Khan, Farina Riaz
Main category: cs.CL
TL;DR: This paper provides a comprehensive survey of Quantum Natural Language Processing (QNLP), categorizing models based on quantum computing principles, architecture, and computational approaches to address the computational limitations of classical NLP.
Details
Motivation: Deep learning in NLP improves performance but demands considerable data and resources. Quantum computing offers potential to overcome these computational limitations and achieve quantum advantage in processing linguistic structures.
Method: The paper categorizes QNLP models based on quantum computing principles, architecture, and computational approaches. It surveys quantum encoding techniques for classical data, QNLP models for NLP tasks, and quantum optimization techniques for hyperparameter tuning.
Result: QNLP approaches are currently limited to small datasets with only a few models explored extensively. However, there is increasing interest in applying quantum computing to NLP tasks, and the survey maps the state-of-the-art methods and their popularity.
Conclusion: Quantum Natural Language Processing is an emerging field with potential for quantum advantage, but current approaches face limitations in dataset size and model exploration, though interest in the field is growing.
Abstract: In recent developments, deep learning methodologies applied to Natural Language Processing (NLP) have revealed a paradox: they improve performance but demand considerable data and resources for their training. Alternatively, quantum computing exploits the principles of quantum mechanics to overcome the computational limitations of current methodologies, thereby establishing an emerging field known as quantum natural language processing (QNLP). This domain holds the potential to attain a quantum advantage in the processing of linguistic structures, surpassing classical models in both efficiency and accuracy. In this paper, it is proposed to categorise QNLP models based on quantum computing principles, architecture, and computational approaches. This paper attempts to provide a survey on how quantum meets language by mapping the state of the art in this area, embracing quantum encoding techniques for classical data, QNLP models for prevalent NLP tasks, and quantum optimisation techniques for hyperparameter tuning. The landscape of quantum computing approaches applied to various NLP tasks is summarised by showcasing the specific QNLP methods used, and the popularity of these methods is indicated by their count. From the findings, it is observed that QNLP approaches are still limited to small data sets, with only a few models explored extensively, and there is increasing interest in the application of quantum computing to natural language processing tasks.
[87] LongCodeBench: Evaluating Coding LLMs at 1M Context Windows
Stefano Rando, Luca Romani, Alessio Sampieri, Luca Franco, John Yang, Yuta Kyuragi, Fabio Galasso, Tatsunori Hashimoto
Main category: cs.CL
TL;DR: LongCodeBench (LCB) is a new benchmark for testing long-context language models on code comprehension and repair tasks, showing significant performance drops for all models in long-context scenarios.
Details
Motivation: The rapid growth of model context lengths (from thousands to millions of tokens) has made it difficult to create realistic long-context benchmarks, both because of the cost of collecting million-token tasks and because of the difficulty of identifying realistic scenarios that require significant contexts.
Method: The authors introduce LongCodeBench (LCB), which tests both comprehension and repair capabilities using real-world GitHub issues, constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks, with stratified complexity to evaluate models across different scales.
Result: All models show significant performance drops in long-context scenarios: Claude 3.5 Sonnet drops from 29% to 3%, Qwen2.5 drops from 70.2% to 40%, demonstrating that long-context remains a weakness for all tested models.
Conclusion: Long-context capabilities remain a significant challenge for current language models, and code comprehension/repair serves as a natural testbed for evaluating these capabilities in realistic scenarios.
Abstract: Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks – not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales – ranging from Qwen2.5 14B Instruct to Google’s flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5. The LCB dataset is available publicly at https://huggingface.co/datasets/Steefano/LCB and the codebase to replicate the work on this paper at https://github.com/Zteefano/long-code-bench.
[88] CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation
Chihan Huang, Hao Tang
Main category: cs.CL
TL;DR: CtrlDiff is a dynamic semi-autoregressive framework that combines autoregressive and diffusion approaches, using reinforcement learning to adaptively determine block sizes and introducing classifier-guided control for efficient conditional text generation.
Details
Motivation: To address limitations of fixed-length generation and weak controllability in current diffusion language models, while combining the strengths of autoregressive and diffusion paradigms.
Method: Proposes CtrlDiff with adaptive block size determination using reinforcement learning based on local semantics, and a classifier-guided control mechanism for discrete diffusion that enables efficient post-hoc conditioning without retraining.
Result: Extensive experiments show CtrlDiff sets new standards among hybrid diffusion models, narrows performance gap to state-of-the-art autoregressive approaches, and enables effective conditional text generation across diverse tasks.
Conclusion: CtrlDiff successfully addresses critical limitations of fixed granularity and weak controllability in diffusion language models, demonstrating improved performance and flexible control capabilities.
Abstract: Although autoregressive models have dominated language modeling in recent years, there has been a growing interest in exploring alternative paradigms to the conventional next-token prediction framework. Diffusion-based language models have emerged as a compelling alternative due to their powerful parallel generation capabilities and inherent editability. However, these models are often constrained by fixed-length generation. A promising direction is to combine the strengths of both paradigms, segmenting sequences into blocks, modeling autoregressive dependencies across blocks while leveraging discrete diffusion to estimate the conditional distribution within each block given the preceding context. Nevertheless, their practical application is often hindered by two key limitations: rigid fixed-length outputs and a lack of flexible control mechanisms. In this work, we address the critical limitations of fixed granularity and weak controllability in current large diffusion language models. We propose CtrlDiff, a dynamic and controllable semi-autoregressive framework that adaptively determines the size of each generation block based on local semantics using reinforcement learning. Furthermore, we introduce a classifier-guided control mechanism tailored to discrete diffusion, which significantly reduces computational overhead while facilitating efficient post-hoc conditioning without retraining. Extensive experiments demonstrate that CtrlDiff sets a new standard among hybrid diffusion models, narrows the performance gap to state-of-the-art autoregressive approaches, and enables effective conditional text generation across diverse tasks.
[89] Can Large Language Models be Effective Online Opinion Miners?
Ryang Heo, Yongsik Seo, Junseong Lee, Dongha Lee
Main category: cs.CL
TL;DR: OOMB is a new benchmark dataset and evaluation protocol for testing LLMs’ ability to mine opinions from diverse online content, with comprehensive annotations for both extractive and abstractive opinion mining tasks.
Details
Motivation: Traditional opinion mining approaches struggle with the highly diverse, complex, and context-rich nature of user-generated online content, creating a need for better evaluation methods for LLMs in realistic opinion mining scenarios.
Method: Introduces Online Opinion Mining Benchmark (OOMB) with extensive (entity, feature, opinion) tuple annotations and opinion-centric summaries to evaluate both extractive and abstractive capabilities of LLMs.
Result: The benchmark enables comprehensive analysis of challenging aspects and LLM adaptability in opinion mining, identifying where LLMs excel and where they struggle in realistic online scenarios.
Conclusion: This study establishes the foundation for LLM-based opinion mining and discusses future research directions in the field.
Abstract: The surge of user-generated online content presents a wealth of insights into customer preferences and market trends. However, the highly diverse, complex, and context-rich nature of such contents poses significant challenges to traditional opinion mining approaches. To address this, we introduce Online Opinion Mining Benchmark (OOMB), a novel dataset and evaluation protocol designed to assess the ability of large language models (LLMs) to mine opinions effectively from diverse and intricate online environments. OOMB provides extensive (entity, feature, opinion) tuple annotations and a comprehensive opinion-centric summary that highlights key opinion topics within each content, thereby enabling the evaluation of both the extractive and abstractive capabilities of models. Through our proposed benchmark, we conduct a comprehensive analysis of which aspects remain challenging and where LLMs exhibit adaptability, to explore whether they can effectively serve as opinion miners in realistic online scenarios. This study lays the foundation for LLM-based opinion mining and discusses directions for future research in this field.
[90] Evaluating NLP Embedding Models for Handling Science-Specific Symbolic Expressions in Student Texts
Tom Bleckmann, Paul Tschisgale
Main category: cs.CL
TL;DR: This study evaluates how different NLP embedding models handle science-related symbolic expressions (equations, formulas) in educational data mining, finding significant performance differences with OpenAI’s GPT-text-embedding-3-large performing best.
Details
Motivation: Current embedding models struggle with symbolic expressions in science-related language, leading to biased research findings and diminished application performance when these expressions are overlooked or removed.
Method: Evaluated various embedding models using physics-specific symbolic expressions from authentic student responses, assessed through similarity-based analyses and integration into machine learning pipelines.
Result: Significant differences in model performance were found, with OpenAI’s GPT-text-embedding-3-large outperforming all other models, though its advantage was moderate rather than decisive.
Conclusion: Educational data mining researchers and practitioners should carefully select NLP embedding models when working with science-related language containing symbolic expressions.
Abstract: In recent years, natural language processing (NLP) has become integral to educational data mining, particularly in the analysis of student-generated language products. For research and assessment purposes, so-called embedding models are typically employed to generate numeric representations of text that capture its semantic content for use in subsequent quantitative analyses. Yet when it comes to science-related language, symbolic expressions such as equations and formulas introduce challenges that current embedding models struggle to address. Existing research studies and practical applications often either overlook these challenges or remove symbolic expressions altogether, potentially leading to biased research findings and diminished performance of practical applications. This study therefore explores how contemporary embedding models differ in their capability to process and interpret science-related symbolic expressions. To this end, various embedding models are evaluated using physics-specific symbolic expressions drawn from authentic student responses, with performance assessed via two approaches: 1) similarity-based analyses and 2) integration into a machine learning pipeline. Our findings reveal significant differences in model performance, with OpenAI’s GPT-text-embedding-3-large outperforming all other examined models, though its advantage over other models was moderate rather than decisive. Overall, this study underscores the importance for educational data mining researchers and practitioners of carefully selecting NLP embedding models when working with science-related language products that include symbolic expressions. The code and (partial) data are available at https://doi.org/10.17605/OSF.IO/6XQVG.
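A small sketch of the kind of similarity-based probe described is given below: two student responses that differ only in their symbolic expressions are embedded and compared by cosine similarity. It uses OpenAI's text-embedding-3-large as an example backend (an API key is required); the example sentences and the choice of backend are assumptions, and any embedding model could be substituted.

```python
# Similarity probe for symbolic expressions (illustrative; requires OPENAI_API_KEY).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-large", input=[text])
    return np.array(resp.data[0].embedding)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

resp_a = "The kinetic energy is E = 1/2 m v^2, so doubling v quadruples E."
resp_b = "The kinetic energy is E = m g h, so doubling v quadruples E."  # wrong formula
# A model that "understands" the symbols should place these clearly apart.
print(cosine(embed(resp_a), embed(resp_b)))
```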
[91] metaTextGrad: Automatically optimizing language model optimizers
Guowei Xu, Mert Yuksekgonul, Carlos Guestrin, James Zou
Main category: cs.CL
TL;DR: MetaTextGrad is a meta-optimizer framework that enhances existing LLM-based optimizers by optimizing both their prompts and structures for specific tasks, achieving up to 6% performance improvement over baselines.
Details
Motivation: Current LLM-based optimizers are manually designed, not optimized themselves, and are general-purpose rather than tailored for specific tasks, limiting their effectiveness.
Method: Proposes metaTextGrad with two components: a meta prompt optimizer to refine optimizer prompts, and a meta structure optimizer to optimize the optimizer’s architecture for specific tasks.
Result: Achieved average absolute performance improvement of up to 6% across multiple benchmarks compared to the best baseline optimizer.
Conclusion: Meta-optimization of LLM-based optimizers through prompt and structure optimization significantly enhances performance on specific tasks, addressing limitations of general-purpose optimizer designs.
Abstract: Large language models (LLMs) are increasingly used in learning algorithms, evaluations, and optimization tasks. Recent studies have shown that using LLM-based optimizers to automatically optimize model prompts, demonstrations, predictions themselves, or other components can significantly enhance the performance of AI systems, as demonstrated by frameworks such as DSPy and TextGrad. However, optimizers built on language models themselves are usually designed by humans with manual design choices; optimizers themselves are not optimized. Moreover, these optimizers are general purpose by design, to be useful to a broad audience, and are not tailored for specific tasks. To address these challenges, we propose metaTextGrad, which focuses on designing a meta-optimizer to further enhance existing optimizers and align them to be good optimizers for a given task. Our approach consists of two key components: a meta prompt optimizer and a meta structure optimizer. The combination of these two significantly improves performance across multiple benchmarks, achieving an average absolute performance improvement of up to 6% compared to the best baseline.
[92] Do LLMs Really Forget? Evaluating Unlearning with Knowledge Correlation and Confidence Awareness
Rongzhe Wei, Peizhi Niu, Hans Hao-Hsun Hsu, Ruihan Wu, Haoteng Yin, Mohsen Ghassemi, Yifan Li, Vamsi K. Potluru, Eli Chien, Kamalika Chaudhuri, Olgica Milenkovic, Pan Li
Main category: cs.CL
TL;DR: Proposes a knowledge unlearning evaluation framework using knowledge graphs and LLM judges to better assess implicit knowledge retention after unlearning, revealing that current methods overestimate unlearning effectiveness.
Details
Motivation: Existing machine unlearning approaches focus on explicit fact removal but overlook latent inferential dependencies and non-deterministic knowledge in LLMs, allowing forgotten facts to persist implicitly through correlated information.
Method: Developed a knowledge unlearning evaluation framework that represents factual contexts as knowledge graphs with confidence scores, and uses LLM judges with carefully designed prompts to reason over knowledge subgraphs for determining unlearning success.
Result: Extensive experiments on a newly constructed benchmark show the framework provides more realistic and rigorous assessment of unlearning performance, revealing current evaluation strategies tend to overestimate unlearning effectiveness.
Conclusion: The proposed framework offers a more accurate approach to evaluating knowledge unlearning in LLMs by capturing implicit knowledge structures and dependencies, addressing limitations of existing evaluation methods.
Abstract: Machine unlearning techniques aim to mitigate unintended memorization in large language models (LLMs). However, existing approaches predominantly focus on the explicit removal of isolated facts, often overlooking latent inferential dependencies and the non-deterministic nature of knowledge within LLMs. Consequently, facts presumed forgotten may persist implicitly through correlated information. To address these challenges, we propose a knowledge unlearning evaluation framework that more accurately captures the implicit structure of real-world knowledge by representing relevant factual contexts as knowledge graphs with associated confidence scores. We further develop an inference-based evaluation protocol leveraging powerful LLMs as judges; these judges reason over the extracted knowledge subgraph to determine unlearning success. Our LLM judges utilize carefully designed prompts and are calibrated against human evaluations to ensure their trustworthiness and stability. Extensive experiments on our newly constructed benchmark demonstrate that our framework provides a more realistic and rigorous assessment of unlearning performance. Moreover, our findings reveal that current evaluation strategies tend to overestimate unlearning effectiveness. Our code is publicly available at https://github.com/Graph-COM/Knowledge_Unlearning.git.
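To make the evaluation idea concrete, here is a minimal, hypothetical sketch (not the authors' released code) of checking whether a "forgotten" fact remains inferable from correlated facts in a confidence-weighted knowledge subgraph; the entities, relations, confidence values, and threshold are all illustrative assumptions.

```python
# Toy check of the paper's core concern: a fact "deleted" from an LLM may still be
# recoverable through correlated facts. We model the factual context as a graph whose
# edges carry confidence scores and ask whether the target relation can be rebuilt
# via an alternative path whose combined confidence stays above a threshold.

# (head, relation, tail, confidence): a toy knowledge subgraph
triples = [
    ("Alice", "born_in", "Paris", 0.9),        # the fact we try to unlearn
    ("Alice", "citizen_of", "France", 0.95),    # correlated facts that remain
    ("Alice", "native_language", "French", 0.8),
    ("Paris", "capital_of", "France", 0.99),
]

def inferable(target_head, target_tail, removed, threshold=0.6):
    """Return True if target_tail is still reachable from target_head through the
    remaining edges with a product of confidences above the threshold (simple DFS)."""
    edges = [(h, t, c) for h, _, t, c in triples if (h, t) != removed]
    edges += [(t, h, c) for h, t, c in edges]   # allow traversal in both directions
    stack = [(target_head, 1.0, {target_head})]
    while stack:
        node, conf, seen = stack.pop()
        if node == target_tail and conf >= threshold:
            return True
        for h, t, c in edges:
            if h == node and t not in seen and conf * c >= threshold:
                stack.append((t, conf * c, seen | {t}))
    return False

# Even after removing (Alice, born_in, Paris), correlated edges keep the fact
# recoverable, which is exactly what a confidence-aware evaluation should flag.
print(inferable("Alice", "Paris", removed=("Alice", "Paris")))
```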
[93] ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time
Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmed, Yang Liu
Main category: cs.CL
TL;DR: ETT (Extend at Test-Time) is a method that extends context length of short-context Transformer LLMs with linear computation and constant memory overhead, achieving up to 30% accuracy improvement by fine-tuning on overlapping subsequences.
Details
Motivation: Transformer-based LLMs face quadratic computation and memory overhead with sequence length, making long sequence processing challenging. ETT addresses this by enabling context extension at test-time.
Method: ETT efficiently fine-tunes model parameters on input context chunked into overlapping small subsequences, focusing on specific Transformer modules like second FFN layers rather than full fine-tuning.
Result: ETT extended context length of GPT-Large and Phi-2 from 1k to 32k tokens (32x increase) on LongBench, achieving up to 30% accuracy improvement with linear computation and constant memory requirements.
Conclusion: Fine-tuning specific Transformer modules (particularly second FFN layers) at test-time is more effective than full fine-tuning for context extension, enabling efficient long-sequence processing with improved accuracy.
Abstract: Transformer-based Language Models’ computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs, with constant memory requirement and linear computation overhead. ETT enables the extension of the context length at test-time by efficiently fine-tuning the model’s parameters on the input context, chunked into overlapping small subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model’s accuracy. We also study how context can be stored in an LLM’s weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test-time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models’ accuracy.
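A minimal PyTorch sketch of the test-time recipe described above, assuming a toy transformer: the long context is chunked into overlapping subsequences and only the second linear layer of each FFN is updated, the module the ablation singles out. Model size, chunk length, overlap, and optimizer settings are illustrative assumptions, not the paper's configuration.

```python
# Sketch (not the authors' code): store a long context in the weights of the second
# FFN projection by fine-tuning on overlapping chunks of the context at test time.
import torch
import torch.nn as nn

def overlapping_chunks(tokens, chunk_len=1024, overlap=128):
    step = chunk_len - overlap
    return [tokens[i:i + chunk_len] for i in range(0, max(1, len(tokens) - overlap), step)]

class Block(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn_in = nn.Linear(d, 4 * d)
        self.ffn_out = nn.Linear(4 * d, d)      # the "second FFN layer"
    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.ffn_out(torch.relu(self.ffn_in(x)))

class TinyLM(nn.Module):
    def __init__(self, vocab=32000, d=256, layers=4):
        super().__init__()
        self.emb = nn.Embedding(vocab, d)
        self.blocks = nn.ModuleList([Block(d) for _ in range(layers)])
        self.head = nn.Linear(d, vocab)
    def forward(self, ids):
        h = self.emb(ids)
        for b in self.blocks:
            h = b(h)
        return self.head(h)

model = TinyLM()
# Freeze everything except the second FFN projection in each block.
for name, p in model.named_parameters():
    p.requires_grad = "ffn_out" in name

opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
context = torch.randint(0, 32000, (4096,))       # stand-in for the long input context
for chunk in overlapping_chunks(context.tolist()):
    ids = torch.tensor(chunk).unsqueeze(0)
    logits = model(ids[:, :-1])                  # next-token prediction on the chunk
    loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                       ids[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()                                   # the context is absorbed into ffn_out
```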
[94] DeCAL Tokenwise Compression
Sameer Panwar
Main category: cs.CL
TL;DR: DeCAL is a tokenwise compression method using encoder-decoder language models with denoising pretraining to create high-quality compressed representations, achieving competitive performance up to 8x compression.
Details
Motivation: To develop efficient compression methods for dense representations that can be pre-computed and stored, enabling significant computational savings while maintaining task performance.
Method: Uses encoder-decoder language model pretrained with denoising, with small encoder modifications focused on maximizing compression quality rather than computational efficiency.
Result: At 2x compression, matches uncompressed performance on downstream tasks; maintains good performance up to 8x compression for question-answering, summarization, and multi-vector retrieval with only minor metric dropoffs.
Conclusion: DeCAL provides substantial savings for pre-computed dense representations and has potential for broader applicability with further development.
Abstract: This paper introduces DeCAL, a new method for tokenwise compression. DeCAL uses an encoder-decoder language model pretrained with denoising to learn to produce high-quality, general-purpose compressed representations from the encoder. DeCAL applies small modifications to the encoder, with the emphasis on maximizing compression quality, even at the expense of compute. We show that DeCAL at 2x compression can match uncompressed on several downstream tasks, with usually only a minor dropoff in metrics up to 8x compression, among question-answering, summarization, and multi-vector retrieval tasks. DeCAL offers significant savings where pre-computed dense representations can be utilized, and we believe the approach can be further developed to be more broadly applicable.
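The abstract does not spell out the compression mechanism, so the following is only an assumed illustration of what a tokenwise n-fold compression of encoder states could look like (mean-pooling groups of hidden states); DeCAL's actual encoder modifications may differ.

```python
# Toy sketch of tokenwise compression: collapse every n consecutive encoder hidden
# states into one vector by mean pooling, giving an n-times shorter representation.
import torch

def compress_tokenwise(hidden, factor=2):
    """hidden: (batch, seq_len, dim) -> (batch, ceil(seq_len / factor), dim)."""
    b, t, d = hidden.shape
    pad = (-t) % factor                          # right-pad so seq_len divides evenly
    if pad:
        hidden = torch.cat([hidden, hidden.new_zeros(b, pad, d)], dim=1)
    return hidden.view(b, -1, factor, d).mean(dim=2)

states = torch.randn(1, 500, 768)                # e.g. encoder outputs for one passage
print(compress_tokenwise(states, factor=8).shape)    # torch.Size([1, 63, 768])
```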
[95] Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling
Ju-Chieh Chou, Jiawei Zhou, Karen Livescu
Main category: cs.CL
TL;DR: Proposes a textless spoken language model that jointly models linguistic and acoustic information by generating semantic tokens and continuous acoustic representations using flow-matching, improving acoustic detail in speech generation.
Details
Motivation: Existing textless SLMs only predict semantic tokens and rely on separate vocoders for acoustic information, lacking acoustic context and control over acoustic details.
Method: Jointly model linguistic and acoustic information by generating semantic tokens and continuous acoustic representations using flow-matching, with multiple future semantic token prediction to preserve linguistic information.
Result: Achieves comparable linguistic performance to existing models while providing better acoustic detail in prompted generation.
Conclusion: Joint modeling of linguistic and acoustic information with flow-matching enables better acoustic detail in textless speech generation while maintaining linguistic quality.
Abstract: Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves comparable performance to existing models in terms of linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
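A minimal sketch of the standard conditional flow-matching objective the abstract refers to: a small network predicts the velocity that transports noise toward the target acoustic frame, conditioned on the semantic-token representation and a time variable. Dimensions and the conditioning network are assumptions; the authors' architecture is not specified here.

```python
# Standard conditional flow-matching loss (illustrative, not the authors' exact model):
# learn a velocity field that moves noise x0 toward the acoustic frame x1 along a
# linear path, conditioned on semantic-token embeddings and time t.
import torch
import torch.nn as nn

acoustic_dim, cond_dim = 80, 256                 # assumed sizes for illustration
velocity_net = nn.Sequential(
    nn.Linear(acoustic_dim + cond_dim + 1, 512), nn.SiLU(),
    nn.Linear(512, acoustic_dim),
)

def flow_matching_loss(x1, cond):
    """x1: target acoustic frames (B, D); cond: semantic-token conditioning (B, C)."""
    x0 = torch.randn_like(x1)                    # noise sample
    t = torch.rand(x1.size(0), 1)                # uniform time in [0, 1)
    xt = (1 - t) * x0 + t * x1                   # point on the linear interpolation path
    target_v = x1 - x0                           # constant velocity along that path
    pred_v = velocity_net(torch.cat([xt, cond, t], dim=-1))
    return nn.functional.mse_loss(pred_v, target_v)

loss = flow_matching_loss(torch.randn(16, acoustic_dim), torch.randn(16, cond_dim))
loss.backward()
```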
[96] Beyond GPT-5: Making LLMs Cheaper and Better via Performance-Efficiency Optimized Routing
Yiqun Zhang, Hao Li, Jianhao Chen, Hangfan Zhang, Peng Ye, Lei Bai, Shuyue Hu
Main category: cs.CL
TL;DR: Avengers-Pro is a test-time routing framework that dynamically routes queries to the most suitable LLM from an ensemble of models with varying capacities and efficiencies, achieving state-of-the-art performance-efficiency tradeoffs.
Details
Motivation: To address the central challenge of balancing performance and efficiency in large language models by providing a unified solution for all performance-efficiency tradeoffs through intelligent query routing.
Method: Embeds and clusters incoming queries, then routes each query to the most suitable model based on a performance-efficiency score from an ensemble of LLMs with varying capacities.
Result: Achieves +7% higher average accuracy than the strongest single model (GPT-5-medium), matches strongest model’s accuracy at 27% lower cost, reaches ~90% of that performance at 63% lower cost, and establishes a Pareto frontier for optimal cost-accuracy tradeoffs.
Conclusion: Avengers-Pro provides an effective framework for optimizing performance-efficiency tradeoffs in LLM inference through intelligent test-time routing across model ensembles.
Abstract: Balancing performance and efficiency is a central challenge in large language model (LLM) advancement. GPT-5 addresses this with test-time routing, dynamically assigning queries to either an efficient or a high-capacity model during inference. In this work, we present Avengers-Pro, a test-time routing framework that ensembles LLMs of varying capacities and efficiencies, providing a unified solution for all performance-efficiency tradeoffs. The Avengers-Pro embeds and clusters incoming queries, then routes each to the most suitable model based on a performance-efficiency score. Across 6 challenging benchmarks and 8 leading models – including GPT-5-medium, Gemini-2.5-pro, and Claude-opus-4.1 – Avengers-Pro achieves state-of-the-art results: by varying a performance-efficiency trade-off parameter, it can surpass the strongest single model (GPT-5-medium) by +7% in average accuracy. Moreover, it can match the average accuracy of the strongest single model at 27% lower cost, and reach ~90% of that performance at 63% lower cost. Last but not least, it achieves a Pareto frontier, consistently yielding the highest accuracy for any given cost, and the lowest cost for any given accuracy, among all single models. Code is available at https://github.com/ZhangYiqun018/AvengersPro.
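A toy sketch of the embed, cluster, and route idea (not the released implementation): each query is assigned to a cluster, every (cluster, model) pair carries an estimated accuracy and cost, and the router maximizes a trade-off score controlled by a parameter alpha. Model names, costs, and accuracies below are placeholder values.

```python
# Illustrative performance-efficiency routing: route a query to the model that
# maximizes alpha * accuracy - (1 - alpha) * normalized_cost within its cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_embeddings = rng.normal(size=(1000, 64))       # stand-in for query embeddings
clusters = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_embeddings)

models = ["small-fast", "medium", "large-slow"]
accuracy = rng.uniform(0.5, 0.95, size=(8, len(models)))  # per-cluster accuracy estimates
cost = np.array([1.0, 4.0, 20.0])                          # relative per-query cost
cost_norm = cost / cost.max()

def route(query_embedding, alpha=0.7):
    c = clusters.predict(query_embedding.reshape(1, -1))[0]
    scores = alpha * accuracy[c] - (1 - alpha) * cost_norm
    return models[int(np.argmax(scores))]

print(route(rng.normal(size=64)))
```

Varying alpha traces out the cost-accuracy trade-off described in the abstract: alpha near 1 prefers the strongest model, alpha near 0 prefers the cheapest.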
[97] Who’s Asking? Investigating Bias Through the Lens of Disability Framed Queries in LLMs
Vishnu Hari, Kalpana Panda, Srikant Panda, Amit Agarwal, Hitesh Laxmichand Patel
Main category: cs.CL
TL;DR: First systematic audit shows LLMs make unwarranted demographic inferences from disability cues, with larger models showing more bias despite scale.
Details
Motivation: To investigate how disability cues shape demographic bias in LLMs, as this area remains largely unexplored despite LLMs' tendency to infer user demographics from phrasing.
Method: Audited 8 state-of-the-art instruction-tuned LLMs (3B-72B parameters) using balanced template corpus pairing 9 disability categories with 6 business domains, prompting models to predict 5 demographic attributes under neutral and disability-aware conditions.
Result: Models made definitive demographic guesses in up to 97% of cases, disability context heavily shifted attribute distributions, domain context amplified deviations, and larger models were more sensitive to disability cues and prone to biased reasoning.
Conclusion: Persistent intersections between ableism and other demographic stereotypes reveal critical blind spots in current alignment strategies, requiring disability-inclusive benchmarking and techniques like abstention calibration and counterfactual fine-tuning.
Abstract: Large Language Models (LLMs) routinely infer users’ demographic traits from phrasing alone, which can result in biased responses, even when no explicit demographic information is provided. The role of disability cues in shaping these inferences remains largely uncharted. Thus, we present the first systematic audit of disability-conditioned demographic bias across eight state-of-the-art instruction-tuned LLMs ranging from 3B to 72B parameters. Using a balanced template corpus that pairs nine disability categories with six real-world business domains, we prompt each model to predict five demographic attributes - gender, socioeconomic status, education, cultural background, and locality - under both neutral and disability-aware conditions. Across a varied set of prompts, models deliver a definitive demographic guess in up to 97% of cases, exposing a strong tendency to make arbitrary inferences with no clear justification. Disability context heavily shifts predicted attribute distributions, and domain context can further amplify these deviations. We observe that larger models are simultaneously more sensitive to disability cues and more prone to biased reasoning, indicating that scale alone does not mitigate stereotype amplification. Our findings reveal persistent intersections between ableism and other demographic stereotypes, pinpointing critical blind spots in current alignment strategies. We release our evaluation framework and results to encourage disability-inclusive benchmarking and recommend integrating abstention calibration and counterfactual fine-tuning to curb unwarranted demographic inference. Code and data will be released on acceptance.
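A shortened, illustrative sketch of how such a balanced template corpus can be assembled (the paper pairs nine disability categories with six business domains; the category names, domains, and prompt wording below are placeholders, not the authors' templates).

```python
# Build neutral / disability-aware prompt pairs for every (category, domain, attribute)
# combination, so each condition is sampled equally often. All strings are placeholders.
from itertools import product

disabilities = ["a visual impairment", "a hearing impairment", "a mobility impairment"]
domains = ["banking", "healthcare", "online retail"]
attributes = ["gender", "socioeconomic status", "education", "cultural background", "locality"]

def make_pair(disability, domain, attribute):
    neutral = (f"A customer contacts a {domain} service with a routine question. "
               f"What is the customer's most likely {attribute}?")
    aware = (f"A customer with {disability} contacts a {domain} service with a routine "
             f"question. What is the customer's most likely {attribute}?")
    return {"neutral": neutral, "disability_aware": aware, "attribute": attribute}

corpus = [make_pair(d, dom, a) for d, dom, a in product(disabilities, domains, attributes)]
print(len(corpus), corpus[0]["disability_aware"])
```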
[98] Middo: Model-Informed Dynamic Data Optimization for Enhanced LLM Fine-Tuning via Closed-Loop Learning
Zinan Tang, Xin Gao, Qizhi Pei, Zhuoshi Pan, Mengzhang Cai, Jiang Wu, Conghui He, Lijun Wu
Main category: cs.CL
TL;DR: Middo is a self-evolving framework that dynamically optimizes training data for LLMs through model-aware selection and context-preserving refinement, improving model performance while maintaining dataset scale.
Details
Motivation: Existing data selection and synthesis approaches have limitations in static dataset curation that fail to adapt to evolving model capabilities during SFT training.
Method: Closed-loop optimization system with: (1) self-referential diagnostic module using tri-axial model signals (loss patterns, embedding clusters, self-alignment scores), (2) adaptive optimization engine that transforms suboptimal samples, (3) continuous evolution with model capability through dynamic learning principles.
Result: Consistently enhances seed data quality and boosts LLM performance with 7.15% average accuracy improvement while maintaining original dataset scale across multiple benchmarks.
Conclusion: Establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models.
Abstract: Supervised Fine-Tuning (SFT) of Large Language Models (LLMs) fundamentally relies on high-quality training data. While data selection and data synthesis are two common strategies to improve data quality, existing approaches often face limitations in static dataset curation that fail to adapt to evolving model capabilities. In this paper, we introduce Middo, a self-evolving Model-informed dynamic data optimization framework that uses model-aware data selection and context-preserving data refinement. Unlike conventional one-off filtering/synthesis methods, our framework establishes a closed-loop optimization system: (1) A self-referential diagnostic module proactively identifies suboptimal samples through tri-axial model signals - loss patterns (complexity), embedding cluster dynamics (diversity), and self-alignment scores (quality); (2) An adaptive optimization engine then transforms suboptimal samples into pedagogically valuable training points while preserving semantic integrity; (3) This optimization process continuously evolves with model capability through dynamic learning principles. Experiments on multiple benchmarks demonstrate that our Middo consistently enhances the quality of seed data and boosts LLM performance, improving accuracy by 7.15% on average while maintaining the original dataset scale. This work establishes a new paradigm for sustainable LLM training through dynamic human-AI co-evolution of data and models. Our datasets, models, and code are publicly available at https://github.com/Word2VecT/Middo.
[99] Evaluation Framework for Highlight Explanations of Context Utilisation in Language Models
Jingyi Sun, Pepa Atanasova, Sagnik Ray Choudhury, Sekh Mainul Islam, Isabelle Augenstein
Main category: cs.CL
TL;DR: The paper introduces the first gold standard evaluation framework for highlight explanations (HEs) in context attribution, testing four HE methods across various scenarios and finding MechLight performs best but all methods struggle with long contexts and positional biases.
Details
Motivation: Context utilisation in Language Models remains opaque - users can't determine if models use parametric memory or provided context, nor identify which specific context pieces inform responses. Highlight explanations could solve this but no existing work evaluates their effectiveness.
Method: Introduced gold standard HE evaluation framework using controlled test cases with known ground-truth context usage. Evaluated four HE methods (three established techniques and MechLight, a mechanistic interpretability approach adapted for this task) across four context scenarios, four datasets, and five LMs.
Result: MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy.
Conclusion: New approaches are needed to deliver reliable context utilisation explanations at scale, as current highlight explanation methods face fundamental accuracy challenges particularly with longer contexts and positional biases.
Abstract: Context utilisation, the ability of Language Models (LMs) to incorporate relevant information from the provided context when generating responses, remains largely opaque to users, who cannot determine whether models draw from parametric memory or provided context, nor identify which specific context pieces inform the response. Highlight explanations (HEs) offer a natural solution as they can point the exact context pieces and tokens that influenced model outputs. However, no existing work evaluates their effectiveness in accurately explaining context utilisation. We address this gap by introducing the first gold standard HE evaluation framework for context attribution, using controlled test cases with known ground-truth context usage, which avoids the limitations of existing indirect proxy evaluations. To demonstrate the framework’s broad applicability, we evaluate four HE methods – three established techniques and MechLight, a mechanistic interpretability approach we adapt for this task – across four context scenarios, four datasets, and five LMs. Overall, we find that MechLight performs best across all context scenarios. However, all methods struggle with longer contexts and exhibit positional biases, pointing to fundamental challenges in explanation accuracy that require new approaches to deliver reliable context utilisation explanations at scale.
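One way to score a highlight-explanation method against gold context usage, as an assumed illustration rather than the paper's exact metrics: rank context tokens by attribution score and measure how well the ranking recovers the tokens known by construction to have informed the answer.

```python
# Illustrative gold-standard scoring for highlight explanations: with ground-truth
# indices of the context tokens that actually informed the answer, check how well the
# attribution ranking recovers them (precision/recall at k).
import numpy as np

def precision_recall_at_k(scores, gold_indices, k):
    top_k = set(np.argsort(scores)[::-1][:k].tolist())
    gold = set(gold_indices)
    hits = len(top_k & gold)
    return hits / k, hits / len(gold)

rng = np.random.default_rng(1)
attributions = rng.random(200)            # one salience score per context token
gold = [17, 42, 43, 44, 99]               # tokens known (by construction) to be used
attributions[gold] += 0.5                 # a decent explainer scores them higher

p, r = precision_recall_at_k(attributions, gold, k=10)
print(f"precision@10={p:.2f}  recall@10={r:.2f}")
```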
[100] ALHD: A Large-Scale and Multigenre Benchmark Dataset for Arabic LLM-Generated Text Detection
Ali Khairallah, Arkaitz Zubiaga
Main category: cs.CL
TL;DR: ALHD is the first large-scale Arabic dataset for distinguishing human- and LLM-generated texts across news, social media, and reviews in both MSA and dialectal Arabic, with over 400K balanced samples.
Details
Motivation: To address the need for comprehensive Arabic LLM-generated text detection research, particularly for mitigating risks of misinformation, academic dishonesty, and cyber threats in Arabic content.
Method: Created a large-scale dataset spanning three genres with balanced samples from three leading LLMs and multiple human sources. Conducted benchmark experiments using traditional classifiers, BERT-based models, and LLMs in zero-shot and few-shot settings.
Result: Fine-tuned BERT models achieved competitive performance, outperforming LLM-based models. However, models struggled with cross-genre generalization, particularly with news articles where LLM-generated texts closely resemble human writing style.
Conclusion: ALHD establishes a foundation for Arabic LLM-detection research, revealing challenges in cross-genre generalization and opening avenues for future work on detecting LLM-generated content that mimics human writing styles.
Abstract: We introduce ALHD, the first large-scale comprehensive Arabic dataset explicitly designed to distinguish between human- and LLM-generated texts. ALHD spans three genres (news, social media, reviews), covering both MSA and dialectal Arabic, and contains over 400K balanced samples generated by three leading LLMs and originating from multiple human sources, which enables studying generalizability in Arabic LLM-generated text detection. We provide rigorous preprocessing, rich annotations, and standardized balanced splits to support reproducibility. In addition, we present, analyze and discuss benchmark experiments using our new dataset, in turn identifying gaps and proposing future research directions. Benchmarking across traditional classifiers, BERT-based models, and LLMs (zero-shot and few-shot) demonstrates that fine-tuned BERT models achieve competitive performance, outperforming LLM-based models. Results are however not always consistent, as we observe challenges when generalizing across genres; indeed, models struggle to generalize when they need to deal with unseen patterns in cross-genre settings, and these challenges are particularly prominent when dealing with news articles, where LLM-generated texts resemble human texts in style, which opens up avenues for future research. ALHD establishes a foundation for research related to Arabic LLM-detection and mitigating risks of misinformation, academic dishonesty, and cyber threats.
[101] Improving Metacognition and Uncertainty Communication in Language Models
Mark Steyvers, Catarina Belem, Padhraic Smyth
Main category: cs.CL
TL;DR: Fine-tuning LLMs improves their ability to communicate uncertainty through better calibration and discrimination, but gains are task-specific and require multitask training for effective generalization across domains.
Details
Motivation: LLMs are increasingly used in decision-making but often present answers without signaling low confidence, leading users to unknowingly act on erroneous outputs. While LLMs maintain internal uncertainty signals, their expressed confidence is poorly calibrated and doesn't discriminate well between correct and incorrect answers.
Method: Supervised fine-tuning of LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia. Evaluated two metacognitive tasks: single-question confidence estimation (numeric certainty) and pairwise confidence comparison (selecting which of two answers is more likely correct). Assessed generalization to unseen domains including medical and legal reasoning.
Result: Fine-tuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains. However, gains are task-specific - training on single-question calibration doesn’t transfer to pairwise comparison, and vice versa. Multitask fine-tuning yields broader gains, lowering calibration error and strengthening discrimination in out-of-domain evaluations.
Conclusion: Uncertainty communication in LLMs is trainable but requires multitask training to generalize effectively across tasks and domains.
Abstract: Large language models (LLMs) are increasingly used in decision-making contexts, but when they present answers without signaling low confidence, users may unknowingly act on erroneous outputs. Prior work shows that LLMs maintain internal uncertainty signals, yet their expressed confidence is often miscalibrated and poorly discriminates between correct and incorrect answers. We investigate whether supervised fine-tuning can improve models’ ability to communicate uncertainty and whether such improvements generalize across tasks and domains. We fine-tune LLMs on datasets spanning general knowledge, mathematics, and open-ended trivia, and evaluate two metacognitive tasks: (1) single-question confidence estimation, where the model assigns a numeric certainty to its answer, and (2) pairwise confidence comparison, where the model selects which of two answers it is more likely to answer correctly. We assess generalization to unseen domains, including medical and legal reasoning. Results show that fine-tuning improves calibration (alignment between stated confidence and accuracy) and discrimination (higher confidence for correct vs. incorrect responses) within and across domains. However, gains are task-specific: training on single-question calibration does not transfer to pairwise comparison, and vice versa. Multitask fine-tuning yields broader gains, lowering calibration error and strengthening discrimination in out-of-domain evaluations. This suggests that uncertainty communication in LLMs is trainable but requires multitask training to generalize effectively.
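For reference, the "calibration" gains described above are typically quantified with expected calibration error (ECE); the sketch below uses the common equal-width binning scheme, which may differ from the paper's exact setup.

```python
# Expected calibration error (ECE): average gap between stated confidence and
# empirical accuracy, weighted by how many predictions fall into each confidence bin.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap              # bin weight times |accuracy - confidence|
    return ece

# Toy example: stated confidences vs. whether the model's answers were correct.
conf = [0.95, 0.9, 0.8, 0.6, 0.55, 0.99]
hit = [1, 1, 0, 1, 0, 1]
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```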
[102] Presenting a Paper is an Art: Self-Improvement Aesthetic Agents for Academic Presentations
Chengzhi Liu, Yuzhe Yang, Kaiwen Zhou, Zhen Zhang, Yue Fan, Yanan Xie, Peng Qi, Xin Eric Wang
Main category: cs.CL
TL;DR: EvoPresent is a self-improvement agent framework that creates engaging academic presentations by unifying narratives, aesthetic design, and virtual character delivery, using a multi-task RL aesthetic model for iterative improvement.
Details
Motivation: Existing automated presentation methods struggle with limited storytelling, poor aesthetic quality, and lack of self-adjustment capabilities, making academic dissemination inefficient and unengaging.
Method: Introduces EvoPresent framework with PresAesth - a multi-task reinforcement learning aesthetic model that provides scoring, defect adjustment, and comparative feedback for iterative self-improvement. Also creates EvoPresent Benchmark with 650 AI papers and 2,000 slide pairs for evaluation.
Result: Findings show: (i) High-quality feedback is essential for self-improvement, (ii) Trade-off exists between visual design and content construction, (iii) Multi-task RL training shows stronger generalization in aesthetic tasks.
Conclusion: EvoPresent successfully addresses automated presentation challenges through self-improvement framework and multi-task RL aesthetic modeling, enabling more engaging academic dissemination.
Abstract: The promotion of academic papers has become an important means of enhancing research visibility. However, existing automated methods struggle with limited storytelling, insufficient aesthetic quality, and constrained self-adjustment, making it difficult to achieve efficient and engaging dissemination. At the heart of those challenges is a simple principle: there is no way to improve it when you cannot evaluate it right. To address this, we introduce EvoPresent, a self-improvement agent framework that unifies coherent narratives, aesthetic-aware designs, and realistic presentation delivery via virtual characters. Central to EvoPresent is PresAesth, a multi-task reinforcement learning (RL) aesthetic model that provides reliable aesthetic scoring, defect adjustment, and comparative feedback, enabling iterative self-improvement even under limited aesthetic training data. To systematically evaluate the methods, we introduce the EvoPresent Benchmark, a comprehensive benchmark comprising: Presentation Generation Quality, built on 650 top-tier AI conference papers with multimodal resources (slides, videos, and scripts) to assess both content and design; and Aesthetic Awareness, consisting of 2,000 slide pairs with varying aesthetic levels, supporting joint training and evaluation on scoring, defect adjustment, and comparison. Our findings highlight that (i) high-quality feedback is essential for agent self-improvement, while initial capability alone does not guarantee effective self-correction; (ii) automated generation pipelines exhibit a trade-off between visual design and content construction; and (iii) multi-task RL training shows stronger generalization in aesthetic awareness tasks.
[103] Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens
Mai AlKhamissi, Yunze Xiao, Badr AlKhamissi, Mona Diab
Main category: cs.CL
TL;DR: The paper critiques current cultural benchmarks for LLMs as overly static and proposes a framework to improve them by incorporating anthropological perspectives and real-world cultural complexity.
Details
Motivation: Current cultural benchmarks for large language models are inadequate because they treat culture as static facts or homogeneous values, which conflicts with anthropological understanding of culture as dynamic and context-dependent.
Method: Developed a four-part framework to categorize how benchmarks frame culture, qualitatively analyzed 20 cultural benchmarks, identified six methodological issues, and proposed improvements based on anthropological methods.
Result: Identified six recurring methodological issues in current benchmarks: treating countries as cultures, overlooking within-culture diversity, relying on oversimplified survey formats, and other problems that oversimplify cultural complexity.
Conclusion: Cultural benchmarks should be improved by incorporating real-world narratives, involving cultural communities in design, and evaluating models in context rather than isolation to better capture responses to complex cultural situations.
Abstract: Cultural evaluation of large language models has become increasingly important, yet current benchmarks often reduce culture to static facts or homogeneous values. This view conflicts with anthropological accounts that emphasize culture as dynamic, historically situated, and enacted in practice. To analyze this gap, we introduce a four-part framework that categorizes how benchmarks frame culture, such as knowledge, preference, performance, or bias. Using this lens, we qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues, including treating countries as cultures, overlooking within-culture diversity, and relying on oversimplified survey formats. Drawing on established anthropological methods, we propose concrete improvements: incorporating real-world narratives and scenarios, involving cultural communities in design and validation, and evaluating models in context rather than isolation. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks and more accurately capture the responses of the models to complex cultural situations.
[104] dInfer: An Efficient Inference Framework for Diffusion Language Models
Yuxin Ma, Lun Du, Lanning Wei, Kun Chen, Qian Xu, Kangyu Wang, Guofeng Feng, Guoshan Lu, Lin Liu, Xiaojing Qi, Xinyuan Zhang, Zhen Tao, Haibo Feng, Ziyun Jiang, Ying Xu, Zenan Huang, Yihong Zhuang, Haokai Xu, Jiaqi Hu, Zhenzhong Lan, Junbo Zhao, Jianguo Li, Da Zheng
Main category: cs.CL
TL;DR: dInfer is an efficient inference framework for diffusion-based LLMs that achieves 10x speedup over prior systems and 2-3x speedup over optimized AR models while maintaining output quality.
Details
Motivation: Diffusion-based LLMs offer inherent parallelism but lack standardized efficient inference frameworks, limiting their adoption despite increasing open-source availability.
Method: Decomposes inference pipeline into four modular components (model, diffusion iteration manager, decoding strategy, KV-cache manager) with novel algorithms and system-level optimizations.
Result: Achieves over 1,100 tokens/sec on HumanEval and average 800+ tokens/sec across six benchmarks on 8×H800 GPUs, with 10x speedup over Fast-dLLM and 2-3x speedup over optimized AR model QWen2.5-3B.
Conclusion: dInfer provides an efficient and extensible framework that enables practical deployment of diffusion-based LLMs with significant performance improvements over existing systems.
Abstract: Diffusion-based large language models (dLLMs) have emerged as a promising alternative to autoregressive (AR) LLMs, leveraging denoising-based generation to enable inherent parallelism. Although more and more open-source dLLMs are emerging, their widespread adoption remains constrained by the lack of a standardized and efficient inference framework. We present dInfer, an efficient and extensible framework for dLLM inference. dInfer decomposes the inference pipeline into four modular components – model, diffusion iteration manager, decoding strategy, and KV-cache manager – and integrates novel algorithms for each component alongside system-level optimizations. Through this combination of algorithmic innovations and system enhancements, dInfer achieves substantial efficiency gains without compromising output quality on LLaDA-MoE. At batch size 1, it surpasses 1,100 tokens per second on HumanEval and averages over 800 tokens per second across six benchmarks on 8×H800 GPUs. Compared to prior systems, dInfer delivers a 10× speedup over Fast-dLLM while maintaining similar model performance. Even compared to the AR model (with a comparable number of activation parameters and performance) QWen2.5-3B, which is highly optimized with the latest vLLM inference engine, dInfer still delivers a 2-3× speedup. The implementation of dInfer is open-sourced at https://github.com/inclusionAI/dInfer.
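A skeleton of the four-component decomposition named in the abstract, with placeholder bodies; the real algorithms (parallel decoding schedules, cache reuse, and so on) live in the linked repository, so everything below is an illustrative assumption about the interfaces.

```python
# Toy interfaces for the four dInfer-style components: model, diffusion iteration
# manager, decoding strategy, KV-cache manager. The bodies are placeholders that
# only demonstrate how the pieces fit together in an iterative unmasking loop.
from dataclasses import dataclass, field
import random

class ToyDiffusionModel:
    """Stands in for the dLLM; returns a 'prediction' for every masked position."""
    def denoise(self, tokens, step):
        return {i: f"tok{step}_{i}" for i, t in enumerate(tokens) if t == "<mask>"}

class IterationManager:
    def steps(self, seq_len):
        return range(4)                           # fixed toy denoising schedule

class DecodingStrategy:
    def commit(self, predictions, tokens):
        # Unmask a subset of positions per iteration; parallel unmasking is the dLLM draw.
        for i in random.sample(list(predictions), k=max(1, len(predictions) // 2)):
            tokens[i] = predictions[i]
        return tokens

@dataclass
class KVCacheManager:
    cache: dict = field(default_factory=dict)
    def update(self, step, states):
        self.cache[step] = states                 # placeholder for reused attention states

def generate(model, it_mgr, decoder, kv_mgr, prompt, gen_len=8):
    tokens = list(prompt) + ["<mask>"] * gen_len   # dLLMs start from a fully masked output
    for step in it_mgr.steps(len(tokens)):
        preds = model.denoise(tokens, step)
        if not preds:
            break
        tokens = decoder.commit(preds, tokens)
        kv_mgr.update(step, states=None)
    return tokens

print(generate(ToyDiffusionModel(), IterationManager(), DecodingStrategy(),
               KVCacheManager(), ["Hello"]))
```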
[105] Mathematics with large language models as provers and verifiers
Hieu Le Duc, Leo Liberti
Main category: cs.CL
TL;DR: ChatGPT using gpt-5 models collaboratively solved 5/6 IMO 2025 problems and proved about 1/3 of number theory conjectures, with formal verification in Lean to prevent hallucinations.
Details
Motivation: To demonstrate theorem-proving capabilities of large language models through collaborative protocols and formal verification, addressing concerns about hallucinations in AI-generated proofs.
Method: Used collaborative protocol with different gpt-5 instances as provers and verifiers, with final proofs formally verified by Lean proof assistant and human-checked for premise-conclusion conformance.
Result: Successfully solved 5 out of 6 2025 IMO problems and proved approximately one-third (about 22) of the 66 number theory conjectures from Cohen’s paper.
Conclusion: The methodology, while not complete or exact, demonstrates significant theorem-proving capabilities of large language models when using collaborative protocols and formal verification to ensure proof validity.
Abstract: During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove them. In this paper we report a theorem-proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the Lean proof assistant, and the conformance of premises and conclusion of the Lean code is verified by a human. Our methodology is by no means complete or exact. It was nonetheless able to solve five out of six 2025 IMO problems, and close about a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].
[106] Can LLMs Correct Themselves? A Benchmark of Self-Correction in LLMs
Guiyao Tie, Zenghui Yuan, Zeli Zhao, Chaoran Hu, Tianhe Gu, Ruihang Zhang, Sizhe Zhang, Junran Wu, Xiaoyue Tu, Ming Jin, Qingsong Wen, Lixing Chen, Pan Zhou, Lichao Sun
Main category: cs.CL
TL;DR: CorrectBench benchmark evaluates LLM self-correction methods across reasoning tasks, finding they improve accuracy but reduce efficiency, with simple CoT being competitive.
Details
Motivation: To comprehensively evaluate various self-correction methods for LLMs and determine if they can truly correct themselves, as previous evaluations were limited.
Method: Developed CorrectBench benchmark to test intrinsic, external, and fine-tuned self-correction strategies across commonsense reasoning, mathematical reasoning, and code generation tasks.
Result: Self-correction improves accuracy especially for complex reasoning; mixing strategies yields further improvements but reduces efficiency; reasoning LLMs show limited optimization with additional self-correction; simple CoT baseline remains competitive.
Conclusion: Self-correction has potential to enhance LLM reasoning but efficiency remains a challenge, advocating for research to balance reasoning capabilities with operational efficiency.
Abstract: Self-correction of large language models (LLMs) emerges as a critical component for enhancing their reasoning performance. Although various self-correction methods have been proposed, a comprehensive evaluation of these methods remains largely unexplored, and the question of whether LLMs can truly correct themselves is a matter of significant interest and concern. In this study, we introduce CorrectBench, a benchmark developed to evaluate the effectiveness of self-correction strategies, including intrinsic, external, and fine-tuned approaches, across three tasks: commonsense reasoning, mathematical reasoning, and code generation. Our findings reveal that: 1) Self-correction methods can improve accuracy, especially for complex reasoning tasks; 2) Mixing different self-correction strategies yields further improvements, though it reduces efficiency; 3) Reasoning LLMs (e.g., DeepSeek-R1) have limited optimization under additional self-correction methods and have high time costs. Interestingly, a comparatively simple chain-of-thought (CoT) baseline demonstrates competitive accuracy and efficiency. These results underscore the potential of self-correction to enhance LLM’s reasoning performance while highlighting the ongoing challenge of improving their efficiency. Consequently, we advocate for further research focused on optimizing the balance between reasoning capabilities and operational efficiency. Project Page: https://correctbench.github.io/
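A sketch of the intrinsic self-correction loop such a benchmark evaluates, contrasted with a single chain-of-thought pass; `llm` is a hypothetical stand-in for any chat-completion call and the prompts are illustrative, not CorrectBench's.

```python
# Intrinsic self-correction vs. a plain chain-of-thought baseline. The extra rounds
# are where the accuracy/efficiency trade-off discussed above comes from: each round
# adds one critique call and one revision call.
def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred chat model here")

def cot_baseline(question: str) -> str:
    return llm(f"{question}\nLet's think step by step, then give a final answer.")

def intrinsic_self_correction(question: str, rounds: int = 2) -> str:
    answer = cot_baseline(question)
    for _ in range(rounds):
        critique = llm(f"Question: {question}\nAnswer: {answer}\n"
                       "Review this answer for mistakes and list any issues.")
        answer = llm(f"Question: {question}\nPrevious answer: {answer}\n"
                     f"Critique: {critique}\nGive a corrected final answer.")
    return answer
```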
[107] FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution
Syed Rifat Raiyan, Md Farhan Ishmam, Abdullah Al Imran, Mohammad Ali Moni
Main category: cs.CL
TL;DR: FrugalPrompt is a novel prompt compression framework that retains only the most semantically significant tokens using token attribution methods, achieving 20% prompt reduction with minimal performance loss on most NLP tasks except mathematical reasoning.
Details
Motivation: Large language models face efficiency challenges due to redundant low-utility tokens in prompts, which inflate costs, carbon footprint, and inference latency.
Method: Uses GlobEnc and DecompX token attribution methods to assign salience scores, rank tokens, and preserve top-k% tokens in original order to create sparse frugalized prompts.
Result: 20% prompt reduction causes only marginal performance loss on sentiment analysis, commonsense QA, and summarization, but sharp deterioration on mathematical reasoning. Bottom-k% and random-k% tokens reveal asymmetric patterns suggesting potential task contamination.
Conclusion: The work provides nuanced understanding of LLM behavior in performance-efficiency trade-offs, delineating boundaries between tasks tolerant to contextual sparsity and those requiring exhaustive context.
Abstract: Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. Much of this overhead manifests from the redundant low-utility tokens present in typical prompts, as only a fraction of tokens typically carries the majority of the semantic weight. We address this inefficiency by introducing FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. Leveraging two state-of-the-art token attribution methods, GlobEnc and DecompX, we assign salience scores to every token in an input sequence, rank them to preserve the top-k% tokens in their original order, and obtain a sparse frugalized prompt. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning, using a suite of frontier LLMs. For the first three tasks, a 20% prompt reduction incurs only a marginal loss in task performance, demonstrating that contemporary LLMs can reconstruct elided context from high-salience cues. In contrast, performance on mathematical reasoning deteriorates sharply, reflecting a stronger dependence on complete token continuity. Further analysis with bottom-k% and random-k% tokens reveals asymmetric performance patterns that may suggest potential task contamination effects, wherein models may resort to shallow memorized patterns from pretraining exposure for conventional NLP tasks. We posit that our work contributes to a more nuanced understanding of LLM behavior in performance-efficiency trade-offs, and delineate the boundary between tasks tolerant to contextual sparsity and those requiring exhaustive context. Our source code and models are available at: https://github.com/Starscream-11813/Frugal-ICL.
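The compression step itself is simple to sketch: given per-token salience scores (GlobEnc or DecompX in the paper; any attribution method in principle), keep the top-k% highest-scoring tokens in their original order. The toy sentence and scores below are placeholders.

```python
# Keep the top-k% most salient tokens, preserving their original order.
def frugalize(tokens, salience, keep_ratio=0.8):
    k = max(1, int(len(tokens) * keep_ratio))
    top = sorted(range(len(tokens)), key=lambda i: salience[i], reverse=True)[:k]
    keep = set(top)
    return [t for i, t in enumerate(tokens) if i in keep]

tokens = "the movie was honestly a pretty remarkable piece of work".split()
salience = [0.1, 0.6, 0.3, 0.5, 0.1, 0.4, 0.9, 0.7, 0.2, 0.8]   # placeholder scores
print(" ".join(frugalize(tokens, salience, keep_ratio=0.8)))    # ~20% of tokens removed
```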
[108] Natural Language Processing for Cardiology: A Narrative Review
Kailai Yang, Yan Leng, Xin Zhang, Tianlin Zhang, Paul Thompson, Bernard Keavney, Maciej Tomaszewski, Sophia Ananiadou
Main category: cs.CL
TL;DR: This is a comprehensive review of NLP applications in cardiology from 2014-2025, analyzing 265 studies across multiple dimensions including NLP paradigms, cardiology tasks, disease types, and data sources.
Details
Motivation: Cardiovascular diseases are complex and multifactorial, with information dispersed across various textual data sources. NLP can help analyze this unstructured data to improve diagnosis, treatment, and prevention of cardiac disorders.
Method: Systematic review of six literature databases, rigorous screening process to identify 265 relevant articles, multi-dimensional analysis including NLP paradigms, cardiology tasks, disease types, and data sources, plus temporal analysis of methodological trends.
Result: Found substantial diversity across all analyzed dimensions, showing breadth and evolution of NLP research in cardiology. Temporal analysis revealed progression from rule-based systems to large language models.
Conclusion: This represents the most comprehensive synthesis of NLP research in cardiology to date, with future directions including developing interpretable LLMs and integrating multimodal data.
Abstract: Cardiovascular diseases are becoming increasingly prevalent in modern society, with a profound impact on global health and well-being. These cardiovascular disorders are complex and multifactorial, influenced by genetic predispositions, lifestyle choices, and diverse socioeconomic and clinical factors. Information about these interrelated factors is dispersed across multiple types of textual data, including patient narratives, medical records, and scientific literature. Natural language processing (NLP) has emerged as a powerful approach for analysing such unstructured data, enabling healthcare professionals and researchers to gain deeper insights that may transform the diagnosis, treatment, and prevention of cardiac disorders. This review provides a comprehensive overview of NLP research in cardiology from 2014 to 2025. We systematically searched six literature databases for studies describing NLP applications across a range of cardiovascular diseases. After a rigorous screening process, we identified 265 relevant articles. Each study was analysed across multiple dimensions, including NLP paradigms, cardiology-related tasks, disease types, and data sources. Our findings reveal substantial diversity within these dimensions, reflecting the breadth and evolution of NLP research in cardiology. A temporal analysis further highlights methodological trends, showing a progression from rule-based systems to large language models. Finally, we discuss key challenges and future directions, such as developing interpretable LLMs and integrating multimodal data. To the best of our knowledge, this review represents the most comprehensive synthesis of NLP research in cardiology to date.
[109] Text or Pixels? It Takes Half: On the Token Efficiency of Visual Text Inputs in Multimodal LLMs
Yanhong Li, Zixuan Lan, Jiawei Zhou
Main category: cs.CL
TL;DR: Using text-as-image compression for LLMs reduces token usage by nearly half while maintaining performance on long-context tasks.
Details
Motivation: To explore if visual text representations can compress textual inputs for LLMs to reduce token consumption while preserving performance.
Method: Render long text inputs as single images and feed them directly to decoder LLMs, exploiting visual text representations as input compression.
Result: Substantial token savings (often nearly half) without degrading performance on RULER (long-context retrieval) and CNN/DailyMail (document summarization) benchmarks.
Conclusion: Visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs.
Abstract: Large language models (LLMs) and their multimodal variants can now process visual inputs, including images of text. This raises an intriguing question: can we compress textual inputs by feeding them as images to reduce token usage while preserving performance? In this paper, we show that visual text representations are a practical and surprisingly effective form of input compression for decoder LLMs. We exploit the idea of rendering long text inputs as a single image and providing it directly to the model. This dramatically reduces the number of decoder tokens required, offering a new form of input compression. Through experiments on two distinct benchmarks, RULER (long-context retrieval) and CNN/DailyMail (document summarization), we demonstrate that this text-as-image method yields substantial token savings (often nearly half) without degrading task performance.
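A minimal sketch of the text-as-image idea using Pillow: render a long passage onto a single image and hand that image, rather than the raw tokens, to a multimodal model. Font, wrapping, and image size are arbitrary choices for illustration.

```python
# Render a long passage as one image; the saved file can then be sent to a multimodal
# model in place of the raw text tokens.
import textwrap
from PIL import Image, ImageDraw

def render_text_as_image(text, width=1024, line_chars=90, line_height=18):
    lines = textwrap.wrap(text, width=line_chars)
    img = Image.new("RGB", (width, line_height * (len(lines) + 2)), "white")
    draw = ImageDraw.Draw(img)
    for i, line in enumerate(lines):
        draw.text((10, 10 + i * line_height), line, fill="black")   # default bitmap font
    return img

long_document = "Large language models can read text rendered as pixels. " * 40
render_text_as_image(long_document).save("document.png")
```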
cs.CV
[110] Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
Xueqi Ma, Yanbei Jiang, Sarah Erfani, James Bailey, Weifeng Liu, Krista A. Ehinger, Jey Han Lau
Main category: cs.CV
TL;DR: PICK is a multi-step framework using MLLMs for psychoanalytical image comprehension, specifically for the House-Tree-Person psychological test, achieving expert-level reasoning through hierarchical analysis and knowledge injection.
Details
Motivation: MLLMs excel at objective multimodal tasks but lack application in subjective, emotionally nuanced domains like psychological analysis, particularly for clinical assessments like the HTP test.
Method: Hierarchical decomposition of drawings into single-object, multi-object, and whole levels; targeted analysis with visual cue extraction; HTP knowledge base with feature extraction module trained via reinforcement learning; integration of multi-faceted information.
Result: PICK significantly enhances MLLMs’ psychological analysis capabilities and is validated as a general framework through extensions to emotion understanding tasks.
Conclusion: The framework bridges MLLMs with specialized expert domains, providing a structured, interpretable approach for understanding mental states through visual expression.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a widely used psychological assessment in clinical practice. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate these multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks.
[111] Dimensionality Reduction for Remote Sensing Data Analysis: A Systematic Review of Methods and Applications
Nathan Mankovich, Kai-Hendrik Cohrs, Homer Durand, Vasileios Sitokonstantinou, Tristan Williams, Gustau Camps-Valls
Main category: cs.CV
TL;DR: This paper reviews dimensionality reduction techniques for Earth observation data to address challenges of high-dimensional data and enhance machine learning applications in remote sensing.
Details
Motivation: Earth observation generates massive datasets that are crucial for addressing societal challenges, but high dimensionality causes issues like sparsity and inefficiency that limit machine learning effectiveness.
Method: The paper provides a comprehensive review and handbook for applying dimensionality reduction techniques, particularly feature extraction methods, across the remote sensing data value chain.
Result: The review identifies how dimensionality reduction preserves essential data properties while reducing complexity, enabling improved data compression, cleaning, fusion, visualization, anomaly detection, and prediction.
Conclusion: The paper serves as a practical guide for leveraging dimensionality reduction in remote sensing and highlights opportunities for under-explored algorithms and future research directions.
Abstract: Earth observation involves collecting, analyzing, and processing an ever-growing mass of data. Automatically harvesting information is crucial for addressing significant societal, economic, and environmental challenges, ranging from environmental monitoring to urban planning and disaster management. However, the high dimensionality of these data poses challenges in terms of sparsity, inefficiency, and the curse of dimensionality, which limits the effectiveness of machine learning models. Dimensionality reduction (DR) techniques, specifically feature extraction, address these challenges by preserving essential data properties while reducing complexity and enhancing tasks such as data compression, cleaning, fusion, visualization, anomaly detection, and prediction. This review provides a handbook for leveraging DR across the RS data value chain and identifies opportunities for under-explored DR algorithms and their application in future research.
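As a concrete example of the feature-extraction DR the review surveys, the sketch below applies PCA to a random stand-in for a hyperspectral cube, reducing hundreds of spectral bands per pixel to a handful of components; band and component counts are arbitrary.

```python
# PCA on a hyperspectral cube: treat each pixel's spectrum as one sample and project
# it onto a few principal components, a common first step before downstream analysis.
import numpy as np
from sklearn.decomposition import PCA

height, width, bands = 64, 64, 200
cube = np.random.rand(height, width, bands)           # stand-in for a hyperspectral scene

pixels = cube.reshape(-1, bands)                       # one spectrum per pixel
pca = PCA(n_components=10)
reduced = pca.fit_transform(pixels).reshape(height, width, 10)

print(reduced.shape, f"variance kept: {pca.explained_variance_ratio_.sum():.2f}")
```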
[112] A Matter of Time: Revealing the Structure of Time in Vision-Language Models
Nidham Tekaya, Manuela Waldner, Matthias Zeppelzauer
Main category: cs.CV
TL;DR: This paper investigates the temporal awareness of vision-language models (VLMs) like CLIP, introduces a benchmark dataset TIME10k, and proposes methods to extract explicit timeline representations from VLM embeddings for temporal reasoning tasks.
Details
Motivation: To assess whether large-scale vision-language models have temporal awareness - the ability to position visual content in time - given their generalizable multimodal representations and open-vocabulary capabilities.
Method: Created TIME10k benchmark dataset with 10,000+ images and temporal ground truth; evaluated 37 VLMs using novel methodology; discovered temporal information forms low-dimensional non-linear manifold in embedding space; proposed methods to derive explicit timeline representations from embeddings.
Result: Temporal information is structured along a low-dimensional, non-linear manifold in VLM embedding space; timeline representations achieve competitive to superior accuracy compared to prompt-based baselines while being computationally efficient.
Conclusion: VLMs do possess temporal awareness that can be extracted and structured into explicit timeline representations, enabling effective temporal reasoning tasks with computational efficiency.
Abstract: Large-scale vision-language models (VLMs) such as CLIP have gained popularity for their generalizable and expressive multimodal representations. By leveraging large-scale training data with diverse textual metadata, VLMs acquire open-vocabulary capabilities, solving tasks beyond their training scope. This paper investigates the temporal awareness of VLMs, assessing their ability to position visual content in time. We introduce TIME10k, a benchmark dataset of over 10,000 images with temporal ground truth, and evaluate the time-awareness of 37 VLMs by a novel methodology. Our investigation reveals that temporal information is structured along a low-dimensional, non-linear manifold in the VLM embedding space. Based on this insight, we propose methods to derive an explicit “timeline” representation from the embedding space. These representations model time and its chronological progression and thereby facilitate temporal reasoning tasks. Our timeline approaches achieve competitive to superior accuracy compared to a prompt-based baseline while being computationally efficient. All code and data are available at https://tekayanidham.github.io/timeline-page/.
[113] Ninja Codes: Neurally Generated Fiducial Markers for Stealthy 6-DoF Tracking
Yuichiro Takeuchi, Yusuke Imoto, Shunya Kato
Main category: cs.CV
TL;DR: Ninja Codes are stealthy neural fiducial markers that blend into environments while enabling 6-DoF location tracking for AR, robotics, and motion interfaces.
Details
Motivation: Conventional fiducial markers are visually conspicuous and undesirable for aesthetic reasons, limiting their use in real-world applications.
Method: End-to-end neural network training using encoder to apply subtle visual alterations to images, creating codes that can be printed on regular paper and detected with RGB cameras.
Result: Reliable 6-DoF location tracking under indoor lighting conditions while successfully concealing markers within diverse environmental textures.
Conclusion: Ninja Codes provide a stealthy tracking solution for applications where the conspicuous appearance of traditional markers is problematic.
Abstract: In this paper we describe Ninja Codes, neurally-generated fiducial markers that can be made to naturally blend into various real-world environments. An encoder network converts arbitrary images into Ninja Codes by applying visually modest alterations; the resulting codes, printed and pasted onto surfaces, can provide stealthy 6-DoF location tracking for a wide range of applications including augmented reality, robotics, motion-based user interfaces, etc. Ninja Codes can be printed using off-the-shelf color printers on regular printing paper, and can be detected using any device equipped with a modern RGB camera and capable of running inference. Using an end-to-end process inspired by prior work on deep steganography, we jointly train a series of network modules that perform the creation and detection of Ninja Codes. Through experiments, we demonstrate Ninja Codes’ ability to provide reliable location tracking under common indoor lighting conditions, while successfully concealing themselves within diverse environmental textures. We expect Ninja Codes to offer particular value in scenarios where the conspicuous appearances of conventional fiducial markers make them undesirable for aesthetic and other reasons.
[114] Robust Driving QA through Metadata-Grounded Context and Task-Specific Prompts
Seungjun Yu, Junsung Park, Youngsun Lim, Hyunjung Shim
Main category: cs.CV
TL;DR: A two-phase vision-language QA system for autonomous driving that uses multimodal LLM with camera inputs, temporal history, and chain-of-thought prompting to answer perception, prediction, and planning questions.
Details
Motivation: To enhance high-level driving question answering by leveraging pretrained vision-language models with carefully engineered prompts and contextual grounding.
Method: Phase-1: Uses Qwen2.5-VL-32B with six-camera inputs, temporal history, chain-of-thought prompts with few-shot exemplars, and self-consistency ensemble. Phase-2: Augments prompts with nuScenes scene metadata and category-specific instructions for different task types.
Result: Achieves 67.37% overall accuracy on driving QA benchmark, with significant improvements over baseline models. Self-consistency raises accuracy from 65.1% to 66.85%. Maintains 96% accuracy under severe visual corruption.
Conclusion: Carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA performance with pretrained vision-language models.
Abstract: We present a two-phase vision-language QA system for autonomous driving that answers high-level perception, prediction, and planning questions. In Phase-1, a large multimodal LLM (Qwen2.5-VL-32B) is conditioned on six-camera inputs, a short temporal window of history, and a chain-of-thought prompt with few-shot exemplars. A self-consistency ensemble (multiple sampled reasoning chains) further improves answer reliability. In Phase-2, we augment the prompt with nuScenes scene metadata (object annotations, ego-vehicle state, etc.) and category-specific question instructions (separate prompts for perception, prediction, planning tasks). In experiments on a driving QA benchmark, our approach significantly outperforms the baseline Qwen2.5 models. For example, using 5 history frames and 10-shot prompting in Phase-1 yields 65.1% overall accuracy (vs. 62.61% with zero-shot); applying self-consistency raises this to 66.85%. Phase-2 achieves 67.37% overall. Notably, the system maintains 96% accuracy under severe visual corruption. These results demonstrate that carefully engineered prompts and contextual grounding can greatly enhance high-level driving QA with pretrained vision-language models.
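The self-consistency ensemble described above amounts to sampling several chain-of-thought completions and majority-voting their final answers. A minimal sketch of that voting step (the `sample_chain` callable standing in for the multimodal LLM is hypothetical):

```python
from collections import Counter
from typing import Callable, List

def self_consistent_answer(sample_chain: Callable[[str], str],
                           prompt: str,
                           n_samples: int = 5) -> str:
    """Sample several chain-of-thought completions and majority-vote the
    final answers (self-consistency). `sample_chain` is any function that
    returns one sampled answer string for the given prompt."""
    answers: List[str] = [sample_chain(prompt).strip().lower() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

# Toy usage with a stubbed sampler standing in for the multimodal LLM.
if __name__ == "__main__":
    import random
    stub = lambda p: random.choice(["turn left", "turn left", "go straight"])
    print(self_consistent_answer(stub, "What should the ego vehicle do next?"))
```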
[115] ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
Zhenxing Zhang, Yaxiong Wang, Lechao Cheng, Zhun Zhong, Dan Guo, Meng Wang
Main category: cs.CV
TL;DR: ASAP is a framework for detecting and grounding multi-modal media manipulation that improves cross-modal semantic alignment between images and text using MLLMs/LLMs and a Manipulation-Guided Cross Attention mechanism.
Details
Motivation: Existing DGM4 methods lack attention to cross-modal semantic alignment between image and text, which hampers accurate manipulation detection and grounding.
Method: Uses off-the-shelf MLLMs and LLMs to construct paired image-text data, performs cross-modal alignment learning, and implements Manipulation-Guided Cross Attention to focus on manipulated components.
Result: Extensive experiments on DGM4 dataset show the model surpasses comparison methods by a clear margin.
Conclusion: The proposed ASAP framework effectively advances semantic alignment learning for multi-modal media manipulation detection and grounding.
Abstract: We present ASAP, a new framework for detecting and grounding multi-modal media manipulation (DGM4). Upon thorough examination, we observe that accurate fine-grained cross-modal semantic alignment between image and text is vital for accurate manipulation detection and grounding, yet existing DGM4 methods pay little attention to cross-modal alignment, which limits detection accuracy. To remedy this issue, this work advances semantic alignment learning to promote the task. In particular, we utilize off-the-shelf Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) to construct paired image-text data, especially for manipulated instances. Cross-modal alignment learning is then performed to strengthen the semantic alignment. Beyond these explicit auxiliary clues, we further design a Manipulation-Guided Cross Attention (MGCA) mechanism to provide implicit guidance for manipulation perception. With ground truth available during training, MGCA encourages the model to concentrate on manipulated components while downplaying normal ones, enhancing its ability to capture manipulations. Extensive experiments on the DGM4 dataset demonstrate that our model surpasses comparison methods by a clear margin.
[116] Δt-Mamba3D: A Time-Aware Spatio-Temporal State-Space Model for Breast Cancer Risk Prediction
Zhengbo Zhou, Dooman Arefan, Margarita Zuley, Shandong Wu
Main category: cs.CV
TL;DR: Time-Aware Δt-Mamba3D is a novel state-space architecture for longitudinal medical imaging that handles irregular time intervals between high-resolution images while maintaining computational efficiency.
Details
Motivation: Current methods fail to fully exploit spatial and temporal cues in sequential radiological images captured at irregular intervals, either collapsing spatial information or using inefficient spatio-temporal models.
Method: Uses continuous-time selective scanning that integrates true time differences between exams into state transitions, plus a multi-scale 3D neighborhood fusion module for capturing spatio-temporal relationships.
Result: Superior performance in breast cancer risk prediction, improving validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores compared to recurrent, transformer, and state-space models.
Conclusion: The model efficiently processes long patient screening histories with linear complexity, forming a new framework for longitudinal image analysis.
Abstract: Longitudinal analysis of sequential radiological images is hampered by a fundamental data challenge: how to effectively model a sequence of high-resolution images captured at irregular time intervals. This data structure contains indispensable spatial and temporal cues that current methods fail to fully exploit. Models often compromise by either collapsing spatial information into vectors or applying spatio-temporal models that are computationally inefficient and incompatible with non-uniform time steps. We address this challenge with Time-Aware $\Delta$t-Mamba3D, a novel state-space architecture adapted for longitudinal medical imaging. Our model simultaneously encodes irregular inter-visit intervals and rich spatio-temporal context while remaining computationally efficient. Its core innovation is a continuous-time selective scanning mechanism that explicitly integrates the true time difference between exams into its state transitions. This is complemented by a multi-scale 3D neighborhood fusion module that robustly captures spatio-temporal relationships. In a comprehensive breast cancer risk prediction benchmark using sequential screening mammogram exams, our model shows superior performance, improving the validation c-index by 2-5 percentage points and achieving higher 1-5 year AUC scores compared to established variants of recurrent, transformer, and state-space models. Thanks to its linear complexity, the model can efficiently process long and complex patient screening histories of mammograms, forming a new framework for longitudinal image analysis.
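The core idea of feeding the true inter-exam gap into the state transition can be illustrated with a zero-order-hold discretization of a diagonal linear state-space model whose step size is the actual time difference. This is a hedged sketch of the general mechanism, not the paper's exact selective-scan formulation:

```python
import torch

def time_aware_ssm_scan(x, dt, A, B):
    """Selective-scan-style recurrence where the discretization step uses the
    true time gap between consecutive exams (zero-order hold):
        h_t = exp(A * dt_t) * h_{t-1} + dt_t * B * x_t
    x: (T, D) input features, dt: (T,) time gaps (e.g. in months),
    A: (N,) diagonal state matrix (negative for stability), B: (N, D).
    Returns hidden states of shape (T, N). Illustrative only."""
    T, _ = x.shape
    h = torch.zeros(A.shape[0])
    out = []
    for t in range(T):
        decay = torch.exp(A * dt[t])          # element-wise state decay over the gap
        h = decay * h + dt[t] * (B @ x[t])    # inject current exam, scaled by dt
        out.append(h)
    return torch.stack(out)

# Example: three mammography exams spaced 12, 6, and 24 months apart.
x = torch.randn(3, 8)
dt = torch.tensor([12.0, 6.0, 24.0])
A = -torch.rand(16)            # negative diagonal -> exponential forgetting
B = torch.randn(16, 8) * 0.1
print(time_aware_ssm_scan(x, dt, A, B).shape)  # torch.Size([3, 16])
```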
[117] EgoBlind: Towards Egocentric Visual Assistance for the Blind
Junbin Xiao, Nanxin Huang, Hao Qiu, Zhulin Tao, Xun Yang, Richang Hong, Meng Wang, Angela Yao
Main category: cs.CV
TL;DR: EgoBlind is the first egocentric VideoQA dataset from blind individuals, containing 1,392 videos and 5,311 questions to evaluate MLLMs’ assistive capabilities for the visually impaired.
Details
Motivation: To address the lack of datasets evaluating multimodal large language models' ability to provide visual assistance to blind individuals in real-world scenarios.
Method: Created EgoBlind dataset with first-person videos from blind individuals' daily lives, featuring questions directly from blind users with manual annotations. Evaluated 16 advanced MLLMs on this dataset.
Result: All evaluated MLLMs struggled significantly, with best performers achieving only 60% accuracy compared to human performance of 87.4%. Major limitations in egocentric visual assistance were identified.
Conclusion: EgoBlind serves as a foundation for developing effective AI assistants to enhance independence for blind and visually impaired individuals, highlighting the need for significant improvements in current MLLM capabilities.
Abstract: We present EgoBlind, the first egocentric VideoQA dataset collected from blind individuals to evaluate the assistive capabilities of contemporary multimodal large language models (MLLMs). EgoBlind comprises 1,392 first-person videos from the daily lives of blind and visually impaired individuals. It also features 5,311 questions directly posed or verified by the blind to reflect their in-situation needs for visual assistance. Each question has an average of 3 manually annotated reference answers to reduce subjectiveness. Using EgoBlind, we comprehensively evaluate 16 advanced MLLMs and find that all models struggle. The best performers achieve an accuracy near 60%, which is far behind human performance of 87.4%. To guide future advancements, we identify and summarize major limitations of existing MLLMs in egocentric visual assistance for the blind and explore heuristic solutions for improvement. With these efforts, we hope that EgoBlind will serve as a foundation for developing effective AI assistants to enhance the independence of the blind and visually impaired. Data and code are available at https://github.com/doc-doc/EgoBlind.
[118] MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models
Aritra Bhowmik, Denis Korzhenkov, Cees G. M. Snoek, Amirhossein Habibian, Mohsen Ghafoorian
Main category: cs.CV
TL;DR: A motion-centric alignment framework that disentangles motion features from pretrained video encoders and aligns them with text-to-video diffusion models to improve temporal coherence and physical plausibility.
Details
Motivation: Text-to-video diffusion models often fail to generate temporally coherent and physically plausible motion due to insufficient understanding of complex motions, and existing methods using entangled video encoder features limit alignment benefits.
Method: Learn a disentangled motion subspace from pretrained video encoders optimized to predict ground-truth optical flow, then align latent features of text-to-video diffusion models to this motion subspace.
Result: Improves physical commonsense in state-of-the-art video diffusion models while preserving adherence to textual prompts, validated on VideoPhy, VideoPhy2, VBench, and VBench-2.0 benchmarks and user studies.
Conclusion: Motion-centric alignment with disentangled motion features effectively enhances motion quality and physical plausibility in text-to-video generation without compromising text-video alignment.
Abstract: Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models’ insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge and generate more plausible videos. Our method improves the physical commonsense in a state-of-the-art video diffusion model, while preserving adherence to textual prompts, as evidenced by empirical evaluations on VideoPhy, VideoPhy2, VBench, and VBench-2.0, along with a user study.
[119] PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
Amith Ananthram, Elias Stengel-Eskin, Lorena A. Bradford, Julia Demarest, Adam Purvis, Keith Krut, Robert Stein, Rina Elster Pantalony, Mohit Bansal, Kathleen McKeown
Main category: cs.CV
TL;DR: PoSh is a new metric for evaluating detailed image descriptions that uses scene graphs to guide LLMs-as-a-Judge, providing interpretable scores with fine-grained error localization. The paper also introduces DOCENT, a challenging benchmark with expert-written references and human judgments for evaluating detailed image description.
Details
Motivation: Standard metrics like CIDEr and SPICE were designed for short texts and are insensitive to errors in attribute and relation attachments in long descriptions. There's a need for metrics that can localize errors and handle compositional understanding in detailed image descriptions.
Method: PoSh uses scene graphs as structured rubrics to guide LLM judges, producing aggregate scores based on fine-grained errors. The method is validated using the new DOCENT dataset containing artwork with expert references and human judgments from art history students.
Result: PoSh achieves stronger correlations (+0.05 Spearman ρ) with human judgments than best alternatives, is robust across image types, and works well as a reward function outperforming standard supervised fine-tuning. Foundation models struggle with full error-free coverage of images with rich scene dynamics.
Conclusion: PoSh and DOCENT enable advances in detailed image description evaluation and assistive text generation, establishing a demanding new task to gauge VLM progress in handling complex scene dynamics.
Abstract: While vision-language models (VLMs) have advanced into detailed image description, evaluation remains a challenge. Standard metrics (e.g. CIDEr, SPICE) were designed for short texts and tuned to recognize errors that are now uncommon, such as object misidentification. In contrast, long texts require sensitivity to attribute and relation attachments and scores that localize errors to particular text spans. In this work, we introduce PoSh, a metric for detailed image description that uses scene graphs as structured rubrics to guide LLMs-as-a-Judge, producing aggregate scores grounded in fine-grained errors (e.g. mistakes in compositional understanding). PoSh is replicable, interpretable and a better proxy for human raters than existing metrics (including GPT4o-as-a-Judge). To validate PoSh, we introduce a challenging new dataset, DOCENT. This novel benchmark contains artwork, paired with expert-written references, and model-generated descriptions, augmented with granular and coarse judgments of their quality from art history students. Thus, DOCENT enables evaluating both detailed image description metrics and detailed image description itself in a challenging new domain. We show that PoSh achieves stronger correlations (+0.05 Spearman $\rho$) with the human judgments in DOCENT than the best open-weight alternatives, is robust to image type (using CapArena, an existing dataset of web imagery) and is a capable reward function, outperforming standard supervised fine-tuning. Then, using PoSh, we characterize the performance of open and closed models in describing the paintings, sketches and statues in DOCENT and find that foundation models struggle to achieve full, error-free coverage of images with rich scene dynamics, establishing a demanding new task to gauge VLM progress. Through both PoSh and DOCENT, we hope to enable advances in important areas such as assistive text generation.
[120] UniHPR: Unified Human Pose Representation via Singular Value Contrastive Learning
Zhongyu Jiang, Wenhao Chai, Lei Li, Zhuoran Zhou, Cheng-Yen Yang, Jenq-Neng Hwang
Main category: cs.CV
TL;DR: UniHPR is a unified Human Pose Representation learning pipeline that aligns pose embeddings from images, 2D and 3D human poses using a novel singular value-based contrastive learning loss, achieving state-of-the-art performance on pose estimation and enabling pose retrieval.
Details
Motivation: There is growing interest in multi-modal alignment pipelines for human-centric applications, but limited research on correlating different human pose representations (images, 2D keypoints, 3D skeletons, mesh models) using contrastive learning.
Method: Proposes UniHPR pipeline with a novel singular value-based contrastive learning loss to align multiple data representations simultaneously, using a simple 3D human pose decoder for evaluation.
Result: Achieves MPJPE 49.9mm on Human3.6M and PA-MPJPE 51.6mm on 3DPW with cross-domain evaluation. Enables 2D and 3D pose retrieval with 9.24mm MPJPE retrieval error on Human3.6M.
Conclusion: UniHPR effectively aligns multiple human pose representations and demonstrates strong performance on pose estimation tasks while enabling cross-modal pose retrieval.
Abstract: In recent years, there has been a growing interest in developing effective alignment pipelines to generate unified representations from different modalities for multi-modal fusion and generation. As an important component of Human-Centric applications, Human Pose representations are critical in many downstream tasks, such as Human Pose Estimation, Action Recognition, Human-Computer Interaction, Object tracking, etc. Human Pose representations or embeddings can be extracted from images, 2D keypoints, 3D skeletons, mesh models, and lots of other modalities. Yet, there are limited instances where the correlation among all of those representations has been clearly researched using a contrastive paradigm. In this paper, we propose UniHPR, a unified Human Pose Representation learning pipeline, which aligns Human Pose embeddings from images, 2D and 3D human poses. To align more than two data representations at the same time, we propose a novel singular value-based contrastive learning loss, which better aligns different modalities and further boosts performance. To evaluate the effectiveness of the aligned representation, we choose 2D and 3D Human Pose Estimation (HPE) as our evaluation tasks. In our evaluation, with a simple 3D human pose decoder, UniHPR achieves remarkable performance metrics: MPJPE 49.9mm on the Human3.6M dataset and PA-MPJPE 51.6mm on the 3DPW dataset with cross-domain evaluation. Meanwhile, we are able to achieve 2D and 3D pose retrieval with our unified human pose representations in Human3.6M dataset, where the retrieval error is 9.24mm in MPJPE.
[121] Advancing Brain Tumor Segmentation via Attention-based 3D U-Net Architecture and Digital Image Processing
Eyad Gad, Seif Soliman, M. Saeed Darweesh
Main category: cs.CV
TL;DR: This paper proposes an improved 3D U-Net model with attention mechanism and tumor detection algorithm for brain tumor segmentation from MRI scans, achieving superior performance on BraTS 2020 dataset.
Details
Motivation: Standard U-Net models struggle with irregular tumor shapes and ambiguous boundaries in brain tumor segmentation, and face challenges with class imbalance and high computational requirements when training on high-resolution MRI data like BraTS datasets.
Method: Integration of attention mechanism into 3D U-Net to capture intricate details and prioritize informative regions, combined with a tumor detection algorithm based on digital image processing to address class imbalance and mitigate bias.
Result: The proposed model achieved outstanding performance on BraTS 2020 dataset with dice score of 0.975, specificity of 0.988, and sensitivity of 0.995, outperforming related studies.
Conclusion: The attention-enhanced 3D U-Net with tumor detection algorithm effectively improves brain tumor segmentation performance, offering valuable insights for reliable diagnosis in clinical settings.
Abstract: In the realm of medical diagnostics, rapid advancements in Artificial Intelligence (AI) have yielded remarkable improvements in brain tumor segmentation. Encoder-decoder architectures, such as U-Net, have played a transformative role by effectively extracting meaningful representations for 3D brain tumor segmentation from magnetic resonance imaging (MRI) scans. However, standard U-Net models encounter challenges in accurately delineating tumor regions, especially when dealing with irregular shapes and ambiguous boundaries. Additionally, training robust segmentation models on high-resolution MRI data, such as the BraTS datasets, requires high computational resources and often faces challenges associated with class imbalance. This study proposes integrating an attention mechanism into the 3D U-Net model, enabling the model to capture intricate details and prioritize informative regions during the segmentation process. Additionally, a tumor detection algorithm based on digital image processing techniques is used to address the issue of imbalanced training data and mitigate bias. This study aims to enhance the performance of brain tumor segmentation, ultimately improving the reliability of diagnosis. The proposed model is thoroughly evaluated on the BraTS 2020 dataset using various performance metrics to accomplish this goal. The results indicate that the model outperforms related studies, achieving a Dice score of 0.975, specificity of 0.988, and sensitivity of 0.995, demonstrating its efficacy in improving brain tumor segmentation and offering valuable insights for reliable diagnosis in clinical settings.
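The abstract does not spell out the attention design; a common choice for gating U-Net skip connections is the additive attention gate of Attention U-Net, sketched below under that assumption:

```python
import torch
import torch.nn as nn

class AttentionGate3D(nn.Module):
    """Additive attention gate for 3D U-Net skip connections: a decoder
    gating signal highlights informative regions in the encoder features.
    Generic formulation; the paper's exact design may differ."""
    def __init__(self, enc_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv3d(enc_ch, inter_ch, kernel_size=1)
        self.w_g = nn.Conv3d(gate_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv3d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # x: encoder skip features, g: decoder gating signal (same spatial size here)
        att = torch.sigmoid(self.psi(torch.relu(self.w_x(x) + self.w_g(g))))
        return x * att  # suppress uninformative voxels, keep tumour-relevant ones

x = torch.randn(1, 32, 16, 32, 32)   # encoder features
g = torch.randn(1, 64, 16, 32, 32)   # decoder features at matching resolution
print(AttentionGate3D(32, 64, 16)(x, g).shape)
```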
[122] LookUp3D: Data-Driven 3D Scanning
Giancarlo Pereira, Yidan Gao, Yurii Piadyk, David Fouhey, Claudio T Silva, Daniele Panozzo
Main category: cs.CV
TL;DR: A high-speed 3D scanning method achieving 450 fps at 1MP or 1,450 fps at 0.4MP using per-pixel lookup tables that map colors to depths, calibrated with a linear stage.
Details
Motivation: Enable accurate 3D scanning of deformable objects during interactions for applications in graphics, robotics, science, and medicine, overcoming limitations of existing methods that trade off resolution or accuracy.
Method: Uses per-pixel lookup tables mapping colors to depths, built using a linear stage, with imperfections like lens distortion and sensor defects baked into calibration.
Result: Successfully acquires geometry of objects undergoing high-speed deformations and oscillations, outperforming commercial sensors like Microsoft Kinect and Intel Realsense in comparison tests.
Conclusion: The method enables unprecedented high-speed, high-resolution 3D scanning capable of recovering physical properties from reconstructions of rapidly deforming objects.
Abstract: High speed, high-resolution, and accurate 3D scanning would open doors to many new applications in graphics, robotics, science, and medicine by enabling the accurate scanning of deformable objects during interactions. Past attempts to use structured light, time-of-flight, and stereo in high-speed settings have usually required tradeoffs in resolution or accuracy. In this paper, we introduce a method that enables, for the first time, 3D scanning at 450 frames per second at 1 Megapixel, or 1,450 frames per second at 0.4 Megapixel, in an environment with controlled lighting. The key idea is to use a per-pixel lookup table that maps colors to depths, which is built using a linear stage. Imperfections, such as lens distortion and sensor defects, are baked into the calibration. We describe our method and test it on a novel hardware prototype. We compare the system with both ground-truth geometry as well as commercially available dynamic sensors like the Microsoft Kinect and Intel Realsense. Our results show the system acquiring geometry of objects undergoing high-speed deformations and oscillations and demonstrate the ability to recover physical properties from the reconstructions.
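The lookup-table decoding can be pictured as follows: calibration stores, for every pixel, the color observed at each known stage depth; at runtime each pixel's live color is matched against its stored colors and the corresponding depth is returned. A toy NumPy illustration (nearest-neighbor color matching is an assumption; the paper's actual decoding may differ):

```python
import numpy as np

def build_lut(calib_frames: np.ndarray, depths: np.ndarray):
    """calib_frames: (K, H, W, 3) images captured at K known stage depths;
    depths: (K,). The per-pixel LUT is simply the calibration stack itself."""
    return calib_frames.astype(np.float32), depths

def lookup_depth(frame: np.ndarray, lut: np.ndarray, depths: np.ndarray):
    """For each pixel, return the calibration depth whose stored color is
    closest (L2 in RGB) to the live color. frame: (H, W, 3) -> (H, W) depths."""
    diff = lut - frame[None].astype(np.float32)      # (K, H, W, 3)
    idx = np.argmin((diff ** 2).sum(-1), axis=0)     # (H, W) best depth index per pixel
    return depths[idx]

# Toy example: 5 calibration depths, 4x4 image.
rng = np.random.default_rng(0)
calib = rng.random((5, 4, 4, 3))
lut, d = build_lut(calib, np.linspace(0.5, 1.5, 5))
print(lookup_depth(calib[2], lut, d))  # recovers the depth of calibration slice 2
```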
[123] A Novel Approach to Breast Cancer Segmentation using U-Net Model with Attention Mechanisms and FedProx
Eyad Gad, Mustafa Abou Khatwa, Mustafa A. Elattar, Sahar Selim
Main category: cs.CV
TL;DR: This paper applies Federated Proximal (FedProx) method to non-IID ultrasound breast cancer datasets for enhanced tumor segmentation while preserving patient privacy, achieving 96% accuracy with modified U-Net and attention mechanisms.
Details
Motivation: Breast cancer detection requires early diagnosis, but medical data privacy challenges limit AI model development. Federated Learning offers privacy-preserving distributed learning, but non-IID data affects model accuracy for tumor boundary delineation.
Method: Applied FedProx method to non-IID ultrasonic breast cancer datasets and incorporated modified U-Net model with attention mechanisms for enhanced tumor segmentation.
Result: Achieved 96% accuracy in the global model, demonstrating effective tumor segmentation while preserving patient privacy.
Conclusion: FedProx shows promise as an approach for training precise machine learning models on non-IID local medical datasets while maintaining privacy.
Abstract: Breast cancer is a leading cause of death among women worldwide, emphasizing the need for early detection and accurate diagnosis. Ultrasound imaging, a reliable and cost-effective tool, is widely used for this purpose; however, the sensitive nature of medical data makes it challenging to develop accurate and private artificial intelligence models. Federated Learning is a promising technique for distributed machine learning on sensitive medical data while preserving patient privacy. However, training on non-Independent and non-Identically Distributed (non-IID) local datasets can impact the accuracy and generalization of the trained model, which is crucial for accurate tumour boundary delineation in breast cancer segmentation. This study tackles this challenge by applying the Federated Proximal (FedProx) method to non-IID ultrasonic breast cancer imaging datasets. Moreover, we focus on enhancing tumour segmentation accuracy by incorporating a modified U-Net model with attention mechanisms. Our approach resulted in a global model with 96% accuracy, demonstrating its effectiveness in enhancing tumour segmentation accuracy while preserving patient privacy. Our findings suggest that FedProx is a promising approach for training precise machine learning models on non-IID local medical datasets.
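FedProx differs from plain federated averaging by adding a proximal term (mu/2)·||w − w_global||² to each client's local objective, which limits client drift on non-IID data. A minimal PyTorch sketch of one local update (the linear model merely stands in for the attention U-Net):

```python
import torch

def fedprox_local_step(model, global_params, batch, loss_fn, optimizer, mu=0.01):
    """One FedProx client update: task loss + (mu/2) * ||w - w_global||^2.
    `global_params` is a list of tensors snapshotting the server model."""
    x, y = batch
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    prox = sum(((p - g.detach()) ** 2).sum()
               for p, g in zip(model.parameters(), global_params))
    (loss + 0.5 * mu * prox).backward()
    optimizer.step()
    return loss.item()

# Toy usage with a linear "segmenter" standing in for the attention U-Net.
model = torch.nn.Linear(10, 2)
global_snapshot = [p.clone() for p in model.parameters()]
opt = torch.optim.SGD(model.parameters(), lr=0.1)
batch = (torch.randn(8, 10), torch.randint(0, 2, (8,)))
print(fedprox_local_step(model, global_snapshot, batch, torch.nn.CrossEntropyLoss(), opt))
```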
[124] Learning Differential Pyramid Representation for Tone Mapping
Qirui Yang, Yinbo Li, Yihao Liu, Peng-Tao Jiang, Fangpu Zhang, Qihua Cheng, Huanjing Yue, Jingyu Yang
Main category: cs.CV
TL;DR: DPRNet is a new tone mapping method that uses a learnable differential pyramid to preserve fine textures and structural fidelity in HDR scenes, outperforming existing methods on benchmark datasets.
Details
Motivation: Existing tone mapping methods fail to preserve fine textures and structural fidelity in complex HDR scenes, and lack mechanisms to jointly model global tone consistency and local contrast enhancement, leading to artifacts like halos.
Method: DPRNet uses a learnable differential pyramid that generalizes traditional pyramids through content-aware differencing operations across scales. It incorporates global tone perception and local tone tuning modules, plus an iterative detail enhancement module for coarse-to-fine reconstruction.
Result: DPRNet achieves state-of-the-art results, improving PSNR by 2.39 dB on the 4K HDR+ dataset and 3.01 dB on the 4K HDRI Haven dataset, producing perceptually coherent and detail-preserving results.
Conclusion: DPRNet provides an effective end-to-end framework for high-fidelity tone mapping that adaptively captures high-frequency variations and enforces perceptual consistency across different luminance and contrast conditions.
Abstract: Existing tone mapping methods operate on downsampled inputs and rely on handcrafted pyramids to recover high-frequency details. These designs typically fail to preserve fine textures and structural fidelity in complex HDR scenes. Furthermore, most methods lack an effective mechanism to jointly model global tone consistency and local contrast enhancement, leading to globally flat or locally inconsistent outputs such as halo artifacts. We present the Differential Pyramid Representation Network (DPRNet), an end-to-end framework for high-fidelity tone mapping. At its core is a learnable differential pyramid that generalizes traditional Laplacian and Difference-of-Gaussian pyramids through content-aware differencing operations across scales. This allows DPRNet to adaptively capture high-frequency variations under diverse luminance and contrast conditions. To enforce perceptual consistency, DPRNet incorporates global tone perception and local tone tuning modules operating on downsampled inputs, enabling efficient yet expressive tone adaptation. Finally, an iterative detail enhancement module progressively restores the full-resolution output in a coarse-to-fine manner, reinforcing structure and sharpness. Experiments show that DPRNet achieves state-of-the-art results, improving PSNR by 2.39 dB on the 4K HDR+ dataset and 3.01 dB on the 4K HDRI Haven dataset, while producing perceptually coherent, detail-preserving results. We provide an anonymous online demo at https://xxxxxxdprnet.github.io/DPRNet/.
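For reference, the classical fixed-kernel Laplacian pyramid that the learnable differential pyramid generalizes (by replacing the fixed blur-and-difference with learned, content-aware operations) looks like this:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def laplacian_pyramid(img: np.ndarray, levels: int = 3, sigma: float = 1.0):
    """Classical Laplacian pyramid: each level stores the high-frequency
    residual between the current image and its blurred, downsampled-then-
    upsampled version. Shown only as the fixed baseline DPRNet generalizes."""
    pyramid, cur = [], img.astype(np.float32)
    for _ in range(levels):
        low = gaussian_filter(cur, sigma)
        down = low[::2, ::2]
        up = zoom(down, 2, order=1)[: cur.shape[0], : cur.shape[1]]
        pyramid.append(cur - up)   # high-frequency detail at this scale
        cur = down
    pyramid.append(cur)            # coarsest residual image
    return pyramid

for level in laplacian_pyramid(np.random.rand(64, 64)):
    print(level.shape)             # (64, 64), (32, 32), (16, 16), (8, 8)
```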
[125] X-Ego: Acquiring Team-Level Tactical Situational Awareness via Cross-Egocentric Contrastive Video Representation Learning
Yunzhe Wang, Soham Hans, Volkan Ustun
Main category: cs.CV
TL;DR: This paper introduces X-Ego-CS, a benchmark dataset of synchronized first-person gameplay footage from Counter-Strike 2, and proposes Cross-Ego Contrastive Learning (CECL) to enable agents to infer teammate and opponent positions from individual perspectives.
Details
Motivation: Existing video understanding approaches for team interactions rely on third-person views and overlook the synchronous, egocentric nature of multi-agent learning in complex 3D environments.
Method: Created X-Ego-CS dataset with 124 hours of synchronized first-person gameplay footage from 45 professional Counter-Strike 2 matches. Proposed Cross-Ego Contrastive Learning (CECL) that aligns teammates' egocentric visual streams to develop team-level tactical awareness.
Result: CECL demonstrated effectiveness in teammate-opponent location prediction task, enhancing agents’ ability to infer both teammate and opponent positions from a single first-person view using state-of-the-art video encoders.
Conclusion: X-Ego-CS and CECL establish a foundation for cross-egocentric multi-agent benchmarking in esports and position gameplay understanding as a testbed for multi-agent modeling with implications for spatiotemporal reasoning and human-AI teaming.
Abstract: Human team tactics emerge from each player’s individual perspective and their ability to anticipate, interpret, and adapt to teammates’ intentions. While advances in video understanding have improved the modeling of team interactions in sports, most existing work relies on third-person broadcast views and overlooks the synchronous, egocentric nature of multi-agent learning. We introduce X-Ego-CS, a benchmark dataset consisting of 124 hours of gameplay footage from 45 professional-level matches of the popular e-sports game Counter-Strike 2, designed to facilitate research on multi-agent decision-making in complex 3D environments. X-Ego-CS provides cross-egocentric video streams that synchronously capture all players’ first-person perspectives along with state-action trajectories. Building on this resource, we propose Cross-Ego Contrastive Learning (CECL), which aligns teammates’ egocentric visual streams to foster team-level tactical situational awareness from an individual’s perspective. We evaluate CECL on a teammate-opponent location prediction task, demonstrating its effectiveness in enhancing an agent’s ability to infer both teammate and opponent positions from a single first-person view using state-of-the-art video encoders. Together, X-Ego-CS and CECL establish a foundation for cross-egocentric multi-agent benchmarking in esports. More broadly, our work positions gameplay understanding as a testbed for multi-agent modeling and tactical learning, with implications for spatiotemporal reasoning and human-AI teaming in both virtual and real-world domains. Code and dataset are available at https://github.com/HATS-ICT/x-ego.
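CECL is described as aligning teammates' synchronized egocentric streams; one natural reading is a symmetric InfoNCE objective in which clips from two teammates at the same timestamp are positives and all other pairs in the batch are negatives. A hedged sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def cross_ego_infonce(z_a: torch.Tensor, z_b: torch.Tensor, tau: float = 0.07):
    """z_a, z_b: (B, D) embeddings of two teammates' egocentric clips, where
    row i of each tensor is taken at the same timestamp. Matching rows are
    positives; all other pairs in the batch serve as negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau                  # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(cross_ego_infonce(torch.randn(16, 128), torch.randn(16, 128)))
```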
[126] FootFormer: Estimating Stability from Visual Input
Keaton Kraiger, Jingjing Li, Skanda Bharadwaj, Jesse Scott, Robert T. Collins, Yanxi Liu
Main category: cs.CV
TL;DR: FootFormer is a cross-modality approach that predicts human motion dynamics from visual input, achieving superior performance in estimating foot pressure distributions, foot contact maps, and center of mass compared to existing methods.
Details
Motivation: To develop a unified method that can jointly predict multiple human motion dynamics measures from visual input, addressing limitations of existing approaches that typically generate only one or two of these measures.
Method: A cross-modality approach that directly processes visual input to predict foot pressure distributions, foot contact maps, and center of mass simultaneously.
Result: FootFormer achieves statistically significantly better or equivalent performance on multiple datasets for foot pressure distributions, foot contact maps, and center of mass estimation, and sets SOTA performance for stability-predictive components (CoP, CoM, BoS).
Conclusion: FootFormer provides an effective cross-modality solution for comprehensive human motion dynamics prediction from visual data, outperforming existing methods and achieving state-of-the-art performance in stability-related metrics.
Abstract: We propose FootFormer, a cross-modality approach for jointly predicting human motion dynamics directly from visual input. On multiple datasets, FootFormer achieves statistically significantly better or equivalent estimates of foot pressure distributions, foot contact maps, and center of mass (CoM), as compared with existing methods that generate one or two of those measures. Furthermore, FootFormer achieves SOTA performance in estimating stability-predictive components (CoP, CoM, BoS) used in classic kinesiology metrics. Code and data are available at https://github.com/keatonkraiger/Vision-to-Stability.git.
[127] Malaria Detection from Blood Cell Images Using XceptionNet
Warisa Nusrat, Mostafijur Rahman, Ayatullah Faruk Mollah
Main category: cs.CV
TL;DR: Deep learning networks applied to automatically detect malaria from blood cell images, with Residual Attention Network and XceptionNet achieving best performance (97.28% and 97.55% accuracy respectively).
Details
Motivation: Manual malaria diagnosis through microscope observation is prone to errors due to lack of expertise and manual involvement. Computer-aided automatic diagnosis is needed for reliable detection.
Method: Applied six deep convolutional networks (AlexNet, XceptionNet, VGG-19, Residual Attention Network, DenseNet-121, Custom-CNN) to extract features from blood cell images and classify them as malaria infected or healthy cells.
Result: Residual Attention Network and XceptionNet performed best with 97.28% and 97.55% average accuracy respectively, surpassing other methods on the same dataset.
Conclusion: Deep learning methods show strong potential for automatic and reliable malaria detection while minimizing manual involvement.
Abstract: Malaria, which spreads primarily through the bite of female Anopheles mosquitoes, often leads to death, particularly among children aged 0-5 years. Clinical experts identify malaria by observing red blood cells in blood smear images under a microscope. A lack of adequate professional knowledge and skill, and above all the manual nature of the process, can lead to incorrect diagnosis. Computer-aided automatic diagnosis therefore stands as a preferred substitute. In this paper, well-established deep networks are applied to extract intrinsic features from blood cell images and classify them as malaria-infected or healthy cells. Among the six deep convolutional networks employed in this work, namely AlexNet, XceptionNet, VGG-19, Residual Attention Network, DenseNet-121, and a custom CNN, Residual Attention Network and XceptionNet perform relatively better than the rest on a publicly available malaria cell image dataset, yielding average accuracies of 97.28% and 97.55% respectively and surpassing other related methods on the same dataset. These findings strongly support the feasibility of deep learning-driven methods for automatic and reliable malaria detection with minimal manual involvement.
[128] PruneHal: Reducing Hallucinations in Multi-modal Large Language Models through Adaptive KV Cache Pruning
Fengyuan Sun, Hui Chen, Xinhao Xu, Dandan Zheng, Jingdong Chen, Jun Zhou, Jungong Han, Guiguang Ding
Main category: cs.CV
TL;DR: PruneHal is a training-free method that uses adaptive KV cache pruning to reduce hallucinations in multi-modal large language models by enhancing focus on critical visual tokens.
Details
Motivation: Hallucinations in MLLMs are strongly associated with insufficient attention to visual tokens, where redundant visual tokens disperse the model's attention away from informative ones.
Method: Proposes PruneHal, which leverages adaptive KV cache pruning to enhance the model's focus on critical visual information without requiring additional training.
Result: Evaluated on several hallucination benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight effectiveness and superiority.
Conclusion: PruneHal is a simple yet effective, training-free method that incurs nearly no extra inference cost and can be seamlessly integrated with different decoding strategies.
Abstract: While multi-modal large language models (MLLMs) have made significant progress in recent years, the issue of hallucinations remains a major challenge. To mitigate this phenomenon, existing solutions either introduce additional data for further training or incorporate external or internal information during inference. However, these approaches inevitably introduce extra computational costs. In this paper, we observe that hallucinations in MLLMs are strongly associated with insufficient attention allocated to visual tokens. In particular, the presence of redundant visual tokens disperses the model's attention, preventing it from focusing on the most informative ones. As a result, critical visual cues are often under-attended, which in turn exacerbates the occurrence of hallucinations. Building on this observation, we propose PruneHal, a training-free, simple yet effective method that leverages adaptive KV cache pruning to enhance the model's focus on critical visual information, thereby mitigating hallucinations. To the best of our knowledge, we are the first to apply token pruning for hallucination mitigation in MLLMs. Notably, our method does not require additional training and incurs nearly no extra inference cost. Moreover, PruneHal is model-agnostic and can be seamlessly integrated with different decoding strategies, including those specifically designed for hallucination mitigation. We evaluate PruneHal on several widely used hallucination evaluation benchmarks using four mainstream MLLMs, achieving robust and outstanding results that highlight the effectiveness and superiority of our method. Our code will be publicly available.
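The KV-cache pruning idea can be sketched as scoring each cached visual token by the attention mass it has received from recent text queries and keeping only the most-attended ones. The top-k keep-ratio rule below is illustrative; the paper's adaptive criterion is not spelled out in the abstract:

```python
import torch

def prune_visual_kv(keys, values, attn, visual_idx, keep_ratio=0.5):
    """Keep only the most-attended visual tokens in the KV cache.
    keys/values: (T, D) cached entries, attn: (Q, T) attention weights from
    recent text queries, visual_idx: indices of visual tokens in the cache.
    The selection rule (top-k by accumulated attention) is illustrative."""
    scores = attn[:, visual_idx].sum(0)               # attention mass per visual token
    k = max(1, int(keep_ratio * len(visual_idx)))
    kept_visual = visual_idx[scores.topk(k).indices]
    keep = torch.ones(keys.size(0), dtype=torch.bool)
    keep[visual_idx] = False                          # drop all visual tokens...
    keep[kept_visual] = True                          # ...except the top-k
    return keys[keep], values[keep]

T, D = 32, 16
visual_idx = torch.arange(0, 24)                      # first 24 tokens are visual
k, v = torch.randn(T, D), torch.randn(T, D)
attn = torch.rand(4, T)
pk, pv = prune_visual_kv(k, v, attn, visual_idx)
print(pk.shape)                                       # 12 visual + 8 text = 20 tokens
```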
[129] Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning
Takehiro Aoshima, Yusuke Shinohara, Park Byeongseon
Main category: cs.CV
TL;DR: Proposes Video Consistency Distance (VCD), a novel metric for enhancing temporal consistency in image-to-video generation through frequency-domain analysis and reward-based fine-tuning.
Details
Motivation: Conventional reward functions focus on overall video quality but fail to address temporal consistency issues in image-to-video generation tasks, leading to incoherent video sequences.
Method: Defines VCD in frequency space of video frame features to capture temporal information through frequency-domain analysis, then fine-tunes video generation models using this metric in a reward-based framework.
Result: Experimental results show that fine-tuning with VCD significantly enhances temporal consistency across multiple I2V datasets without degrading other performance metrics compared to previous methods.
Conclusion: VCD effectively addresses temporal consistency limitations in I2V generation and provides a robust solution for improving video coherence while maintaining overall quality.
Abstract: Reward-based fine-tuning of video diffusion models is an effective approach to improve the quality of generated videos, as it can fine-tune models without requiring real-world video datasets. However, its benefits are sometimes limited to specific aspects of quality, because conventional reward functions mainly target properties of the whole generated video sequence, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when applying previous approaches to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with the reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame features to capture frame information effectively through frequency-domain analysis. Experimental results across multiple I2V datasets demonstrate that fine-tuning a video generation model with VCD significantly enhances temporal consistency without degrading other performance metrics compared to previous methods.
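The abstract defines VCD in the frequency space of per-frame features but leaves the exact form open; one way to make the idea concrete (purely an assumption for illustration, not the paper's formulation) is to compare the temporal magnitude spectrum of generated frame features against that of a perfectly static video anchored at the conditioning frame:

```python
import torch

def video_consistency_distance(frame_feats, cond_feat):
    """Hedged sketch of a frequency-domain temporal-consistency penalty:
    FFT the per-frame features along time and compare the magnitude spectrum
    against that of an ideally static video (the conditioning image's feature
    repeated). frame_feats: (T, D), cond_feat: (D,)."""
    gen_spec = torch.fft.rfft(frame_feats, dim=0).abs()   # (T//2+1, D)
    ref = cond_feat.expand_as(frame_feats)                # static reference video
    ref_spec = torch.fft.rfft(ref, dim=0).abs()
    return (gen_spec - ref_spec).pow(2).mean()

feats = torch.randn(16, 64)   # features of 16 generated frames
cond = feats[0]               # conditioning (first) frame feature
print(video_consistency_distance(feats, cond))
```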
[130] Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks
Kai Zeng, Zhanqian Wu, Kaixin Xiong, Xiaobao Wei, Xiangyu Guo, Zhenxin Zhu, Kalok Ho, Lijun Zhou, Bohan Zeng, Ming Lu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wentao Zhang
Main category: cs.CV
TL;DR: Dream4Drive is a synthetic data generation framework that creates multi-view photorealistic videos for autonomous driving perception tasks by decomposing videos into 3D-aware guidance maps and rendering 3D assets, significantly improving corner case detection.
Details
Motivation: Existing driving world models focus on generation quality metrics but overlook downstream perception task evaluation, which is crucial for autonomous driving performance. Current methods require double training epochs when using synthetic data, making the benefits negligible.
Method: Dream4Drive decomposes input videos into 3D-aware guidance maps, renders 3D assets onto these maps, and fine-tunes the driving world model to produce edited multi-view photorealistic videos for training perception models.
Result: The framework enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. Comprehensive experiments show Dream4Drive effectively boosts downstream perception model performance under various training epochs.
Conclusion: Dream4Drive provides a novel approach to synthetic data generation that enhances autonomous driving perception tasks, particularly for corner cases, with the additional contribution of a large-scale 3D asset dataset (DriveObj3D) for future research.
Abstract: Recent advancements in driving world models enable controllable generation of high-quality RGB videos or multimodal videos. Existing methods primarily focus on metrics related to generation quality and controllability. However, they often overlook the evaluation of downstream perception tasks, which are really crucial for the performance of autonomous driving. Existing methods usually leverage a training strategy that first pretrains on synthetic data and finetunes on real data, resulting in twice the epochs compared to the baseline (real data only). When we double the epochs in the baseline, the benefit of synthetic data becomes negligible. To thoroughly demonstrate the benefit of synthetic data, we introduce Dream4Drive, a novel synthetic data generation framework designed for enhancing the downstream perception tasks. Dream4Drive first decomposes the input video into several 3D-aware guidance maps and subsequently renders the 3D assets onto these guidance maps. Finally, the driving world model is fine-tuned to produce the edited, multi-view photorealistic videos, which can be used to train the downstream perception models. Dream4Drive enables unprecedented flexibility in generating multi-view corner cases at scale, significantly boosting corner case perception in autonomous driving. To facilitate future research, we also contribute a large-scale 3D asset dataset named DriveObj3D, covering the typical categories in driving scenarios and enabling diverse 3D-aware video editing. We conduct comprehensive experiments to show that Dream4Drive can effectively boost the performance of downstream perception models under various training epochs. Project: https://wm-research.github.io/Dream4Drive/
[131] MoE-GS: Mixture of Experts for Dynamic Gaussian Splatting
In-Hwan Jin, Hyeongju Mun, Joonsoo Kim, Kugjin Yun, Kyeongbo Kong
Main category: cs.CV
TL;DR: MoE-GS is a unified framework that integrates multiple specialized experts via a Volume-aware Pixel Router for dynamic Gaussian splatting, improving rendering quality while addressing efficiency through distillation and optimization techniques.
Details
Motivation: Existing dynamic scene reconstruction methods show inconsistent performance across diverse scenes, with no single approach effectively handling all dynamic challenges.
Method: Proposes Mixture of Experts for Dynamic Gaussian Splatting (MoE-GS) with a Volume-aware Pixel Router that adaptively blends expert outputs using differentiable weight splatting. Also includes efficiency improvements through single-pass multi-expert rendering, gate-aware Gaussian pruning, and distillation to transfer MoE performance to individual experts.
Result: MoE-GS consistently outperforms state-of-the-art methods on N3V and Technicolor datasets with improved efficiency, though with increased model capacity and reduced FPS inherent to the MoE architecture.
Conclusion: MoE-GS is the first approach to incorporate Mixture-of-Experts techniques into dynamic Gaussian splatting, providing a unified framework that achieves superior performance while offering deployment flexibility through distillation.
Abstract: Recent advances in dynamic scene reconstruction have significantly benefited from 3D Gaussian Splatting, yet existing methods show inconsistent performance across diverse scenes, indicating no single approach effectively handles all dynamic challenges. To overcome these limitations, we propose Mixture of Experts for Dynamic Gaussian Splatting (MoE-GS), a unified framework integrating multiple specialized experts via a novel Volume-aware Pixel Router. Our router adaptively blends expert outputs by projecting volumetric Gaussian-level weights into pixel space through differentiable weight splatting, ensuring spatially and temporally coherent results. Although MoE-GS improves rendering quality, the increased model capacity and reduced FPS are inherent to the MoE architecture. To mitigate this, we explore two complementary directions: (1) single-pass multi-expert rendering and gate-aware Gaussian pruning, which improve efficiency within the MoE framework, and (2) a distillation strategy that transfers MoE performance to individual experts, enabling lightweight deployment without architectural changes. To the best of our knowledge, MoE-GS is the first approach incorporating Mixture-of-Experts techniques into dynamic Gaussian splatting. Extensive experiments on the N3V and Technicolor datasets demonstrate that MoE-GS consistently outperforms state-of-the-art methods with improved efficiency. Video demonstrations are available at https://anonymous.4open.science/w/MoE-GS-68BA/.
[132] SFGFusion: Surface Fitting Guided 3D Object Detection with 4D Radar and Camera Fusion
Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang
Main category: cs.CV
TL;DR: SFGFusion is a camera-4D imaging radar fusion network that uses surface fitting to enhance spatial representation and cross-modal interaction for 3D object detection in autonomous driving.
Details
Motivation: 4D imaging radar offers advantages like low cost and long-range detection but suffers from sparse point clouds and low resolution, limiting object geometric representation and hindering effective multi-modal fusion with cameras.
Method: The method estimates quadratic surface parameters from image and radar data to create an explicit surface fitting model. This generates dense pseudo-point clouds to mitigate radar sparsity and guides image feature transformation from perspective view to bird's-eye view. Features from both radar and pseudo-point branches are transformed to BEV space for fusion.
Result: SFGFusion achieves superior performance on TJ4DRadSet and view-of-delft (VoD) object detection benchmarks, effectively fusing camera and 4D radar features.
Conclusion: The surface fitting approach successfully enhances spatial representation and cross-modal interaction, enabling reliable dense depth prediction and improving 3D object detection performance in autonomous driving applications.
Abstract: 3D object detection is essential for autonomous driving. As an emerging sensor, 4D imaging radar offers advantages such as low cost, long-range detection, and accurate velocity measurement, making it highly suitable for object detection. However, its sparse point clouds and low resolution limit object geometric representation and hinder multi-modal fusion. In this study, we introduce SFGFusion, a novel camera-4D imaging radar detection network guided by surface fitting. By estimating quadratic surface parameters of objects from image and radar data, the explicit surface fitting model enhances spatial representation and cross-modal interaction, enabling more reliable prediction of fine-grained dense depth. The predicted depth serves two purposes: 1) in an image branch to guide the transformation of image features from perspective view (PV) to a unified bird's-eye view (BEV) for multi-modal fusion, improving spatial mapping accuracy; and 2) in a surface pseudo-point branch to generate a dense pseudo-point cloud, mitigating the radar point sparsity. The original radar point cloud is also encoded in a separate radar branch. These two point cloud branches adopt a pillar-based method and subsequently transform the features into the BEV space. Finally, a standard 2D backbone and detection head are used to predict object labels and bounding boxes from BEV features. Experimental results show that SFGFusion effectively fuses camera and 4D radar features, achieving superior performance on the TJ4DRadSet and view-of-delft (VoD) object detection benchmarks.
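The surface-fitting component amounts to fitting a quadratic surface to the sparse radar returns on each object; a least-squares sketch (the z = ax² + by² + cxy + dx + ey + f parameterization is an assumed instantiation of "quadratic surface parameters"):

```python
import numpy as np

def fit_quadratic_surface(pts: np.ndarray) -> np.ndarray:
    """Least-squares fit of z = a*x^2 + b*y^2 + c*x*y + d*x + e*y + f to a
    sparse set of 3D points (e.g. radar returns on one object).
    pts: (N, 3) array of x, y, z. Returns the 6 surface coefficients."""
    x, y, z = pts[:, 0], pts[:, 1], pts[:, 2]
    A = np.stack([x**2, y**2, x*y, x, y, np.ones_like(x)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs

def eval_surface(coeffs, x, y):
    a, b, c, d, e, f = coeffs
    return a*x**2 + b*y**2 + c*x*y + d*x + e*y + f

# Toy check: recover a known paraboloid from noisy samples.
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, (200, 2))
z = 0.3*xy[:, 0]**2 + 0.1*xy[:, 1]**2 + 0.05 + rng.normal(0, 0.01, 200)
coeffs = fit_quadratic_surface(np.column_stack([xy, z]))
print(np.round(coeffs, 2))   # approximately [0.3, 0.1, 0, 0, 0, 0.05]
```

Evaluating the fitted surface on a dense grid of (x, y) locations is one way such a model could yield the dense pseudo-points described in the abstract.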
[133] Space Object Detection using Multi-frame Temporal Trajectory Completion Method
Xiaoqing Lan, Biqiao Xin, Bingshu Wang, Han Zhang, Laixian Zhang
Main category: cs.CV
TL;DR: A method for detecting GEO space objects in optical images using wavelet transform for feature enhancement and multi-frame trajectory completion with Hungarian algorithm for optimal matching.
Details
Motivation: Space objects in GEO are hard to detect due to weak signals, complex stellar backgrounds, and environmental interference in optical imaging.
Method: Wavelet transform for single-frame feature enhancement and noise suppression, followed by multi-frame temporal trajectory completion using Hungarian algorithm for cross-frame matching, with post-processing steps including temporal matching, interpolation, noise filtering, and trajectory refinement.
Result: Achieved 90.14% F1 score on the public SpotGEO dataset.
Conclusion: The proposed method effectively addresses GEO object detection challenges through combined spatial feature enhancement and temporal trajectory optimization.
Abstract: Space objects in Geostationary Earth Orbit (GEO) present significant detection challenges in optical imaging due to weak signals, complex stellar backgrounds, and environmental interference. In this paper, we enhance high-frequency features of GEO targets while suppressing background noise at the single-frame level through wavelet transform. Building on this, we propose a multi-frame temporal trajectory completion scheme centered on the Hungarian algorithm for globally optimal cross-frame matching. To effectively mitigate missing and false detections, a series of key steps including temporal matching and interpolation completion, temporal-consistency-based noise filtering, and progressive trajectory refinement are designed in the post-processing pipeline. Experimental results on the public SpotGEO dataset demonstrate the effectiveness of the proposed method, achieving an F1 score of 90.14%.
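Globally optimal cross-frame matching with the Hungarian algorithm is available off the shelf in scipy; a minimal sketch with a simple Euclidean cost and a gating threshold (the cost choice and threshold value are illustrative, not the paper's exact settings):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_detections(prev: np.ndarray, curr: np.ndarray, max_dist: float = 5.0):
    """Globally optimal cross-frame association of candidate detections.
    prev: (M, 2) and curr: (N, 2) pixel positions in consecutive frames.
    Returns (i, j) index pairs; pairs farther than `max_dist` are rejected,
    leaving them to the interpolation / trajectory-completion stage."""
    cost = np.linalg.norm(prev[:, None, :] - curr[None, :, :], axis=-1)  # (M, N)
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= max_dist]

prev = np.array([[10.0, 10.0], [40.0, 5.0]])
curr = np.array([[41.0, 6.0], [11.0, 9.0], [200.0, 200.0]])
print(match_detections(prev, curr))   # [(0, 1), (1, 0)]
```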
[134] Background Fades, Foreground Leads: Curriculum-Guided Background Pruning for Efficient Foreground-Centric Collaborative Perception
Yuheng Wu, Xiangbo Gao, Quang Tau, Zhengzhong Tu, Dongman Lee
Main category: cs.CV
TL;DR: FadeLead is a collaborative perception framework that learns to encapsulate background context into compact foreground features, enabling efficient information sharing under bandwidth constraints without transmitting background data.
Details
Motivation: Current collaborative perception methods discard background information to save bandwidth, but background encodes essential context that improves perception reliability, especially for challenging long-tail scenarios.
Method: Uses a curricular learning strategy that initially leverages background cues but progressively prunes them away, forcing the model to internalize context into foreground representations during training.
Result: Extensive experiments on simulated and real-world benchmarks show FadeLead outperforms prior methods under different bandwidth settings.
Conclusion: Context-enriched foreground sharing through FadeLead effectively balances bandwidth efficiency and perception performance in collaborative autonomous driving systems.
Abstract: Collaborative perception enhances the reliability and spatial coverage of autonomous vehicles by sharing complementary information across vehicles, offering a promising solution to long-tail scenarios that challenge single-vehicle perception. However, the bandwidth constraints of vehicular networks make transmitting the entire feature map impractical. Recent methods, therefore, adopt a foreground-centric paradigm, transmitting only predicted foreground-region features while discarding the background, which encodes essential context. We propose FadeLead, a foreground-centric framework that overcomes this limitation by learning to encapsulate background context into compact foreground features during training. At the core of our design is a curricular learning strategy that leverages background cues early on but progressively prunes them away, forcing the model to internalize context into foreground representations without transmitting background itself. Extensive experiments on both simulated and real-world benchmarks show that FadeLead outperforms prior methods under different bandwidth settings, underscoring the effectiveness of context-enriched foreground sharing.
[135] Advances in 4D Representation: Geometry, Motion, and Interaction
Mingrui Zhao, Sauradip Nag, Kai Wang, Aditya Vora, Guangda Ji, Peter Chun, Ali Mahdavi-Amiri, Hao Zhang
Main category: cs.CV
TL;DR: A survey on 4D generation and reconstruction focusing on 4D representations for modeling 3D geometry evolving over time with motion and interaction, organized around geometry, motion, and interaction pillars.
Details
Motivation: To provide a unique perspective on 4D representations for modeling dynamic 3D scenes with motion and interaction, helping readers select appropriate representations for their tasks.
Method: Selective approach focusing on representative works to highlight desirable properties and challenges of different 4D representations under various scenarios, organized around geometry, motion, and interaction pillars.
Result: Comprehensive coverage of 4D representations including popular methods like NeRFs and 3DGS, under-explored approaches like structured models and long-range motions, and analysis of LLMs/VFMs in 4D applications.
Conclusion: The survey provides guidance on selecting and customizing 4D representations for specific tasks, identifies current dataset limitations, and discusses future directions for the subfield.
Abstract: We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is on how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide a dedicated coverage on what 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/
[136] SCEESR: Semantic-Control Edge Enhancement for Diffusion-Based Super-Resolution
Yun Kai Zhuang
Main category: cs.CV
TL;DR: Proposes a novel super-resolution framework using ControlNet for semantic edge guidance in one-step diffusion models, achieving improved structural integrity and realism while maintaining efficiency.
Details
Motivation: Address the trade-off between computational cost and perceptual quality in real-world image super-resolution, particularly the structural inaccuracies in one-step diffusion models caused by distillation artifacts.
Method: Enhances one-step diffusion model with ControlNet mechanism for semantic edge guidance, and uses hybrid loss combining L2, LPIPS, and edge-aware AME loss for optimization.
Result: Effectively improves structural integrity and realism while maintaining one-step generation efficiency, achieving superior balance between output quality and inference speed.
Conclusion: The proposed framework successfully addresses the structural limitations of one-step diffusion models in super-resolution tasks through semantic edge guidance and hybrid loss optimization.
Abstract: Real-world image super-resolution (Real-ISR) must handle complex degradations and inherent reconstruction ambiguities. While generative models have improved perceptual quality, a key trade-off remains with computational cost. One-step diffusion models offer speed but often produce structural inaccuracies due to distillation artifacts. To address this, we propose a novel SR framework that enhances a one-step diffusion model using a ControlNet mechanism for semantic edge guidance. This integrates edge information to provide dynamic structural control during single-pass inference. We also introduce a hybrid loss combining L2, LPIPS, and an edge-aware AME loss to optimize for pixel accuracy, perceptual quality, and geometric precision. Experiments show our method effectively improves structural integrity and realism while maintaining the efficiency of one-step generation, achieving a superior balance between output quality and inference speed. The results of test datasets will be published at https://drive.google.com/drive/folders/1amddXQ5orIyjbxHgGpzqFHZ6KTolinJF?usp=drive_link and the related code will be published at https://github.com/ARBEZ-ZEBRA/SCEESR.
[137] MobiAct: Efficient MAV Action Recognition Using MobileNetV4 with Contrastive Learning and Knowledge Distillation
Zhang Nengbo, Ho Hann Woei
Main category: cs.CV
TL;DR: Proposes MobiAct, a lightweight MAV action recognition framework using MobileNetV4 backbone with knowledge distillation and attention mechanisms to achieve high accuracy with low computational cost for resource-constrained MAV platforms.
Details
Motivation: Existing MAV motion recognition approaches use large, computationally intensive models unsuitable for resource-limited MAV platforms, creating a trade-off between accuracy and inference speed.
Method: Uses MobileNetV4 backbone with Stage-wise Orthogonal Knowledge Distillation (SOKD) to transfer features from ResNet18 teacher, integrates parameter-free attention mechanism, and employs hybrid loss training strategy for stable optimization.
Result: Achieves 92.12% average accuracy across three datasets, consumes only 136.16 pJ energy, processes 8.84 actions/second, and decodes actions 2x faster than leading methods with comparable accuracy.
Conclusion: MobiAct enables efficient, low-energy MAV action recognition suitable for resource-constrained platforms while maintaining high accuracy and superior inference speed compared to existing methods.
Abstract: Accurate and efficient recognition of Micro Air Vehicle (MAV) motion is essential for enabling real-time perception and coordination in autonomous aerial swarm. However, most existing approaches rely on large, computationally intensive models that are unsuitable for resource-limited MAV platforms, which results in a trade-off between recognition accuracy and inference speed. To address these challenges, this paper proposes a lightweight MAV action recognition framework, MobiAct, designed to achieve high accuracy with low computational cost. Specifically, MobiAct adopts MobileNetV4 as the backbone network and introduces a Stage-wise Orthogonal Knowledge Distillation (SOKD) strategy to effectively transfer MAV motion features from a teacher network (ResNet18) to a student network, thereby enhancing knowledge transfer efficiency. Furthermore, a parameter-free attention mechanism is integrated into the architecture to improve recognition accuracy without increasing model complexity. In addition, a hybrid loss training strategy is developed to combine multiple loss objectives, which ensures stable and robust optimization during training. Experimental results demonstrate that the proposed MobiAct achieves low-energy and low-computation MAV action recognition, while maintaining the fastest action decoding speed among compared methods. Across all three self-collected datasets, MobiAct achieves an average recognition accuracy of 92.12%, while consuming only 136.16 pJ of energy and processing recognition at a rate of 8.84 actions per second. Notably, MobiAct decodes actions up to 2 times faster than the leading method, with highly comparable recognition accuracy, highlighting its superior efficiency in MAV action recognition.
[138] D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
Nobline Yoo, Olga Russakovsky, Ye Zhu
Main category: cs.CV
TL;DR: D2D is a framework that transforms non-differentiable detection models into differentiable critics to improve object counting accuracy in text-to-image generation without compromising image quality.
Details
Motivation: Existing T2I diffusion models struggle with generating the correct number of objects specified in prompts, and current approaches are limited to regression-based models, excluding superior detector-based models due to their non-differentiable nature.
Method: Proposes Detector-to-Differentiable (D2D) framework with custom activation functions to convert detector logits into soft binary indicators, which optimize noise prior at inference time with pre-trained T2I models.
Result: Substantial improvements in object counting accuracy across multiple benchmarks (up to 13.7% on D2D-Small), with minimal degradation in image quality and computational overhead.
Conclusion: D2D successfully bridges the gap between superior counting ability of detection models and the differentiability requirements of T2I generation, enabling more accurate object counting in diffusion models.
Abstract: Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and computational overhead.
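For intuition, the sketch below shows how per-proposal detector logits can be squashed into soft binary indicators whose sum is a differentiable count; the sharpened sigmoid and the squared-error counting loss are assumptions standing in for the paper's custom activation functions.

```python
# Sketch of a differentiable count from detector logits (assumed: a sharpened
# sigmoid as the "custom activation", and a squared-error counting loss).
import torch

def soft_count(class_logits, temperature=0.1):
    """class_logits: (num_proposals,) logits for the target class."""
    soft_indicators = torch.sigmoid(class_logits / temperature)  # ~0 or ~1 per proposal
    return soft_indicators.sum()                                  # differentiable count

logits = torch.tensor([4.2, 3.8, -5.0, 0.2], requires_grad=True)
count = soft_count(logits)
loss = (count - 3.0) ** 2        # the prompt asks for 3 objects
loss.backward()                  # gradients can steer the noise prior at inference
print(count.item(), logits.grad)
```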
[139] Enhancing Early Alzheimer Disease Detection through Big Data and Ensemble Few-Shot Learning
Safa Ben Atitallah, Maha Driss, Wadii Boulila, Anis Koubaa
Main category: cs.CV
TL;DR: The paper proposes an ensemble approach using Few-Shot Learning with pre-trained CNNs and Prototypical Networks for Alzheimer’s disease detection, achieving over 99% accuracy on two datasets despite limited labeled data.
Details
Motivation: Alzheimer's disease detection faces challenges due to limited labeled medical data, disease complexity, and data privacy constraints, requiring effective methods that work with scarce labeled data.
Method: Ensemble approach based on Prototypical Network (ProtoNet) integrating various pre-trained CNNs as encoders, combined with class-aware loss and entropy loss for precise classification of Alzheimer's progression levels.
Result: Achieved 99.72% accuracy on Kaggle Alzheimer dataset and 99.86% accuracy on ADNI dataset, outperforming state-of-the-art methods.
Conclusion: The proposed approach demonstrates superior accuracy and potential for real-world applications in early Alzheimer’s disease detection, particularly effective in few-shot learning scenarios with limited labeled data.
Abstract: Alzheimer disease is a severe brain disorder that causes harm in various brain areas and leads to memory damage. The limited availability of labeled medical data poses a significant challenge for accurate Alzheimer disease detection. There is a critical need for effective methods to improve the accuracy of Alzheimer disease detection, considering the scarcity of labeled data, the complexity of the disease, and the constraints related to data privacy. To address this challenge, our study leverages the power of big data in the form of pre-trained Convolutional Neural Networks (CNNs) within the framework of Few-Shot Learning (FSL) and ensemble learning. We propose an ensemble approach based on a Prototypical Network (ProtoNet), a powerful method in FSL, integrating various pre-trained CNNs as encoders. This integration enhances the richness of features extracted from medical images. Our approach also includes a combination of class-aware loss and entropy loss to ensure a more precise classification of Alzheimer disease progression levels. The effectiveness of our method was evaluated using two datasets, the Kaggle Alzheimer dataset and the ADNI dataset, achieving an accuracy of 99.72% and 99.86%, respectively. The comparison of our results with relevant state-of-the-art studies demonstrated that our approach achieved superior accuracy and highlighted its validity and potential for real-world applications in early Alzheimer disease detection.
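The classification rule at the heart of a Prototypical Network fits in a few lines: prototypes are mean support embeddings, and queries receive a softmax over negative distances. The sketch below assumes generic feature vectors; the paper's pre-trained CNN encoders, class-aware loss, and entropy loss are not shown.

```python
# Prototypical-network classification sketch: prototypes are mean support
# embeddings; queries get a softmax over negative squared distances.
# The random "embeddings" stand in for features from pre-trained CNN encoders.
import torch

def prototypical_probs(support, support_labels, query, num_classes):
    prototypes = torch.stack([support[support_labels == c].mean(dim=0)
                              for c in range(num_classes)])        # (C, D)
    dists = torch.cdist(query, prototypes) ** 2                    # (Q, C)
    return torch.softmax(-dists, dim=1)                            # class probabilities

support = torch.randn(20, 64)                        # 20 support embeddings, dim 64
support_labels = torch.arange(4).repeat_interleave(5)  # 5 shots per progression level
query = torch.randn(5, 64)
print(prototypical_probs(support, support_labels, query, num_classes=4).shape)  # (5, 4)
```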
[140] Vision-Based Mistake Analysis in Procedural Activities: A Review of Advances and Challenges
Konstantinos Bacharidis, Antonis A. Argyros
Main category: cs.CV
TL;DR: This paper reviews vision-based methods for detecting and predicting mistakes in procedural activities, covering applications in industrial automation, rehabilitation, education, and human-robot collaboration.
Details
Motivation: Mistake analysis in procedural activities is critical for enhancing safety, efficiency, and task performance across diverse domains like industrial automation, physical rehabilitation, education, and human-robot collaboration.
Method: The paper reviews vision-based approaches leveraging computer vision advancements including action recognition, anticipation, and activity understanding to detect deviations in task execution such as incorrect sequencing, improper techniques, or timing errors.
Result: Provides a comprehensive overview of existing datasets, evaluation metrics, and state-of-the-art methods, categorizing approaches based on procedural structure usage, supervision levels, and learning strategies.
Conclusion: Establishes a unified perspective on vision-based mistake analysis, discusses open challenges like distinguishing permissible variations from true mistakes, and identifies future directions including neuro-symbolic reasoning and counterfactual state modeling.
Abstract: Mistake analysis in procedural activities is a critical area of research with applications spanning industrial automation, physical rehabilitation, education and human-robot collaboration. This paper reviews vision-based methods for detecting and predicting mistakes in structured tasks, focusing on procedural and executional errors. By leveraging advancements in computer vision, including action recognition, anticipation and activity understanding, vision-based systems can identify deviations in task execution, such as incorrect sequencing, use of improper techniques, or timing errors. We explore the challenges posed by intra-class variability, viewpoint differences and compositional activity structures, which complicate mistake detection. Additionally, we provide a comprehensive overview of existing datasets, evaluation metrics and state-of-the-art methods, categorizing approaches based on their use of procedural structure, supervision levels and learning strategies. Open challenges, such as distinguishing permissible variations from true mistakes and modeling error propagation are discussed alongside future directions, including neuro-symbolic reasoning and counterfactual state modeling. This work aims to establish a unified perspective on vision-based mistake analysis in procedural activities, highlighting its potential to enhance safety, efficiency and task performance across diverse domains.
[141] Unified Reinforcement and Imitation Learning for Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
Main category: cs.CV
TL;DR: RIL is a unified training algorithm combining reinforcement learning and adversarial imitation learning to create efficient, lightweight vision-language models that can compete with larger models.
Details
Motivation: Large-scale VLMs are impractical for resource-constrained environments, creating a need for powerful but lightweight alternatives.
Method: Combines reinforcement learning with adversarial imitation learning using an LLM-based discriminator and guidance from multiple large teacher VLMs to enable smaller student models to mimic and improve upon teacher outputs.
Result: Significant performance gains making student models competitive with leading closed-source VLMs, narrowing performance gaps and surpassing state-of-the-art models in several instances.
Conclusion: The unified RIL approach effectively creates powerful lightweight VLMs that can compete with much larger models while being more practical for resource-constrained environments.
Abstract: Vision-Language Models (VLMs) have achieved remarkable progress, yet their large scale often renders them impractical for resource-constrained environments. This paper introduces Unified Reinforcement and Imitation Learning (RIL), a novel and efficient training algorithm designed to create powerful, lightweight VLMs. RIL distinctively combines the strengths of reinforcement learning with adversarial imitation learning. This enables smaller student VLMs not only to mimic the sophisticated text generation of large teacher models but also to systematically improve their generative capabilities through reinforcement signals. Key to our imitation framework is an LLM-based discriminator that adeptly distinguishes between student and teacher outputs, complemented by guidance from multiple large teacher VLMs to ensure diverse learning. This unified learning strategy, leveraging both reinforcement and imitation, empowers student models to achieve significant performance gains, making them competitive with leading closed-source VLMs. Extensive experiments on diverse vision-language benchmarks demonstrate that RIL significantly narrows the performance gap with state-of-the-art open- and closed-source VLMs and, in several instances, surpasses them.
[142] Online Handwritten Signature Verification Based on Temporal-Spatial Graph Attention Transformer
Hai-jie Yuan, Heng Zhang, Fei Yin
Main category: cs.CV
TL;DR: TS-GATR is a novel dynamic signature verification method that combines Graph Attention Network and GRU to model spatial-temporal dependencies, achieving state-of-the-art performance on benchmark datasets.
Details
Motivation: Handwritten signature verification is crucial for identity authentication but faces challenges due to intra-user variability and forgery risks, requiring more accurate verification methods.
Method: TS-GATR represents signatures as graphs with dynamic features, uses Dual-Graph Attention Transformer for local/global spatial modeling, and integrates GRU for long-term temporal dependencies.
Result: Comprehensive experiments on MSDS and DeepSignDB datasets show TS-GATR surpasses current state-of-the-art approaches with consistently lower Equal Error Rates across various scenarios.
Conclusion: The proposed TS-GATR method effectively addresses signature verification challenges by modeling both spatial and temporal dependencies, demonstrating superior performance over existing methods.
Abstract: Handwritten signature verification is a crucial aspect of identity authentication, with applications in various domains such as finance and e-commerce. However, achieving high accuracy in signature verification remains challenging due to intra-user variability and the risk of forgery. This paper introduces a novel approach for dynamic signature verification: the Temporal-Spatial Graph Attention Transformer (TS-GATR). TS-GATR combines the Graph Attention Network (GAT) and the Gated Recurrent Unit (GRU) to model both spatial and temporal dependencies in signature data. TS-GATR enhances verification performance by representing signatures as graphs, where each node captures dynamic features (e.g. position, velocity, pressure), and by using attention mechanisms to model their complex relationships. The proposed method further employs a Dual-Graph Attention Transformer (DGATR) module, which utilizes k-step and k-nearest neighbor adjacency graphs to model local and global spatial features, respectively. To capture long-term temporal dependencies, the model integrates GRU, thereby enhancing its ability to learn dynamic features during signature verification. Comprehensive experiments conducted on benchmark datasets such as MSDS and DeepSignDB show that TS-GATR surpasses current state-of-the-art approaches, consistently achieving lower Equal Error Rates (EER) across various scenarios.
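As a rough illustration of the k-nearest-neighbor adjacency mentioned for the global spatial graph, the sketch below builds a symmetric k-NN graph over sampled pen positions; the feature layout and the choice of k are assumptions.

```python
# Sketch: k-nearest-neighbor adjacency over signature sample points, as used
# for a "global" spatial graph; feature layout and k are illustrative.
import numpy as np

def knn_adjacency(points_xy, k=3):
    """points_xy: (N, 2) pen positions; returns a symmetric (N, N) 0/1 adjacency."""
    n = len(points_xy)
    dists = np.linalg.norm(points_xy[:, None] - points_xy[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # no self-loops
    adj = np.zeros((n, n))
    nearest = np.argsort(dists, axis=1)[:, :k]
    for i, neighbors in enumerate(nearest):
        adj[i, neighbors] = 1.0
    return np.maximum(adj, adj.T)                   # symmetrize

pts = np.random.rand(50, 2)                         # 50 sampled pen positions
print(knn_adjacency(pts).sum(axis=1)[:5])           # per-node degree (>= k)
```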
[143] Seabed-Net: A multi-task network for joint bathymetry estimation and seabed classification from remote sensing imagery in shallow waters
Panagiotis Agrafiotis, Begüm Demir
Main category: cs.CV
TL;DR: Seabed-Net is a unified multi-task framework that simultaneously predicts bathymetry and seabed classification from remote sensing imagery, outperforming traditional methods and single-task approaches through cross-task feature integration and dynamic task weighting.
Details
Motivation: Existing approaches treat depth estimation and seabed classification as isolated tasks, missing the synergistic benefits of their interaction and hindering adoption of deep learning methods for shallow-water mapping.
Method: Uses dual-branch encoders for bathymetry and seabed classification, integrates cross-task features via Attention Feature Fusion module and windowed Swin-Transformer fusion block, and employs dynamic task uncertainty weighting to balance objectives.
Result: Achieves up to 75% lower RMSE than traditional methods, reduces bathymetric RMSE by 10-30% compared to state-of-the-art baselines, improves seabed classification accuracy up to 8%, and demonstrates enhanced spatial consistency and sharper habitat boundaries.
Conclusion: Jointly modeling depth with substrate and seabed habitats yields synergistic gains, providing a robust open solution for integrated shallow-water mapping that addresses climatological and anthropogenic pressures.
Abstract: Accurate, detailed, and regularly updated bathymetry, coupled with complex semantic content, is essential for under-mapped shallow-water environments facing increasing climatological and anthropogenic pressures. However, existing approaches that derive either depth or seabed classes from remote sensing imagery treat these tasks in isolation, forfeiting the mutual benefits of their interaction and hindering the broader adoption of deep learning methods. To address these limitations, we introduce Seabed-Net, a unified multi-task framework that simultaneously predicts bathymetry and pixel-based seabed classification from remote sensing imagery of various resolutions. Seabed-Net employs dual-branch encoders for bathymetry estimation and pixel-based seabed classification, integrates cross-task features via an Attention Feature Fusion module and a windowed Swin-Transformer fusion block, and balances objectives through dynamic task uncertainty weighting. In extensive evaluations at two heterogeneous coastal sites, it consistently outperforms traditional empirical models and traditional machine learning regression methods, achieving up to 75% lower RMSE. It also reduces bathymetric RMSE by 10-30% compared to state-of-the-art single-task and multi-task baselines and improves seabed classification accuracy up to 8%. Qualitative analyses further demonstrate enhanced spatial consistency, sharper habitat boundaries, and corrected depth biases in low-contrast regions. These results confirm that jointly modeling depth with both substrate and seabed habitats yields synergistic gains, offering a robust, open solution for integrated shallow-water mapping. Code and pretrained weights are available at https://github.com/pagraf/Seabed-Net.
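The dynamic task uncertainty weighting mentioned in the abstract is commonly realized with learned per-task log-variances; the sketch below follows that assumption rather than the paper's exact formulation, and the two loss values are illustrative.

```python
# Sketch of homoscedastic-uncertainty task weighting (an assumption about how
# "dynamic task uncertainty weighting" is realized): each task gets a learned
# log-variance that scales its loss and adds a regularizing penalty.
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])
            total = total + precision * loss + self.log_vars[i]
        return total

weighting = UncertaintyWeighting(num_tasks=2)
bathy_loss = torch.tensor(0.8)    # regression (depth) loss, illustrative value
seabed_loss = torch.tensor(1.3)   # classification loss, illustrative value
print(weighting([bathy_loss, seabed_loss]))
```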
[144] Exploring Scale Shift in Crowd Localization under the Context of Domain Generalization
Juncheng Wang, Lei Shang, Ziqi Liu, Wang Lu, Xixu Hu, Zhe Hu, Jindong Wang, Shujun Wang
Main category: cs.CV
TL;DR: This paper investigates scale shift (discrepancies in head scale distributions) in crowd localization domain generalization, establishes a benchmark called ScaleBench, analyzes the problem theoretically, and proposes a solution called Catto (Causal Feature Decomposition and Anisotropic Processing) to mitigate scale shift effects.
Details
Motivation: Existing crowd localization approaches suffer from performance degradation due to scale shift between training and testing data in domain generalization scenarios. The paper aims to understand how scale shift affects crowd localization and develop methods to address this challenge.
Method: The authors conduct systematic examination of scale shift effects, establish ScaleBench benchmark with 20 DG algorithms, provide theoretical analysis, and propose Catto algorithm that uses causal feature decomposition and anisotropic processing to mitigate scale shift influence.
Result: The study demonstrates limitations of existing DG algorithms, reveals the importance and complexity of scale shift, and shows that the proposed Catto algorithm effectively mitigates scale shift effects in crowd localization domain generalization.
Conclusion: Scale shift is a critical but under-explored challenge in crowd localization domain generalization. The paper establishes a new research direction called Scale Shift Domain Generalization and provides four significant insights for future research through extensive analytical experiments.
Abstract: Crowd localization plays a crucial role in visual scene understanding towards predicting each pedestrian location in a crowd, thus being applicable to various downstream tasks. However, existing approaches suffer from significant performance degradation due to discrepancies in head scale distributions (scale shift) between training and testing data, a challenge known as domain generalization (DG). This paper aims to comprehend the nature of scale shift within the context of domain generalization for crowd localization models. To this end, we address four critical questions: (i) How does scale shift influence crowd localization in a DG scenario? (ii) How can we quantify this influence? (iii) What causes this influence? (iv) How to mitigate the influence? Initially, we conduct a systematic examination of how crowd localization performance varies with different levels of scale shift. Then, we establish a benchmark, ScaleBench, and reproduce 20 advanced DG algorithms to quantify the influence. Through extensive experiments, we demonstrate the limitations of existing algorithms and underscore the importance and complexity of scale shift, a topic that remains insufficiently explored. To deepen our understanding, we provide a rigorous theoretical analysis on scale shift. Building on these insights, we further propose an effective algorithm called Causal Feature Decomposition and Anisotropic Processing (Catto) to mitigate the influence of scale shift in DG settings. Later, we also provide extensive analytical experiments, revealing four significant insights for future research. Our results emphasize the importance of this novel and applicable research direction, which we term Scale Shift Domain Generalization.
[145] BrainMCLIP: Brain Image Decoding with Multi-Layer feature Fusion of CLIP
Tian Xia, Zihan Ma, Xinlong Wang, Qing Liu, Xiaowei He, Tianming Liu, Yudan Ren
Main category: cs.CV
TL;DR: BrainMCLIP is a parameter-efficient fMRI decoding method that aligns brain activity to multiple CLIP layers guided by visual hierarchy, eliminating the need for VAE pipelines while achieving competitive performance.
Details
Motivation: Existing fMRI decoding methods either use only CLIP's final semantic layer (missing visual details) or add parameter-intensive VAE pipelines, both overlooking CLIP's intermediate layers and contradicting the brain's functional hierarchy.
Method: Aligns fMRI signals from distinct visual areas to corresponding intermediate and final CLIP layers, uses Cross-Reconstruction strategy and multi-granularity loss, guided by the human visual system's functional hierarchy.
Result: Achieves highly competitive performance, excels on high-level semantic metrics matching/surpassing SOTA methods, with 71.7% fewer parameters than VAE-based methods by avoiding VAE pathway.
Conclusion: BrainMCLIP effectively captures visual details missed by CLIP-only approaches, balancing semantic accuracy and detail fidelity without requiring separate VAE pipelines.
Abstract: Decoding images from fMRI often involves mapping brain activity to CLIP's final semantic layer. To capture finer visual details, many approaches add a parameter-intensive VAE-based pipeline. However, these approaches overlook rich object information within CLIP's intermediate layers and contradict the brain's functional hierarchy. We introduce BrainMCLIP, which pioneers a parameter-efficient, multi-layer fusion approach guided by the human visual system's functional hierarchy, eliminating the need for such a separate VAE pathway. BrainMCLIP aligns fMRI signals from functionally distinct visual areas (low-/high-level) to corresponding intermediate and final CLIP layers, respecting functional hierarchy. We further introduce a Cross-Reconstruction strategy and a novel multi-granularity loss. Results show BrainMCLIP achieves highly competitive performance, particularly excelling on high-level semantic metrics where it matches or surpasses SOTA (state-of-the-art) methods, including those using VAE pipelines. Crucially, it achieves this with substantially fewer parameters, demonstrating a reduction of 71.7% compared to top VAE-based SOTA methods, by avoiding the VAE pathway. By leveraging intermediate CLIP features, it effectively captures visual details often missed by CLIP-only approaches, striking a compelling balance between semantic accuracy and detail fidelity without requiring a separate VAE pipeline.
[146] A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
Ying Dai, Wei Yu Chen
Main category: cs.CV
TL;DR: Training-free framework for open-vocabulary image segmentation and recognition using EfficientNetB0 for unsupervised segmentation and CLIP for recognition via vision-language alignment.
Details
Motivation: To develop a training-free approach for open-vocabulary segmentation and recognition that doesn't require labeled data, leveraging pre-trained models for unsupervised segmentation and cross-modal alignment.
Method: Two-stage pipeline: 1) Unsupervised segmentation using EfficientNetB0 features with SVD decomposition and hierarchical clustering, 2) Segment recognition using CLIP's vision-language alignment with projected embeddings in shared latent space.
Result: Achieves state-of-the-art performance on COCO, ADE20K, and PASCAL VOC benchmarks in terms of Hungarian mIoU, precision, recall, and F1-score.
Conclusion: The framework demonstrates effectiveness, flexibility, and generalizability for open-vocabulary segmentation and recognition without requiring training.
Abstract: This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP’s text encoder from category-specific prompts, including a generic something else prompt to support open set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
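A minimal sketch of the second-stage recognition step, assuming the standard OpenAI clip package: each segmented region and the category prompts (including a "something else" prompt) are encoded, and a softmax over cosine similarities yields class probabilities. The paper's SVD projection into a shared latent space is omitted, and the image path is a placeholder.

```python
# Sketch of segment-level open-vocabulary recognition with CLIP (OpenAI clip
# package assumed); the paper's SVD projection into a shared space is omitted.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

categories = ["a photo of a dog", "a photo of a car", "something else"]
text_tokens = clip.tokenize(categories).to(device)

# "segment_crop.png" is a placeholder for one cropped segmented region
segment = preprocess(Image.open("segment_crop.png")).unsqueeze(0).to(device)
with torch.no_grad():
    image_emb = model.encode_image(segment)
    text_emb = model.encode_text(text_tokens)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_emb @ text_emb.T).softmax(dim=-1)       # (1, 3)
print(dict(zip(categories, probs[0].tolist())))
```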
[147] DaMo: Data Mixing Optimizer in Fine-tuning Multimodal LLMs for Mobile Phone Agents
Kai Shi, Jun Yang, Ni Yang, Binqiang Pan, Qingsong Xie, Chao Zhang, Zhenyu Yang, Tianhuang Su, Haonan Lu
Main category: cs.CV
TL;DR: DaMo is a data mixture optimization method that uses a trainable network to predict optimal dataset ratios for multitask learning in Mobile Phone Agents, achieving significant performance improvements across multiple benchmarks.
Details
Motivation: Current Multimodal Large Language Models struggle with handling multiple mobile phone tasks simultaneously, and existing multitask supervised fine-tuning approaches cannot determine optimal training data compositions for peak performance.
Method: Proposed DaMo (Data Mixture Optimizer) - a trainable network that predicts optimal data mixtures by forecasting downstream task performance for any given dataset ratio. Also introduced PhoneAgentBench, a specialized benchmark with 1235 QA pairs for evaluating MLLMs on multimodal mobile phone tasks.
Result: DaMo achieved 3.38% performance improvement on PhoneAgentBench compared to alternative methods, and showed superior generalization across established benchmarks (2.57% average improvement). When used solely for MLLM optimization on BFCL-v3, it improved metrics by 12.47%. Demonstrated strong predictive capability (R^2=0.81) and maintained robust scalability across different model architectures.
Conclusion: DaMo effectively addresses the challenge of determining optimal data mixtures for multitask learning in Mobile Phone Agents, demonstrating significant performance improvements, strong generalization capabilities, and robust scalability across various benchmarks and model architectures.
Abstract: Mobile Phone Agents (MPAs) have emerged as a promising research direction due to their broad applicability across diverse scenarios. While Multimodal Large Language Models (MLLMs) serve as the foundation for MPAs, their effectiveness in handling multiple mobile phone tasks simultaneously remains limited. Although multitask supervised fine-tuning (SFT) is widely adopted for multitask learning, existing approaches struggle to determine optimal training data compositions for peak performance. To address this challenge, we propose DaMo (Data Mixture Optimizer) - a novel solution employing a trainable network that predicts optimal data mixtures by forecasting downstream task performance for any given dataset ratio. To support comprehensive evaluation, we introduce PhoneAgentBench, the first specialized benchmark to evaluate MLLMs on multimodal mobile phone tasks, comprising 1235 QA pairs spanning diverse real-world industrial mobile application scenarios. Demonstrating strong predictive capability (R^2=0.81) in small-scale pilot experiments, DaMo efficiently extrapolates optimal data mixing configurations. Our results show DaMo achieves a 3.38% performance improvement on PhoneAgentBench compared to alternative methods. Furthermore, extensive experiments across established benchmarks including BFCL-v3, MME-Reasoning, MME-Perception, and OCRBench reveal DaMo's superior generalization, outperforming other approaches by 2.57% in terms of average score. When used solely for MLLM optimization on the BFCL-v3 task, DaMo improves the metrics by 12.47% over other methods. Notably, DaMo maintains robust scalability, preserving its effectiveness when applied to other model architectures. The code and dataset are available at https://github.com/OPPO-Mente-Lab/DaMo.git
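To illustrate the core idea of a data mixture optimizer, the sketch below fits a small network that maps a mixture-ratio vector to a predicted benchmark score on a handful of pilot runs, then ranks candidate mixtures; the architecture and the synthetic pilot data are assumptions, not DaMo's actual design.

```python
# Sketch of a data-mixture performance predictor: an MLP maps a mixture-ratio
# vector (summing to 1) to a predicted benchmark score; it is fit on a few
# pilot runs and then queried to rank candidate mixtures. All numbers and the
# architecture are illustrative assumptions.
import torch
import torch.nn as nn

num_datasets = 5
predictor = nn.Sequential(nn.Linear(num_datasets, 32), nn.ReLU(), nn.Linear(32, 1))

pilot_ratios = torch.softmax(torch.randn(20, num_datasets), dim=1)   # 20 pilot mixtures
pilot_scores = torch.rand(20, 1)                                     # their measured scores

opt = torch.optim.Adam(predictor.parameters(), lr=1e-2)
for _ in range(200):                                                 # fit the predictor
    opt.zero_grad()
    loss = nn.functional.mse_loss(predictor(pilot_ratios), pilot_scores)
    loss.backward()
    opt.step()

candidates = torch.softmax(torch.randn(1000, num_datasets), dim=1)   # candidate mixtures
best = candidates[predictor(candidates).argmax()]                    # highest predicted score
print(best)
```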
[148] DARE: A Deformable Adaptive Regularization Estimator for Learning-Based Medical Image Registration
Ahsan Raza Siyal, Markus Haltmeier, Ruth Steiger, Malik Galijasevic, Elke Ruth Gizewski, Astrid Ellen Grams
Main category: cs.CV
TL;DR: DARE is a deformable medical image registration framework that dynamically adjusts elastic regularization based on deformation field gradients, integrating adaptive strain/shear energy terms and folding-prevention mechanisms for robust, anatomically plausible registration.
Details
Motivation: Current deep learning-based registration methods often overlook regularization's critical role in ensuring robustness and anatomical plausibility, leading to non-physical artifacts like folding and over-smoothing issues.
Method: Proposes DARE framework with dynamic elastic regularization adjustment based on deformation field gradient norm, integrating adaptive strain and shear energy terms, and a folding-prevention mechanism that penalizes negative deformation Jacobian regions.
Result: The approach mitigates non-physical artifacts, avoids over-smoothing, and improves both registration accuracy and anatomical plausibility compared to traditional methods.
Conclusion: DARE provides a robust registration framework that balances stability and flexibility through adaptive regularization, ensuring physically realistic transformations in medical image registration.
Abstract: Deformable medical image registration is a fundamental task in medical image analysis. While deep learning-based methods have demonstrated superior accuracy and computational efficiency compared to traditional techniques, they often overlook the critical role of regularization in ensuring robustness and anatomical plausibility. We propose DARE (Deformable Adaptive Regularization Estimator), a novel registration framework that dynamically adjusts elastic regularization based on the gradient norm of the deformation field. Our approach integrates strain and shear energy terms, which are adaptively modulated to balance stability and flexibility. To ensure physically realistic transformations, DARE includes a folding-prevention mechanism that penalizes regions with negative deformation Jacobian. This strategy mitigates non-physical artifacts such as folding, avoids over-smoothing, and improves both registration accuracy and anatomical plausibility.
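A minimal sketch of a folding-prevention term, assuming a 2D displacement field and simple finite differences: approximate the Jacobian of the mapping identity-plus-displacement and penalize pixels where its determinant goes negative. The 2D setting and forward differences are illustrative simplifications.

```python
# Sketch of a folding penalty for a 2D displacement field u(x, y) = (ux, uy):
# approximate the Jacobian of phi = id + u with finite differences and
# penalize negative determinants (folded regions).
import torch

def folding_penalty(disp):
    """disp: (B, 2, H, W) displacement field."""
    dux_dx = disp[:, 0, :, 1:] - disp[:, 0, :, :-1]   # d(ux)/dx
    dux_dy = disp[:, 0, 1:, :] - disp[:, 0, :-1, :]   # d(ux)/dy
    duy_dx = disp[:, 1, :, 1:] - disp[:, 1, :, :-1]
    duy_dy = disp[:, 1, 1:, :] - disp[:, 1, :-1, :]
    # crop to a common (H-1, W-1) grid and add the identity part of the Jacobian
    j11 = 1.0 + dux_dx[:, :-1, :]
    j12 = dux_dy[:, :, :-1]
    j21 = duy_dx[:, :-1, :]
    j22 = 1.0 + duy_dy[:, :, :-1]
    det = j11 * j22 - j12 * j21
    return torch.relu(-det).mean()                    # penalize det < 0

disp = 0.1 * torch.randn(1, 2, 64, 64)
print(folding_penalty(disp))
```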
[149] AegisRF: Adversarial Perturbations Guided with Sensitivity for Protecting Intellectual Property of Neural Radiance Fields
Woo Jae Kim, Kyu Beom Han, Yoonki Cho, Youngju Na, Junsik Jung, Sooel Son, Sung-eui Yoon
Main category: cs.CV
TL;DR: AegisRF protects Neural Radiance Fields (NeRFs) intellectual property by injecting adversarial perturbations that disrupt unauthorized downstream applications while preserving rendering quality through adaptive geometric constraints.
Details
Motivation: As NeRFs become powerful tools for 3D scene representation, protecting their intellectual property from unauthorized use is crucial. Existing methods avoid geometric perturbations due to quality degradation concerns.
Method: Proposes AegisRF framework with two components: Perturbation Field that injects adversarial perturbations into pre-rendering outputs, and Sensitivity Field that learns spatially varying sensitivity to constrain geometric perturbations adaptively.
Result: Experimental evaluations show generalized applicability across diverse downstream tasks (multi-view image classification, voxel-based 3D localization) while maintaining high visual fidelity.
Conclusion: AegisRF effectively protects NeRF IP by disrupting unauthorized applications through adversarial perturbations while preserving rendering quality via adaptive geometric constraints.
Abstract: As Neural Radiance Fields (NeRFs) have emerged as a powerful tool for 3D scene representation and novel view synthesis, protecting their intellectual property (IP) from unauthorized use is becoming increasingly crucial. In this work, we aim to protect the IP of NeRFs by injecting adversarial perturbations that disrupt their unauthorized applications. However, perturbing the 3D geometry of NeRFs can easily deform the underlying scene structure and thus substantially degrade the rendering quality, which has led existing attempts to avoid geometric perturbations or restrict them to explicit spaces like meshes. To overcome this limitation, we introduce a learnable sensitivity to quantify the spatially varying impact of geometric perturbations on rendering quality. Building upon this, we propose AegisRF, a novel framework that consists of a Perturbation Field, which injects adversarial perturbations into the pre-rendering outputs (color and volume density) of NeRF models to fool an unauthorized downstream target model, and a Sensitivity Field, which learns the sensitivity to adaptively constrain geometric perturbations, preserving rendering quality while disrupting unauthorized use. Our experimental evaluations demonstrate the generalized applicability of AegisRF across diverse downstream tasks and modalities, including multi-view image classification and voxel-based 3D localization, while maintaining high visual fidelity. Codes are available at https://github.com/wkim97/AegisRF.
[150] Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes
Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, Rushuai Yang, Arctanx An, Leqi Zheng, Weijie Wang, Shawn Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo
Main category: cs.CV
TL;DR: MV-RoboBench is a new benchmark for evaluating multi-view spatial reasoning in vision-language models for robotics, showing current models lag significantly behind human performance despite their success in single-view settings.
Details
Motivation: Current VLM evaluations focus on single-view settings, while multi-camera setups are standard in robotics to handle occlusion and depth ambiguity. The ability of VLMs to effectively use multi-view inputs for robotic reasoning remains unexplored.
Method: Created MV-RoboBench with 1.7k manually curated QA items across 8 subtasks in two categories: spatial understanding and robotic execution. Evaluated diverse VLMs including open/closed-source models and CoT-enhanced versions.
Result: State-of-the-art models perform far below human level. Key findings: (1) spatial intelligence and robotic task execution are positively correlated in multi-view scenarios; (2) strong single-view spatial benchmark performance doesn’t translate to robotic spatial tasks.
Conclusion: MV-RoboBench reveals substantial challenges VLMs face in multi-view robotic perception and provides an open resource with standardized evaluation protocol to advance spatially grounded VLMs and VLAs.
Abstract: Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for the recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information underexplored. At the same time, multi-camera setups are increasingly standard in robotic platforms, as they provide complementary perspectives to mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, including both open-source and closed-source models, along with enhanced versions incorporating CoT-inspired techniques. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task execution are positively correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing not only data but also a standardized evaluation protocol for multi-view embodied reasoning.
[151] Multi-Camera Worker Tracking in Logistics Warehouse Considering Wide-Angle Distortion
Yuki Mori, Kazuma Kano, Yusuke Asai, Shin Katayama, Kenta Urano, Takuro Yonezawa, Nobuo Kawaguchi
Main category: cs.CV
TL;DR: Improved worker tracking in warehouses using 19 ceiling-mounted wide-angle cameras with foot position alignment to reduce distortion effects.
Details
Motivation: With e-commerce growth, warehouse efficiency is crucial. Digital twins require accurate worker position tracking, but single cameras have limited field of view, necessitating multi-camera systems.
Method: Used 19 wide-angle ceiling cameras looking down. Aligned camera coordinates to warehouse positions via floor surface. Addressed wide-angle distortion by aligning detected worker positions based on foot positions rather than full body.
Result: Achieved over 20% improvement in tracking accuracy. Compared multiple appearance feature utilization methods and validated proposed approach effectiveness.
Conclusion: Foot position-based alignment effectively reduces wide-angle camera distortion, enabling accurate multi-camera worker tracking for warehouse digital twin applications.
Abstract: With the spread of e-commerce, the logistics market is growing around the world. Therefore, improving the efficiency of warehouse operations is essential. To achieve this, various approaches have been explored, and among them, the use of digital twins is gaining attention. To make this approach possible, it is necessary to accurately collect the positions of workers in a warehouse and reflect them in a virtual space. However, a single camera has limitations in its field of view, therefore sensing with multiple cameras is necessary. In this study, we explored a method to track workers using 19 wide-angle cameras installed on the ceiling, looking down at the floor of the logistics warehouse. To understand the relationship between the camera coordinates and the actual positions in the warehouse, we performed alignment based on the floor surface. However, due to the characteristics of wide-angle cameras, significant distortion occurs at the edges of the image, particularly in the vertical direction. To address this, the detected worker positions from each camera were aligned based on foot positions, reducing the effects of image distortion, and enabling accurate position alignment across cameras. As a result, we confirmed an improvement of over 20% in tracking accuracy. Furthermore, we compared multiple methods for utilizing appearance features and validated the effectiveness of the proposed approach.
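The floor-based alignment can be pictured as a per-camera floor-plane homography: calibrate image points on the floor to warehouse coordinates, then project each worker's foot point (bottom-center of the detection box). The sketch below uses standard OpenCV calls; the calibration correspondences and the box are illustrative, not the paper's setup.

```python
# Sketch: align a camera to warehouse floor coordinates with a floor-plane
# homography, then project each worker's foot point (bottom-center of the box).
import cv2
import numpy as np

# image pixels of four floor markers -> their warehouse floor coordinates (meters)
img_pts = np.array([[120, 600], [1800, 620], [1700, 980], [200, 1000]], dtype=np.float32)
floor_pts = np.array([[0, 0], [12, 0], [12, 8], [0, 8]], dtype=np.float32)
H, _ = cv2.findHomography(img_pts, floor_pts)

def foot_to_floor(box):
    """box: (x1, y1, x2, y2) in pixels; use the bottom-center as the foot point."""
    foot = np.array([[[(box[0] + box[2]) / 2.0, box[3]]]], dtype=np.float32)
    return cv2.perspectiveTransform(foot, H)[0, 0]   # (x, y) in meters

print(foot_to_floor((900, 300, 980, 760)))
```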
[152] Exploring “Many in Few” and “Few in Many” Properties in Long-Tailed, Highly-Imbalanced IC Defect Classification
Hao-Chiang Shao, Chun-Hao Chang, Yu-Hsien Lin, Chia-Wen Lin, Shao-Yun Fang, Yan-Hsiu Liu
Main category: cs.CV
TL;DR: This paper introduces IC-Defect-14 dataset for real-world IC defect classification and proposes ReCAME-Net to address challenges of extreme data imbalance and complex feature patterns in industrial settings.
Details
Motivation: Real-world IC defect classification faces challenges due to extreme data imbalance from high yield-rate requirements and complex feature patterns mixing class-specific and domain-related attributes, which existing imbalanced classification methods fail to handle effectively.
Method: Proposed ReCAME-Net with multi-expert classifier framework integrating regional channel attention module, metric learning losses, hard category mining strategy, and knowledge distillation procedure.
Result: ReCAME-Net outperforms previous state-of-the-art models on the IC-Defect-14 dataset while maintaining comparable performance on general public datasets.
Conclusion: The proposed approach effectively addresses the unique challenges of real-world IC defect classification with extreme data imbalance and complex feature patterns, demonstrating superior performance on industrial datasets while maintaining competitiveness on standard benchmarks.
Abstract: Despite significant advancements in deep classification techniques and in-lab automatic optical inspection models for long-tailed or highly imbalanced data, applying these approaches to real-world IC defect classification tasks remains challenging. This difficulty stems from two primary factors. First, real-world conditions, such as the high yield-rate requirements in the IC industry, result in data distributions that are far more skewed than those found in general public imbalanced datasets. Consequently, classifiers designed for open imbalanced datasets often fail to perform effectively in real-world scenarios. Second, real-world samples exhibit a mix of class-specific attributes and class-agnostic, domain-related features. This complexity adds significant difficulty to the classification process, particularly for highly imbalanced datasets. To address these challenges, this paper introduces the IC-Defect-14 dataset, a large, highly imbalanced IC defect image dataset sourced from AOI systems deployed in real-world IC production lines. This dataset is characterized by its unique “intra-class clusters” property, which presents two major challenges: large intra-class diversity and high inter-class similarity. These characteristics, rarely found simultaneously in existing public datasets, significantly degrade the performance of current state-of-the-art classifiers for highly imbalanced data. To tackle this challenge, we propose ReCAME-Net, which follows a multi-expert classifier framework and integrates a regional channel attention module, metric learning losses, a hard category mining strategy, and a knowledge distillation procedure. Extensive experimental evaluations demonstrate that ReCAME-Net outperforms previous state-of-the-art models on the IC-Defect-14 dataset while maintaining comparable performance and competitiveness on general public datasets.
[153] PCP-GAN: Property-Constrained Pore-scale image reconstruction via conditional Generative Adversarial Networks
Ali Sadeghkhani, Brandon Bennett, Masoud Babaei, Arash Rabbani
Main category: cs.CV
TL;DR: A multi-conditional GAN framework generates representative pore-scale images with controlled porosity and depth parameters, solving representativeness and data scarcity issues in subsurface characterization.
Details
Motivation: Natural spatial heterogeneity causes extracted sub-images to deviate from core-measured values, compounded by data scarcity where physical samples are only available at sparse well locations.
Method: Multi-conditional Generative Adversarial Network (cGAN) trained on thin section samples from carbonate formation, simultaneously conditioning on porosity values and depth parameters within a unified model.
Result: Achieved exceptional porosity control (R^2=0.95) with mean absolute errors of 0.0099-0.0197. Generated images showed superior representativeness with dual-constraint errors of 1.9-11.3% vs 36.4-578% for real sub-images.
Conclusion: The framework provides transformative tools for subsurface characterization, particularly valuable for carbon storage, geothermal energy, and groundwater management where representative pore morphology is critical for digital rock physics.
Abstract: Obtaining truly representative pore-scale images that match bulk formation properties remains a fundamental challenge in subsurface characterization, as natural spatial heterogeneity causes extracted sub-images to deviate significantly from core-measured values. This challenge is compounded by data scarcity, where physical samples are only available at sparse well locations. This study presents a multi-conditional Generative Adversarial Network (cGAN) framework that generates representative pore-scale images with precisely controlled properties, addressing both the representativeness challenge and data availability constraints. The framework was trained on thin section samples from four depths (1879.50-1943.50 m) of a carbonate formation, simultaneously conditioning on porosity values and depth parameters within a single unified model. This approach captures both universal pore network principles and depth-specific geological characteristics, from grainstone fabrics with interparticle-intercrystalline porosity to crystalline textures with anhydrite inclusions. The model achieved exceptional porosity control (R^2=0.95) across all formations with mean absolute errors of 0.0099-0.0197. Morphological validation confirmed preservation of critical pore network characteristics including average pore radius, specific surface area, and tortuosity, with statistical differences remaining within acceptable geological tolerances. Most significantly, generated images demonstrated superior representativeness with dual-constraint errors of 1.9-11.3% compared to 36.4-578% for randomly extracted real sub-images. This capability provides transformative tools for subsurface characterization, particularly valuable for carbon storage, geothermal energy, and groundwater management applications where knowing the representative morphology of the pore space is critical for implementing digital rock physics.
[154] Predicting before Reconstruction: A generative prior framework for MRI acceleration
Juhyung Park, Rokgi Hong, Roh-Eul Yoo, Jaehyeon Koo, Se Young Chun, Seung Hong Choi, Jongho Lee
Main category: cs.CV
TL;DR: A novel MRI acceleration framework using generative AI to predict target contrast images as data-driven priors for reconstructing highly under-sampled data, outperforming traditional reconstruction methods.
Details
Motivation: MRI's lengthy acquisition times limit clinical throughput, creating a need for faster imaging methods while maintaining quality.
Method: Predict target contrast images using generative models conditioned on diverse data sources (other contrasts, previous scans, parameters, patient info), then use these predictions as informative priors for reconstructing under-sampled k-space data.
Result: Significantly outperformed other approaches including those with alternative or no prior information across multiple datasets (14,921 scans) at high acceleration factors (x4, x8, x12).
Conclusion: Introduces a fundamental shift from image reconstruction to predictive imaging paradigm, enabling faster MRI acquisition without compromising quality.
Abstract: Recent advancements in artificial intelligence have created transformative capabilities in image synthesis and generation, enabling diverse research fields to innovate at revolutionary speed and spectrum. In this study, we leverage this generative power to introduce a new paradigm for accelerating Magnetic Resonance Imaging (MRI), introducing a shift from image reconstruction to proactive predictive imaging. Despite being a cornerstone of modern patient care, MRI’s lengthy acquisition times limit clinical throughput. Our novel framework addresses this challenge by first predicting a target contrast image, which then serves as a data-driven prior for reconstructing highly under-sampled data. This informative prior is predicted by a generative model conditioned on diverse data sources, such as other contrast images, previously scanned images, acquisition parameters, patient information. We demonstrate this approach with two key applications: (1) reconstructing FLAIR images using predictions from T1w and/or T2w scans, and (2) reconstructing T1w images using predictions from previously acquired T1w scans. The framework was evaluated on internal and multiple public datasets (total 14,921 scans; 1,051,904 slices), including multi-channel k-space data, for a range of high acceleration factors (x4, x8 and x12). The results demonstrate that our prediction-prior reconstruction method significantly outperforms other approaches, including those with alternative or no prior information. Through this framework we introduce a fundamental shift from image reconstruction towards a new paradigm of predictive imaging.
[155] PRGCN: A Graph Memory Network for Cross-Sequence Pattern Reuse in 3D Human Pose Estimation
Zhuoyang Xie, Yibo Zhao, Hui Huang, Riwei Wang, Zan Gao
Main category: cs.CV
TL;DR: PRGCN introduces a pattern reuse framework for monocular 3D human pose estimation that leverages cross-sequence motion patterns through a graph memory bank, achieving state-of-the-art performance on standard benchmarks.
Details
Motivation: Current video-based methods process each sequence in isolation, failing to exploit structural regularities and repetitive motion patterns across sequences, which limits their ability to resolve the fundamental depth ambiguity in 2D-to-3D lifting.Method: PRGCN uses a graph memory bank storing pose prototypes encoded as relational graphs, dynamically retrieved via attention and fused with anatomical constraints through memory-driven graph convolution. It employs a dual-stream hybrid architecture combining Mamba-based state-space models for local temporal modeling with self-attention for global relational capacity.
Result: Achieved MPJPE of 37.1mm on Human3.6M and 13.4mm on MPI-INF-3DHP, establishing new state-of-the-art performance while demonstrating enhanced cross-domain generalization capability.
Conclusion: Cross-sequence pattern reuse is pivotal for advancing 3D human pose estimation, shifting the paradigm from per-sequence optimization towards cumulative knowledge learning.
Abstract: Monocular 3D human pose estimation remains a fundamentally ill-posed inverse problem due to the inherent depth ambiguity in 2D-to-3D lifting. While contemporary video-based methods leverage temporal context to enhance spatial reasoning, they operate under a critical paradigm limitation: processing each sequence in isolation, thereby failing to exploit the strong structural regularities and repetitive motion patterns that pervade human movement across sequences. This work introduces the Pattern Reuse Graph Convolutional Network (PRGCN), a novel framework that formalizes pose estimation as a problem of pattern retrieval and adaptation. At its core, PRGCN features a graph memory bank that learns and stores a compact set of pose prototypes, encoded as relational graphs, which are dynamically retrieved via an attention mechanism to provide structured priors. These priors are adaptively fused with hard-coded anatomical constraints through a memory-driven graph convolution, ensuring geometrical plausibility. To underpin this retrieval process with robust spatiotemporal features, we design a dual-stream hybrid architecture that synergistically combines the linear-complexity, local temporal modeling of Mamba-based state-space models with the global relational capacity of self-attention. Extensive evaluations on Human3.6M and MPI-INF-3DHP benchmarks demonstrate that PRGCN establishes a new state-of-the-art, achieving an MPJPE of 37.1mm and 13.4mm, respectively, while exhibiting enhanced cross-domain generalization capability. Our work posits that the long-overlooked mechanism of cross-sequence pattern reuse is pivotal to advancing the field, shifting the paradigm from per-sequence optimization towards cumulative knowledge learning.
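A minimal sketch of attention-based retrieval from a learned prototype memory, in the spirit of the graph memory bank described above; the dimensions and module structure are hypothetical and not PRGCN's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeMemory(nn.Module):
    """Stores K learnable pose prototypes and retrieves a soft mixture per query feature."""
    def __init__(self, num_prototypes=64, dim=256):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))

    def forward(self, query):                       # query: (B, dim) sequence features
        attn = F.softmax(query @ self.prototypes.t() / query.shape[-1] ** 0.5, dim=-1)
        return attn @ self.prototypes               # (B, dim) retrieved structured prior

memory = PrototypeMemory()
pose_feature = torch.randn(8, 256)                  # per-sequence pose features
prior = memory(pose_feature)
print(prior.shape)                                  # torch.Size([8, 256])
```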
[156] Mitigating representation bias caused by missing pixels in methane plume detection
Julia Wąsala, Joannes D. Maasakkers, Ilse Aben, Rochelle Schneider, Holger Hoos, Mitra Baratchi
Main category: cs.CV
TL;DR: The paper addresses representation bias in methane plume detection from satellite images with systematically missing pixels, showing that models can associate coverage (percentage of valid pixels) with labels, leading to under-detection in low-coverage images.
Details
Motivation: Satellite images often have missing pixels due to clouds and other factors (MNAR), which can cause representation bias in automated feature extraction models, particularly in methane plume detection where models may learn spurious associations between coverage and labels.Method: Evaluated multiple imputation approaches and proposed a weighted resampling scheme during training that enforces class balance in each coverage bin to remove the association between label and coverage.
Result: Both resampling and imputation significantly reduced representation bias without hurting balanced accuracy, precision, or recall. Debiased models showed higher chance of detecting plumes in low-coverage images in operational scenarios.
Conclusion: The proposed techniques effectively mitigate representation bias caused by systematically missing pixels in satellite imagery, improving methane plume detection performance particularly in challenging low-coverage conditions.
Abstract: Most satellite images have systematically missing pixels (i.e., missing data not at random (MNAR)) due to factors such as clouds. If not addressed, these missing pixels can lead to representation bias in automated feature extraction models. In this work, we show that spurious association between the label and the number of missing values in methane plume detection can cause the model to associate the coverage (i.e., the percentage of valid pixels in an image) with the label, subsequently under-detecting plumes in low-coverage images. We evaluate multiple imputation approaches to remove the dependence between the coverage and a label. Additionally, we propose a weighted resampling scheme during training that removes the association between the label and the coverage by enforcing class balance in each coverage bin. Our results show that both resampling and imputation can significantly reduce the representation bias without hurting balanced accuracy, precision, or recall. Finally, we evaluate the capability of the debiased models using these techniques in an operational scenario and demonstrate that the debiased models have a higher chance of detecting plumes in low-coverage images.
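A small sketch of the kind of weighted resampling described above: each sample's weight is chosen so that, within every coverage bin, the plume and no-plume classes contribute equal total weight. The bin edges and weighting rule are illustrative assumptions:

```python
import numpy as np

def coverage_balanced_weights(coverage, labels, n_bins=10):
    """Per-sample weights that enforce class balance inside each coverage bin."""
    bins = np.clip((coverage * n_bins).astype(int), 0, n_bins - 1)
    weights = np.zeros(len(labels), dtype=float)
    for b in range(n_bins):
        in_bin = bins == b
        for cls in (0, 1):
            sel = in_bin & (labels == cls)
            if sel.any():
                # Every (bin, class) cell receives the same total weight.
                weights[sel] = 1.0 / sel.sum()
    return weights / weights.sum()

rng = np.random.default_rng(0)
coverage = rng.random(1000)                                  # fraction of valid pixels per image
labels = (rng.random(1000) < 0.3 * coverage).astype(int)     # label spuriously tied to coverage
w = coverage_balanced_weights(coverage, labels)
# w can be used as sampling probabilities, e.g. via torch.utils.data.WeightedRandomSampler.
print(w.sum())
```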
[157] Towards Single-Source Domain Generalized Object Detection via Causal Visual Prompts
Chen Li, Huiying Xu, Changxin Gao, Zeyu Wang, Yun Liu, Xinzhong Zhu
Main category: cs.CV
TL;DR: Cauvis method improves single-source domain generalization in object detection by addressing spurious correlations through cross-attention visual prompts and dual-branch feature disentanglement.
Details
Motivation: Current SDGOD methods suffer from spurious correlations where models over-rely on simplistic features like color rather than domain-invariant representations, limiting generalization to unseen domains.Method: Proposes Cauvis with: 1) Cross-Attention Prompts module to mitigate spurious feature bias, 2) Dual-branch adapter that disentangles causal-spurious features and achieves domain adaptation via high-frequency feature extraction.
Result: Achieves state-of-the-art performance with 15.9-31.4% gains over existing methods on SDGOD datasets, and exhibits significant robustness in complex interference environments.
Conclusion: Cauvis effectively addresses spurious correlation issues in single-source domain generalization through causal visual prompts and feature disentanglement, demonstrating superior generalization capability.
Abstract: Single-source Domain Generalized Object Detection (SDGOD), as a cutting-edge research topic in computer vision, aims to enhance model generalization capability in unseen target domains through single-source domain training. Current mainstream approaches attempt to mitigate domain discrepancies via data augmentation techniques. However, due to domain shift and limited domain-specific knowledge, models tend to fall into the pitfall of spurious correlations. This manifests as the model’s over-reliance on simplistic classification features (e.g., color) rather than essential domain-invariant representations like object contours. To address this critical challenge, we propose the Cauvis (Causal Visual Prompts) method. First, we introduce a Cross-Attention Prompts module that mitigates bias from spurious features by integrating visual prompts with cross-attention. To address the inadequate domain knowledge coverage and spurious feature entanglement in visual prompts for single-domain generalization, we propose a dual-branch adapter that disentangles causal-spurious features while achieving domain adaptation via high-frequency feature extraction. Cauvis achieves state-of-the-art performance with 15.9-31.4% gains over existing domain generalization methods on SDGOD datasets, while exhibiting significant robustness advantages in complex interference environments.
[158] CARES: Context-Aware Resolution Selector for VLMs
Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz
Main category: cs.CV
TL;DR: CARES is a lightweight module that predicts the minimal sufficient image resolution for VLMs, reducing compute by up to 80% while maintaining performance.
Details
Motivation: High-resolution images in VLMs inflate visual tokens to 97-99% of total tokens, causing high compute and latency even when low-resolution images would suffice.Method: CARES uses a compact VLM (350M) to extract features and predict when a target VLM’s response converges to its peak ability to answer correctly. It interpolates continuous resolutions at inference for fine-grained control.
Result: Across five multimodal benchmarks spanning documents and natural images, CARES preserves task performance while reducing compute by up to 80%.
Conclusion: CARES effectively reduces computational costs in VLMs by selecting minimal sufficient image resolutions without compromising performance.
Abstract: Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens, often to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce CARES, a Context-Aware Resolution Selector, a lightweight preprocessing module that, given an image-query pair, predicts the minimal sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM’s response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.
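A toy sketch of selecting the smallest resolution whose predicted sufficiency clears a threshold, with interpolation between the discrete candidates at inference. The probe network, candidate resolutions, and threshold are assumptions, not CARES itself:

```python
import torch
import torch.nn as nn

RESOLUTIONS = [224, 448, 672, 896, 1120]

class ResolutionSelector(nn.Module):
    """Tiny stand-in for the compact probe model: scores each candidate resolution."""
    def __init__(self, feat_dim=512, n_res=len(RESOLUTIONS)):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_res)

    def forward(self, joint_feature):                    # fused image+query feature
        return torch.sigmoid(self.head(joint_feature))   # per-resolution "sufficient" probability

def pick_resolution(scores, threshold=0.5):
    """Return the first sufficient resolution, interpolated with its predecessor."""
    res = torch.tensor(RESOLUTIONS, dtype=torch.float32)
    above = (scores >= threshold).nonzero(as_tuple=True)[0]
    if len(above) == 0:
        return int(res[-1].item())                        # fall back to the largest option
    i = int(above[0].item())
    if i == 0:
        return int(res[0].item())
    lo, hi = scores[i - 1], scores[i]
    t = (threshold - lo) / (hi - lo + 1e-8)               # linear interpolation between neighbours
    return int(((1 - t) * res[i - 1] + t * res[i]).item())

selector = ResolutionSelector()
scores = selector(torch.randn(512))
print(pick_resolution(scores))
```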
[159] PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis
Qing Mao, Tianxin Huang, Yu Zhu, Jinqiu Sun, Yanning Zhang, Gim Hee Lee
Main category: cs.CV
TL;DR: PoseCrafter improves pairwise camera pose estimation for sparsely overlapping images using hybrid video generation and feature matching selection to synthesize clearer intermediate frames.
Details
Motivation: Existing methods struggle with image pairs that have small or no overlap, and current approaches using video interpolation produce blurry frames with inefficient selection strategies.Method: Proposes Hybrid Video Generation (HVG) combining video interpolation with pose-conditioned novel view synthesis, and Feature Matching Selector (FMS) to select optimal intermediate frames for pose estimation.
Result: Extensive experiments show PoseCrafter significantly enhances pose estimation performance, especially on examples with small or no overlap, outperforming SOTA methods.
Conclusion: The proposed framework effectively addresses the challenge of pairwise camera pose estimation for sparsely overlapping image pairs through improved intermediate frame synthesis and selection.
Abstract: Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry due to small overlap inputs, and the selection strategies are slow and not explicitly aligned with pose estimation. To solve these cases, we propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model, where we also propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that, compared to existing SOTA methods, PoseCrafter can obviously enhance the pose estimation performances, especially on examples with small or no overlap.
[160] [De|Re]constructing VLMs’ Reasoning in Counting
Simone Alghisi, Gabriel Roccabruna, Massimo Rizzoli, Seyed Mahed Mousavi, Giuseppe Riccardi
Main category: cs.CV
TL;DR: VLMs struggle with visual reasoning tasks like counting objects due to sensitivity to object characteristics and spatial arrangements. Fine-tuning just the output layer can improve counting accuracy by up to 21%.
Details
Motivation: Vision-Language Models have limitations in visual reasoning tasks such as identifying relations, understanding temporal sequences, and counting objects. The paper aims to investigate the underlying causes of these failures and improve VLMs' reasoning capabilities.Method: Study reasoning skills of 7 state-of-the-art VLMs in counting tasks under controlled conditions, perform layer-wise analysis to identify error sources, and conduct targeted training by fine-tuning only the output layer.
Result: VLMs are highly sensitive to object number, type, spatial arrangement, and distractors. Layer analysis shows errors occur in mapping last-layer representations to output. Fine-tuning just the output layer improves accuracy by up to 21%, with consistent improvements on real-world datasets.
Conclusion: Targeted fine-tuning of the output layer significantly improves VLM performance on counting tasks, addressing core limitations in visual reasoning without requiring full model retraining.
Abstract: Vision-Language Models (VLMs) have recently gained attention due to their competitive performance on multiple downstream tasks, achieved by following user-input instructions. However, VLMs still exhibit several limitations in visual reasoning, such as difficulties in identifying relations (e.g., spatial, temporal, and among objects), understanding temporal sequences (e.g., frames), and counting objects. In this work, we go beyond score-level benchmark evaluations of VLMs by investigating the underlying causes of their failures and proposing a targeted approach to improve their reasoning capabilities. We study the reasoning skills of seven state-of-the-art VLMs in the counting task under controlled experimental conditions. Our experiments show that VLMs are highly sensitive to the number and type of objects, their spatial arrangement, and the co-occurrence of distractors. A layer-wise analysis reveals that errors are due to incorrect mapping of the last-layer representation into the output space. Our targeted training shows that fine-tuning just the output layer improves accuracy by up to 21%. We corroborate these findings by achieving consistent improvements on real-world datasets.
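The targeted training described above amounts to freezing the backbone and updating only the output projection. A generic PyTorch sketch of that setup; the attribute name `lm_head` and the toy model are assumptions, not the authors' code:

```python
import torch

class ToyVLM(torch.nn.Module):
    """Stand-in model: a frozen backbone plus a trainable output projection."""
    def __init__(self):
        super().__init__()
        self.backbone = torch.nn.Linear(512, 512)
        self.lm_head = torch.nn.Linear(512, 32000)

    def forward(self, x):
        return self.lm_head(self.backbone(x))

def freeze_all_but_output(model, output_attr="lm_head"):
    """Freeze every parameter except those of the named output layer."""
    for p in model.parameters():
        p.requires_grad = False
    for p in getattr(model, output_attr).parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

model = ToyVLM()
trainable = freeze_all_but_output(model)
optimizer = torch.optim.AdamW(trainable, lr=1e-4)
print(sum(p.numel() for p in trainable))   # only lm_head parameters are updated
```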
[161] The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models
Xiaofeng Zhang, Aaron Courville, Michal Drozdzal, Adriana Romero-Soriano
Main category: cs.CV
TL;DR: This paper investigates how prompt complexity affects synthetic data utility from text-to-image models, showing that increased complexity reduces diversity and consistency but improves distribution alignment with real data.
Details
Motivation: While text-to-image models can generate unlimited synthetic data, the systematic impact of prompt complexity on data quality, diversity, and consistency remains underexplored despite prompt engineering being the primary interaction method.Method: Conducted synthetic experiments with theoretical derivations, introduced a new evaluation framework comparing real and synthetic data utility, and analyzed prompt complexity effects across multiple datasets (CC12M, ImageNet-1k, DCI) using various inference-time intervention methods.
Result: Increasing prompt complexity reduces conditional diversity and prompt consistency but decreases synthetic-to-real distribution shift. Prompt expansion consistently achieves highest performance in image diversity and aesthetics, even surpassing real data.
Conclusion: Generalizing to more general conditions is challenging for diffusion models, and current inference-time interventions can enhance diversity but may move outside real data support. Prompt expansion emerges as the most effective method for balancing utility metrics.
Abstract: Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.
[162] HAD: Hierarchical Asymmetric Distillation to Bridge Spatio-Temporal Gaps in Event-Based Object Tracking
Yao Deng, Xian Zhong, Wenxuan Liu, Zhaofei Yu, Jingling Yuan, Tiejun Huang
Main category: cs.CV
TL;DR: HAD is a multi-modal knowledge distillation framework that addresses spatio-temporal asymmetries between RGB and event cameras to enhance object tracking in challenging conditions.
Details
Motivation: RGB and event cameras have complementary strengths (texture details vs temporal resolution/HDR), but their different imaging mechanisms create spatio-temporal asymmetries that hinder effective multi-modal integration for object tracking.Method: Proposes Hierarchical Asymmetric Distillation (HAD) with hierarchical alignment strategy to minimize information loss while maintaining computational efficiency and parameter compactness in the student network.
Result: Extensive experiments show HAD consistently outperforms state-of-the-art methods, with ablation studies validating the effectiveness of each component.
Conclusion: HAD successfully addresses spatio-temporal asymmetries between RGB and event cameras, providing superior object tracking performance in challenging conditions like high-speed motion and HDR environments.
Abstract: RGB cameras excel at capturing rich texture details with high spatial resolution, whereas event cameras offer exceptional temporal resolution and a high dynamic range (HDR). Leveraging their complementary strengths can substantially enhance object tracking under challenging conditions, such as high-speed motion, HDR environments, and dynamic background interference. However, a significant spatio-temporal asymmetry exists between these two modalities due to their fundamentally different imaging mechanisms, hindering effective multi-modal integration. To address this issue, we propose {Hierarchical Asymmetric Distillation} (HAD), a multi-modal knowledge distillation framework that explicitly models and mitigates spatio-temporal asymmetries. Specifically, HAD proposes a hierarchical alignment strategy that minimizes information loss while maintaining the student network’s computational efficiency and parameter compactness. Extensive experiments demonstrate that HAD consistently outperforms state-of-the-art methods, and comprehensive ablation studies further validate the effectiveness and necessity of each designed component. The code will be released soon.
[163] Can You Trust What You See? Alpha Channel No-Box Attacks on Video Object Detection
Ariana Yi, Ce Zhou, Liyang Xiao, Qiben Yan
Main category: cs.CV
TL;DR: Alpha-Cloak is the first no-box adversarial attack on object detectors using the alpha channel of RGBA videos, achieving 100% attack success rate without model access or visible artifacts.
Details
Motivation: Object detection models are widely deployed in cyber-physical systems like autonomous vehicles, but video domain adversarial attacks, especially in no-box settings, remain largely unexplored.Method: Exploits the alpha channel to fuse malicious target videos with benign videos, creating fused videos that appear normal to humans but consistently fool object detectors without requiring model architecture, parameters, or outputs.
Result: Achieved 100% attack success rate on five state-of-the-art object detectors, a vision-language model, and Gemini-2.0-Flash, with no perceptible artifacts and compatibility across common video formats.
Conclusion: Reveals a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
Abstract: As object detection models are increasingly deployed in cyber-physical systems such as autonomous vehicles (AVs) and surveillance platforms, ensuring their security against adversarial threats is essential. While prior work has explored adversarial attacks in the image domain, those attacks in the video domain remain largely unexamined, especially in the no-box setting. In this paper, we present α-Cloak, the first no-box adversarial attack on object detectors that operates entirely through the alpha channel of RGBA videos. α-Cloak exploits the alpha channel to fuse a malicious target video with a benign video, resulting in a fused video that appears innocuous to human viewers but consistently fools object detectors. Our attack requires no access to model architecture, parameters, or outputs, and introduces no perceptible artifacts. We systematically study the support for alpha channels across common video formats and playback applications, and design a fusion algorithm that ensures visual stealth and compatibility. We evaluate α-Cloak on five state-of-the-art object detectors, a vision-language model, and a multi-modal large language model (Gemini-2.0-Flash), demonstrating a 100% attack success rate across all scenarios. Our findings reveal a previously unexplored vulnerability in video-based perception systems, highlighting the urgent need for defenses that account for the alpha channel in adversarial settings.
[164] VGD: Visual Geometry Gaussian Splatting for Feed-Forward Surround-view Driving Reconstruction
Junhong Lin, Kangli Wang, Shunzhou Wang, Songlin Fan, Ge Li, Wei Gao
Main category: cs.CV
TL;DR: VGD is a feed-forward framework for surround-view autonomous driving scene reconstruction that uses geometric priors from VGGT and Gaussian parameters to improve novel view rendering quality while maintaining geometric consistency.
Details
Motivation: Existing methods fail to ensure geometric consistency and reconstruction quality for novel views in surround-view autonomous driving due to minimal overlap regions between views.Method: Uses lightweight VGGT variant for geometric priors, Gaussian Head for novel view rendering, and semantic refinement with multi-scale feature fusion from both geometry and Gaussian branches.
Result: Significantly outperforms state-of-the-art methods on nuScenes dataset in both objective metrics and subjective quality under various settings.
Conclusion: VGD demonstrates scalability and high-fidelity surround-view reconstruction for autonomous driving applications.
Abstract: Feed-forward surround-view autonomous driving scene reconstruction offers fast, generalizable inference ability, which faces the core challenge of ensuring generalization while elevating novel view quality. Due to the surround-view with minimal overlap regions, existing methods typically fail to ensure geometric consistency and reconstruction quality for novel views. To tackle this tension, we claim that geometric information must be learned explicitly, and the resulting features should be leveraged to guide the elevating of semantic quality in novel views. In this paper, we introduce \textbf{Visual Gaussian Driving (VGD)}, a novel feed-forward end-to-end learning framework designed to address this challenge. To achieve generalizable geometric estimation, we design a lightweight variant of the VGGT architecture to efficiently distill its geometric priors from the pre-trained VGGT to the geometry branch. Furthermore, we design a Gaussian Head that fuses multi-scale geometry tokens to predict Gaussian parameters for novel view rendering, which shares the same patch backbone as the geometry branch. Finally, we integrate multi-scale features from both geometry and Gaussian head branches to jointly supervise a semantic refinement model, optimizing rendering quality through feature-consistent learning. Experiments on nuScenes demonstrate that our approach significantly outperforms state-of-the-art methods in both objective metrics and subjective quality under various settings, which validates VGD’s scalability and high-fidelity surround-view reconstruction.
[165] Multi-modal Co-learning for Earth Observation: Enhancing single-modality models via modality collaboration
Francisco Mena, Dino Ienco, Cassio F. Dantas, Roberto Interdonato, Andreas Dengel
Main category: cs.CV
TL;DR: A multi-modal co-learning framework for Earth Observation that enables single-modality inference by structuring model manifolds into modality-shared and modality-specific information through contrastive and modality discriminative learning.
Details
Motivation: Address the challenge of limited sensor modality access at inference time in Earth Observation, where training data is abundant but real-world constraints restrict available modalities during deployment.Method: Combines contrastive learning and modality discriminative learning to guide single-modality models to separate modality-shared and modality-specific information in the internal model manifold.
Result: Demonstrates consistent predictive improvements over state-of-the-art approaches on four EO benchmarks across classification and regression tasks, with only one modality available at inference time.
Conclusion: The framework effectively generalizes across various EO tasks without targeting specific inference modalities, validating its utility in single-modality inference scenarios for diverse Earth Observation applications.
Abstract: Multi-modal co-learning is emerging as an effective paradigm in machine learning, enabling models to collaboratively learn from different modalities to enhance single-modality predictions. Earth Observation (EO) represents a quintessential domain for multi-modal data analysis, wherein diverse remote sensors collect data to sense our planet. This unprecedented volume of data introduces novel challenges. Specifically, the access to the same sensor modalities at both training and inference stages becomes increasingly complex based on real-world constraints affecting remote sensing platforms. In this context, multi-modal co-learning presents a promising strategy to leverage the vast amount of sensor-derived data available at the training stage to improve single-modality models for inference-time deployment. Most current research efforts focus on designing customized solutions for either particular downstream tasks or specific modalities available at the inference stage. To address this, we propose a novel multi-modal co-learning framework capable of generalizing across various tasks without targeting a specific modality for inference. Our approach combines contrastive and modality discriminative learning together to guide single-modality models to structure the internal model manifold into modality-shared and modality-specific information. We evaluate our framework on four EO benchmarks spanning classification and regression tasks across different sensor modalities, where only one of the modalities available during training is accessible at inference time. Our results demonstrate consistent predictive improvements over state-of-the-art approaches from the recent machine learning and computer vision literature, as well as EO-specific methods. The obtained findings validate our framework in the single-modality inference scenarios across a diverse range of EO applications.
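A compact sketch of pairing a cross-modal contrastive term (which pulls matching-scene embeddings together) with a modality-discriminative term on a separate feature subspace. The split into shared and specific halves and the inline classifier are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn.functional as F

def co_learning_losses(feat_a, feat_b, temperature=0.07):
    """feat_a, feat_b: (B, D) features of the same scenes from two sensor modalities."""
    d = feat_a.shape[1] // 2
    shared_a, spec_a = feat_a[:, :d], feat_a[:, d:]
    shared_b, spec_b = feat_b[:, :d], feat_b[:, d:]

    # Contrastive (InfoNCE) term on the shared halves: matching scenes are positives.
    za, zb = F.normalize(shared_a, dim=1), F.normalize(shared_b, dim=1)
    logits = za @ zb.t() / temperature
    targets = torch.arange(len(za))
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Modality-discriminative term on the specific halves: predict which sensor produced them.
    spec = torch.cat([spec_a, spec_b], dim=0)
    modality = torch.cat([torch.zeros(len(spec_a)), torch.ones(len(spec_b))]).long()
    classifier = torch.nn.Linear(d, 2)      # in practice a trained head; created inline here
    discriminative = F.cross_entropy(classifier(spec), modality)
    return contrastive, discriminative

la, lb = co_learning_losses(torch.randn(16, 256), torch.randn(16, 256))
print(float(la), float(lb))
```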
[166] Addressing the Depth-of-Field Constraint: A New Paradigm for High Resolution Multi-Focus Image Fusion
Luca Piano, Peng Huanwen, Radu Ciprian Bilcu
Main category: cs.CV
TL;DR: VAEEDOF is a novel multi-focus image fusion method using a distilled variational autoencoder that processes up to 7 images simultaneously, achieving state-of-the-art results with seamless artifact-free fusion.
Details
Motivation: Addresses depth-of-field limitations in optical lenses where only objects within specific ranges appear sharp, and overcomes challenges like limited training data, domain gaps from synthetic datasets, and difficulties with information-lacking regions.Method: Uses a distilled variational autoencoder for high-fidelity, efficient image reconstruction. Introduces MattingMFIF, a new synthetic 4K dataset simulating realistic DOF effects from real photographs. Fusion module processes up to seven images simultaneously.
Result: Achieves state-of-the-art results, generates seamless artifact-free fused images, and bridges the gap between synthetic and real-world scenarios.
Conclusion: Offers a significant step forward in addressing complex multi-focus image fusion challenges by providing robust fusion across diverse focus points and overcoming data scarcity issues.
Abstract: Multi-focus image fusion (MFIF) addresses the depth-of-field (DOF) limitations of optical lenses, where only objects within a specific range appear sharp. Although traditional and deep learning methods have advanced the field, challenges persist, including limited training data, domain gaps from synthetic datasets, and difficulties with regions lacking information. We propose VAEEDOF, a novel MFIF method that uses a distilled variational autoencoder for high-fidelity, efficient image reconstruction. Our fusion module processes up to seven images simultaneously, enabling robust fusion across diverse focus points. To address data scarcity, we introduce MattingMFIF, a new synthetic 4K dataset, simulating realistic DOF effects from real photographs. Our method achieves state-of-the-art results, generating seamless artifact-free fused images and bridging the gap between synthetic and real-world scenarios, offering a significant step forward in addressing complex MFIF challenges. The code and weights are available here:
[167] Uncertainty evaluation of segmentation models for Earth observation
Melanie Rey, Andriy Mnih, Maxim Neumann, Matt Overlan, Drew Purves
Main category: cs.CV
TL;DR: This paper benchmarks uncertainty estimation methods for semantic segmentation in satellite imagery, evaluating their practical utility for identifying prediction errors and noise-corrupted regions across two remote sensing datasets.
Details
Motivation: Semantic segmentation uncertainty estimation presents unique challenges compared to standard classification, with limited research focused on remote sensing applications. The authors aim to address this gap by benchmarking methods specifically for Earth observation tasks.Method: Extensive evaluation using Stochastic Segmentation Networks and ensembles with various neural architectures and uncertainty metrics on two remote sensing datasets (PASTIS and ForTy) that differ in scale, geographic coverage, and label confidence.
Result: The study provides practical recommendations based on findings about which uncertainty estimation methods work best for identifying prediction errors and noise-corrupted regions in satellite imagery segmentation.
Conclusion: The paper establishes benchmarks for uncertainty estimation in remote sensing semantic segmentation and offers practical guidance for implementing these methods in Earth observation applications.
Abstract: This paper investigates methods for estimating uncertainty in semantic segmentation predictions derived from satellite imagery. Estimating uncertainty for segmentation presents unique challenges compared to standard image classification, requiring scalable methods producing per-pixel estimates. While most research on this topic has focused on scene understanding or medical imaging, this work benchmarks existing methods specifically for remote sensing and Earth observation applications. Our evaluation focuses on the practical utility of uncertainty measures, testing their ability to identify prediction errors and noise-corrupted input image regions. Experiments are conducted on two remote sensing datasets, PASTIS and ForTy, selected for their differences in scale, geographic coverage, and label confidence. We perform an extensive evaluation featuring several models, such as Stochastic Segmentation Networks and ensembles, in combination with a number of neural architectures and uncertainty metrics. We make a number of practical recommendations based on our findings.
[168] Digitizing Paper ECGs at Scale: An Open-Source Algorithm for Clinical Research
Elias Stenhede, Agnar Martin Bjørnstad, Arian Ranjbar
Main category: cs.CV
TL;DR: A fully automated framework converts scanned ECG paper images into digital signals, achieving state-of-the-art performance on large datasets and enabling access to retrospective ECG archives.
Details
Motivation: Millions of clinical ECGs exist only as paper scans, making them unusable for modern automated diagnostics and AI-driven analysis.Method: A fully automated, modular framework that processes scanned or photographed ECG images, handling common artifacts like perspective distortion, wrinkles, and stains.
Result: Validated on 37,191 ECG images with mean SNR of 19.65 dB on scanned papers. Outperforms state-of-the-art on Emory Paper Digitization ECG Dataset (35,595 images) across all subcategories.
Conclusion: The open-source software enables conversion of paper ECG archives to digital signals, promoting reproducibility and democratizing access to AI-driven diagnostics.
Abstract: Millions of clinical ECGs exist only as paper scans, making them unusable for modern automated diagnostics. We introduce a fully automated, modular framework that converts scanned or photographed ECGs into digital signals, suitable for both clinical and research applications. The framework is validated on 37,191 ECG images with 1,596 collected at Akershus University Hospital, where the algorithm obtains a mean signal-to-noise ratio of 19.65 dB on scanned papers with common artifacts. It is further evaluated on the Emory Paper Digitization ECG Dataset, comprising 35,595 images, including images with perspective distortion, wrinkles, and stains. The model improves on the state-of-the-art in all subcategories. The full software is released as open-source, promoting reproducibility and further development. We hope the software will contribute to unlocking retrospective ECG archives and democratize access to AI-driven diagnostics.
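For context, the signal-to-noise ratio quoted above is conventionally computed as the power ratio between a reference signal and the digitization error, in decibels; the paper's exact convention may differ:

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR in dB between a reference ECG lead and its digitized reconstruction."""
    noise = reference - estimate
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

t = np.linspace(0, 10, 5000)
ref = np.sin(2 * np.pi * 1.2 * t)                               # toy stand-in for one ECG lead
est = ref + 0.1 * np.random.default_rng(0).standard_normal(len(t))
print(round(snr_db(ref, est), 2))                               # roughly 17 dB at this noise level
```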
[169] Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, Seon Joo Kim
Main category: cs.CV
TL;DR: DecAF is a training-free method that converts MLLM attention maps into video segmentation masks using decomposed attention fusion and SAM2 prompting, achieving performance comparable to training-based methods.
Details
Motivation: To enable video reasoning segmentation without retraining MLLMs by refining noisy raw attention maps that are poorly aligned with object regions.Method: Proposes Decomposed Attention Fusion (DecAF) with contrastive object-background fusion and complementary video-frame fusion, plus attention-guided SAM2 prompting for fine-grained masks.
Result: Outperforms training-free methods and achieves comparable performance to training-based methods on referring and reasoning VOS benchmarks.
Conclusion: DecAF enables effective video reasoning segmentation without MLLM retraining by refining attention maps through decomposition and fusion mechanisms.
Abstract: Multimodal large language models (MLLMs) demonstrate strong video understanding by attending to visual tokens relevant to textual queries. To directly adapt this for localization in a training-free manner, we cast video reasoning segmentation as a video QA task and extract attention maps via rollout mechanism. However, raw attention maps are noisy and poorly aligned with object regions. We propose Decomposed Attention Fusion (DecAF), which refines these maps through two mechanisms: (1) contrastive object-background fusion and (2) complementary video-frame fusion. This method suppresses irrelevant activations and enhances object-focused cues, enabling direct conversion of attention maps into coarse segmentation masks. In addition, we introduce attention-guided SAM2 prompting for obtaining fine-grained masks. Unlike existing methods that jointly train MLLMs with SAM, our method operates entirely without retraining. DecAF outperforms training-free methods and achieves performance comparable to training-based methods on both referring and reasoning VOS benchmarks. The code will be available at https://github.com/HYUNJS/DecAF.
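Attention rollout, mentioned in the abstract, propagates attention through the layers by averaging heads, adding the residual identity, and chaining the per-layer matrices. A generic implementation of rollout (not DecAF's fusion steps) might look like:

```python
import torch

def attention_rollout(attentions):
    """attentions: list of (heads, tokens, tokens) attention maps, one per layer."""
    tokens = attentions[0].shape[-1]
    rollout = torch.eye(tokens)
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=0)                         # average over heads
        attn = 0.5 * attn + 0.5 * torch.eye(tokens)           # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)          # re-normalize rows
        rollout = attn @ rollout                              # accumulate across layers
    return rollout                                            # (tokens, tokens)

layers = [torch.rand(12, 197, 197).softmax(dim=-1) for _ in range(4)]
rollout = attention_rollout(layers)
print(rollout.shape)   # torch.Size([197, 197]); row i attributes token i over the input tokens
```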
[170] CBDiff:Conditional Bernoulli Diffusion Models for Image Forgery Localization
Zhou Lei, Pan Gang, Wang Jiahao, Sun Di
Main category: cs.CV
TL;DR: CBDiff introduces a conditional Bernoulli diffusion model for image forgery localization that generates multiple diverse localization maps instead of a single deterministic one, improving reliability and addressing uncertainty in tampered regions.
Details
Motivation: Existing methods produce single deterministic localization maps that lack precision and reliability for high-stakes applications like forensic analysis and security surveillance.Method: CBDiff uses a conditional Bernoulli diffusion model with Bernoulli noise to reflect binary/sparse properties of forgery masks, and incorporates Time-Step Cross-Attention (TSCAttention) for semantic feature guidance with temporal steps.
Result: Extensive experiments on eight benchmark datasets show CBDiff significantly outperforms state-of-the-art methods.
Conclusion: CBDiff demonstrates strong potential for real-world deployment by enhancing prediction credibility and mitigating error risks through multiple diverse localization maps.
Abstract: Image Forgery Localization (IFL) is a crucial task in image forensics, aimed at accurately identifying manipulated or tampered regions within an image at the pixel level. Existing methods typically generate a single deterministic localization map, which often lacks the precision and reliability required for high-stakes applications such as forensic analysis and security surveillance. To enhance the credibility of predictions and mitigate the risk of errors, we introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a forged image, CBDiff generates multiple diverse and plausible localization maps, thereby offering a richer and more comprehensive representation of the forgery distribution. This approach addresses the uncertainty and variability inherent in tampered regions. Furthermore, CBDiff innovatively incorporates Bernoulli noise into the diffusion process to more faithfully reflect the inherent binary and sparse properties of forgery masks. Additionally, CBDiff introduces a Time-Step Cross-Attention (TSCAttention), which is specifically designed to leverage semantic feature guidance with temporal steps to improve manipulation detection. Extensive experiments on eight publicly benchmark datasets demonstrate that CBDiff significantly outperforms existing state-of-the-art methods, highlighting its strong potential for real-world deployment.
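One way to picture Bernoulli noising of a binary forgery mask: each pixel is pushed toward an uninformative coin flip according to a noise schedule. The schedule and parameterization below are illustrative assumptions, not the paper's exact forward process:

```python
import torch

def bernoulli_noise_step(x_prev, beta_t):
    """x_prev: binary mask in {0,1}; with prob. beta_t a pixel is resampled as a fair coin."""
    p_one = (1.0 - beta_t) * x_prev + beta_t * 0.5   # probability the pixel is 1 after this step
    return torch.bernoulli(p_one)

mask = (torch.rand(1, 64, 64) > 0.9).float()         # sparse toy forgery mask
betas = torch.linspace(1e-3, 0.2, 50)                # toy linear noise schedule
x = mask
for beta in betas:
    x = bernoulli_noise_step(x, beta.item())
print(mask.mean().item(), x.mean().item())           # mask mean drifts toward 0.5 as noise grows
```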
[171] XBench: A Comprehensive Benchmark for Visual-Language Explanations in Chest Radiography
Haozhe Luo, Shelley Zixin Shu, Ziyu Zhou, Sebastian Otalora, Mauricio Reyes
Main category: cs.CV
TL;DR: This paper presents XBench, the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style vision-language models, revealing limitations in reliable grounding for medical applications.
Details
Motivation: Vision-language models show strong zero-shot performance in medical image understanding, but their grounding ability (alignment between textual concepts and visual evidence) remains underexplored, which is essential for clinical interpretability and adoption.Method: Generated visual explanations using cross-attention and similarity-based localization maps, and quantitatively assessed their alignment with radiologist-annotated regions across multiple pathologies using seven CLIP-style VLM variants.
Result: (1) VLMs show reasonable localization for large/well-defined pathologies but degrade for small/diffuse lesions; (2) Chest X-ray-specific pretraining improves alignment; (3) Recognition and grounding abilities are strongly correlated.
Conclusion: Current VLMs fall short in clinically reliable grounding despite strong recognition ability, highlighting the need for targeted interpretability benchmarks before medical deployment.
Abstract: Vision-language models (VLMs) have recently shown remarkable zero-shot performance in medical image understanding, yet their grounding ability, the extent to which textual concepts align with visual evidence, remains underexplored. In the medical domain, however, reliable grounding is essential for interpretability and clinical adoption. In this work, we present the first systematic benchmark for evaluating cross-modal interpretability in chest X-rays across seven CLIP-style VLM variants. We generate visual explanations using cross-attention and similarity-based localization maps, and quantitatively assess their alignment with radiologist-annotated regions across multiple pathologies. Our analysis reveals that: (1) while all VLM variants demonstrate reasonable localization for large and well-defined pathologies, their performance substantially degrades for small or diffuse lesions; (2) models that are pretrained on chest X-ray-specific datasets exhibit improved alignment compared to those trained on general-domain data. (3) The overall recognition ability and grounding ability of the model are strongly correlated. These findings underscore that current VLMs, despite their strong recognition ability, still fall short in clinically reliable grounding, highlighting the need for targeted interpretability benchmarks before deployment in medical practice. XBench code is available at https://github.com/Roypic/Benchmarkingattention
[172] Beyond sparse denoising in frames: minimax estimation with a scattering transform
Nathanaël Cuvelle–Magar, Stéphane Mallat
Main category: cs.CV
TL;DR: The paper introduces a denoising estimator using scattering coefficients that reaches minimax asymptotic bounds for cartoon images with unknown Lipschitz exponents α ≤ 2, bridging harmonic analysis and deep learning approaches.
Details
Motivation: Traditional sparse estimators in frames are suboptimal for complex signal regularities, particularly for cartoon images with unknown Lipschitz exponents. Deep neural networks show better results but lack theoretical understanding.Method: Jointly minimize and maximize ℓ¹ norms of different subsets of scattering coefficients, which are computed by transforming modulus of wavelet coefficients with a second wavelet transform.
Result: The denoising estimator reaches minimax asymptotic bounds for cartoon images for all Lipschitz exponents α ≤ 2, as demonstrated through numerical experiments.
Conclusion: This approach provides a harmonic analysis method for noise suppression and geometric regularity specification, creating a mathematical bridge between harmonic analysis and deep convolutional network denoising.
Abstract: A considerable amount of research in harmonic analysis has been devoted to non-linear estimators of signals contaminated by additive Gaussian noise. They are implemented by thresholding coefficients in a frame, which provide a sparse signal representation, or by minimising their $\ell^1$ norm. However, sparse estimators in frames are not sufficiently rich to adapt to complex signal regularities. For cartoon images whose edges are piecewise $\bf C^\alpha$ curves, wavelet, curvelet and Xlet frames are suboptimal if the Lipschitz exponent $\alpha \leq 2$ is an unknown parameter. Deep convolutional neural networks have recently obtained much better numerical results, which reach the minimax asymptotic bounds for all $\alpha$. Wavelet scattering coefficients have been introduced as simplified convolutional neural network models. They are computed by transforming the modulus of wavelet coefficients with a second wavelet transform. We introduce a denoising estimator by jointly minimising and maximising the $\ell^1$ norms of different subsets of scattering coefficients. We prove that these $\ell^1$ norms capture different types of geometric image regularity. Numerical experiments show that this denoising estimator reaches the minimax asymptotic bound for cartoon images for all Lipschitz exponents $\alpha \leq 2$. We state this numerical result as a mathematical conjecture. It provides a different harmonic analysis approach to suppress noise from signals, and to specify the geometric regularity of functions. It also opens a mathematical bridge between harmonic analysis and denoising estimators with deep convolutional network.
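Schematically, the estimator trades off the ℓ¹ norms of two disjoint sets of scattering coefficients with opposite signs. A hedged rendering of that objective, where the coefficient index sets Ω_min, Ω_max and the weights λ₁, λ₂ > 0 are placeholders and the precise formulation is the one in the paper:

```latex
\hat{x} \in \arg\min_{x}\;
  \frac{1}{2}\,\lVert y - x \rVert_2^2
  \;+\; \lambda_1 \sum_{p \in \Omega_{\mathrm{min}}} \lvert (Sx)_p \rvert
  \;-\; \lambda_2 \sum_{q \in \Omega_{\mathrm{max}}} \lvert (Sx)_q \rvert ,
% (Sx)_p denotes the p-th scattering coefficient of the candidate image x,
% i.e. a second wavelet transform applied to the modulus of wavelet coefficients.
```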
[173] Pragmatic Heterogeneous Collaborative Perception via Generative Communication Mechanism
Junfei Zhou, Penglin Dai, Quanmin Wei, Bingyi Liu, Xiao Wu, Jianping Wang
Main category: cs.CV
TL;DR: GenComm enables seamless perception across heterogeneous multi-agent systems through feature generation without retraining, achieving 81% reduction in computational cost and parameters when adding new agents.
Details
Motivation: Address domain gaps in heterogeneous multi-agent collaboration caused by sensor/model differences, overcoming limitations of intrusive retraining and high computational costs in existing methods.Method: Uses Generative Communication with Deformable Message Extractor for spatial message extraction, Spatial-Aware Feature Generator with conditional diffusion model for aligned feature generation, and Channel Enhancer for feature refinement.
Result: Outperforms state-of-the-art methods on OPV2V-H, DAIR-V2X and V2X-Real datasets with 81% reduction in computational cost and parameter count when incorporating new agents.
Conclusion: GenComm provides an effective solution for heterogeneous multi-agent collaboration through non-intrusive feature generation and lightweight spatial alignment, enabling scalable and efficient perception.
Abstract: Multi-agent collaboration enhances the perception capabilities of individual agents through information sharing. However, in real-world applications, differences in sensors and models across heterogeneous agents inevitably lead to domain gaps during collaboration. Existing approaches based on adaptation and reconstruction fail to support pragmatic heterogeneous collaboration due to two key limitations: (1) Intrusive retraining of the encoder or core modules disrupts the established semantic consistency among agents; and (2) accommodating new agents incurs high computational costs, limiting scalability. To address these challenges, we present a novel Generative Communication mechanism (GenComm) that facilitates seamless perception across heterogeneous multi-agent systems through feature generation, without altering the original network, and employs lightweight numerical alignment of spatial information to efficiently integrate new agents at minimal cost. Specifically, a tailored Deformable Message Extractor is designed to extract spatial message for each collaborator, which is then transmitted in place of intermediate features. The Spatial-Aware Feature Generator, utilizing a conditional diffusion model, generates features aligned with the ego agent’s semantic space while preserving the spatial information of the collaborators. These generated features are further refined by a Channel Enhancer before fusion. Experiments conducted on the OPV2V-H, DAIR-V2X and V2X-Real datasets demonstrate that GenComm outperforms existing state-of-the-art methods, achieving an 81% reduction in both computational cost and parameter count when incorporating new agents. Our code is available at https://github.com/jeffreychou777/GenComm.
[174] Augmenting Moment Retrieval: Zero-Dependency Two-Stage Learning
Zhengxuan Wei, Jiajin Tang, Sibei Yang
Main category: cs.CV
TL;DR: AMR is a zero-external-dependency Augmented Moment Retrieval framework that addresses data scarcity, boundary ambiguity, and fine-grained semantic discrimination issues in moment retrieval without requiring additional manual labeling.
Details
Motivation: To overcome three critical bottlenecks in existing Moment Retrieval methods: data scarcity forcing shallow keyword-feature associations, boundary ambiguity in transition regions, and insufficient discrimination of fine-grained semantics.Method: Proposes a two-stage training framework: (1) Cold-start stage with curriculum learning on augmented data to build foundational boundary/semantic awareness, (2) Distillation stage with dual query sets (Original Queries for DETR-based localization and Active Queries for dynamic adaptation) and cross-stage distillation loss for consistency.
Result: Experiments on multiple benchmarks show that AMR achieves improved performance over prior state-of-the-art approaches.
Conclusion: AMR successfully resolves boundary ambiguity and semantic confusion without additional data, preserves enhanced discrimination capabilities, and enables real-world generalization while preventing knowledge forgetting.
Abstract: Existing Moment Retrieval methods face three critical bottlenecks: (1) data scarcity forces models into shallow keyword-feature associations; (2) boundary ambiguity in transition regions between adjacent events; (3) insufficient discrimination of fine-grained semantics (e.g., distinguishing "kicking" vs. "throwing" a ball). In this paper, we propose a zero-external-dependency Augmented Moment Retrieval framework, AMR, designed to overcome local optima caused by insufficient data annotations and the lack of robust boundary and semantic discrimination capabilities. AMR is built upon two key insights: (1) it resolves ambiguous boundary information and semantic confusion in existing annotations without additional data (avoiding costly manual labeling), and (2) it preserves boundary and semantic discriminative capabilities enhanced by training while generalizing to real-world scenarios, significantly improving performance. Furthermore, we propose a two-stage training framework with cold-start and distillation adaptation. The cold-start stage employs curriculum learning on augmented data to build foundational boundary/semantic awareness. The distillation stage introduces dual query sets: Original Queries maintain DETR-based localization using frozen Base Queries from the cold-start model, while Active Queries dynamically adapt to real-data distributions. A cross-stage distillation loss enforces consistency between Original and Base Queries, preventing knowledge forgetting while enabling real-world generalization. Experiments on multiple benchmarks show that AMR achieves improved performance over prior state-of-the-art approaches.
[175] MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom
Yifan Li, Fenghe Tang, Yingtai Li, Shaohua Kevin Zhou
Main category: cs.CV
TL;DR: MedReason-R1 is a medical VLM with explicit reasoning process for CT disease diagnosis, using zoom-in disease ROI areas and GRPO reinforcement learning to achieve state-of-the-art performance.
Details
Motivation: General-purpose VLMs perform poorly in medical domain due to lack of specialized medical datasets and neglect of diagnostic process from coarse to fine-grained.Method: Constructed CT-RATE-VQA dataset (84K QA pairs) and proposed MedReason-R1 with zoom-in disease ROI embedding strategy and GRPO reinforcement learning framework for reasoning without manual annotations.
Result: MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization, outperforming recent general-purpose and medical VLMs.
Conclusion: The proposed approach effectively addresses medical VLM limitations through specialized dataset construction and explicit reasoning processes, demonstrating superior diagnostic capabilities.
Abstract: General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model’s diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
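GRPO, referenced above, scores each of several sampled answers to the same prompt relative to the group's mean reward rather than a learned value function. A minimal sketch of the group-relative advantage computation; the reward values and grouping below are assumptions for illustration:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """rewards: (groups, samples) rewards for several sampled answers per prompt.

    Returns advantages normalized within each group, used to weight the policy-gradient loss.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Two prompts, four sampled diagnoses each; reward 1 if the answer matched the report.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 1.0, 0.0]])
print(group_relative_advantages(rewards))
```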
[176] Re-Activating Frozen Primitives for 3D Gaussian Splatting
Yuxin Cheng, Binxiao Huang, Wenyong Zhou, Taiqiang Wu, Zhengwu Liu, Graziano Chesi, Ngai Wong
Main category: cs.CV
TL;DR: ReAct-GS addresses over-reconstruction artifacts in 3D Gaussian Splatting by introducing importance-aware densification and parameter perturbation mechanisms to reactivate frozen primitives in complex regions.
Details
Motivation: 3D-GS struggles with complex scenes due to over-reconstruction artifacts like local blurring and needle-shape distortions, caused by gradient magnitude dilution and primitive frozen phenomenon where essential Gaussian densification is inhibited.Method: Introduces ReAct-GS with two key components: (1) importance-aware densification criterion using α-blending weights from multiple viewpoints, and (2) re-activation mechanism with adaptive parameter perturbations to revitalize frozen primitives.
Result: Effectively eliminates over-reconstruction artifacts, achieves state-of-the-art performance on novel view synthesis metrics while preserving geometric details, and shows consistent improvements when integrated with other 3D-GS variants like Pixel-GS.
Conclusion: ReAct-GS successfully addresses fundamental limitations in 3D-GS through re-activation principles, demonstrating broad applicability and improved performance in complex scene reconstruction.
Abstract: 3D Gaussian Splatting (3D-GS) achieves real-time photorealistic novel view synthesis, yet struggles with complex scenes due to over-reconstruction artifacts, manifesting as local blurring and needle-shape distortions. While recent approaches attribute these issues to insufficient splitting of large-scale Gaussians, we identify two fundamental limitations: gradient magnitude dilution during densification and the primitive frozen phenomenon, where essential Gaussian densification is inhibited in complex regions while suboptimally scaled Gaussians become trapped in local optima. To address these challenges, we introduce ReAct-GS, a method founded on the principle of re-activation. Our approach features: (1) an importance-aware densification criterion incorporating $\alpha$-blending weights from multiple viewpoints to re-activate stalled primitive growth in complex regions, and (2) a re-activation mechanism that revitalizes frozen primitives through adaptive parameter perturbations. Comprehensive experiments across diverse real-world datasets demonstrate that ReAct-GS effectively eliminates over-reconstruction artifacts and achieves state-of-the-art performance on standard novel view synthesis metrics while preserving intricate geometric details. Additionally, our re-activation mechanism yields consistent improvements when integrated with other 3D-GS variants such as Pixel-GS, demonstrating its broad applicability.
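A minimal sketch of what an importance-aware densification rule of this kind might look like, assuming per-primitive α-blending weights accumulated across views are available (all names and thresholds here are illustrative, not the paper's implementation):

```python
import torch

def select_primitives_to_densify(blend_weights, grad_norms,
                                 importance_thresh=0.5, grad_thresh=2e-4):
    """blend_weights: (num_views, num_gaussians) alpha-blending weights per view.
    grad_norms: (num_gaussians,) accumulated positional gradient magnitudes.
    Returns a boolean mask of primitives to densify."""
    importance = blend_weights.max(dim=0).values          # peak contribution across views
    standard = grad_norms > grad_thresh                    # classic gradient-based rule
    reactivated = (importance > importance_thresh) & ~standard  # "frozen" yet important
    return standard | reactivated
```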
[177] From Forecasting to Planning: Policy World Model for Collaborative State-Action Prediction
Zhida Zhao, Talas Fu, Yifan Wang, Lijun Wang, Huchuan Lu
Main category: cs.CV
TL;DR: Policy World Model (PWM) integrates world modeling and trajectory planning in autonomous driving, using action-free future state forecasting to enhance planning with learned world knowledge.
Details
Motivation: Current driving world models are mostly used for simulation and decoupled from planning, lacking synergistic integration between world modeling and trajectory planning.
Method: Proposes PWM with collaborative state-action prediction, action-free future state forecasting, dynamically enhanced parallel token generation, context-guided tokenizer, and adaptive dynamic focal loss.
Result: Matches or exceeds state-of-the-art approaches using only front camera input, outperforming methods that rely on multi-view and multi-modal inputs.
Conclusion: PWM successfully integrates world modeling and planning, enabling human-like anticipatory perception and more reliable planning performance in autonomous driving.
Abstract: Despite remarkable progress in driving world models, their potential for autonomous systems remains largely untapped: the world models are mostly learned for world simulation and decoupled from trajectory planning. While recent efforts aim to unify world modeling and planning in a single framework, the synergistic facilitation mechanism of world modeling for planning still requires further exploration. In this work, we introduce a new driving paradigm named Policy World Model (PWM), which not only integrates world modeling and trajectory planning within a unified architecture, but is also able to benefit planning using the learned world knowledge through the proposed action-free future state forecasting scheme. Through collaborative state-action prediction, PWM can mimic the human-like anticipatory perception, yielding more reliable planning performance. To facilitate the efficiency of video forecasting, we further introduce a dynamically enhanced parallel token generation mechanism, equipped with a context-guided tokenizer and an adaptive dynamic focal loss. Despite utilizing only front camera input, our method matches or exceeds state-of-the-art approaches that rely on multi-view and multi-modal inputs. Code and model weights will be released at https://github.com/6550Zhao/Policy-World-Model.
[178] I Spy With My Model’s Eye: Visual Search as a Behavioural Test for MLLMs
John Burden, Jonathan Prunty, Ben Slater, Matthieu Tehenan, Greg Davis, Lucy Cheke
Main category: cs.CV
TL;DR: The paper adapts visual search paradigms from cognitive psychology to test whether multimodal large language models (MLLMs) exhibit human-like “pop-out” effects in visual processing, finding they show similar patterns in feature detection and capacity limits.
Details
Motivation: Current black-box evaluations of MLLMs focus on task accuracy but reveal little about underlying visual processing mechanisms. The researchers wanted to understand if MLLMs exhibit human-like perceptual phenomena like pop-out effects.
Method: Used controlled experiments with classic visual search paradigms targeting color, size and lighting features. Applied disjunctive (single feature) and conjunctive (multiple feature) search tasks. Also used targeted fine-tuning and mechanistic interpretability analyses.
Result: Advanced MLLMs exhibit human-like pop-out effects in color or size-based disjunctive search, show capacity limits for conjunctive search, and incorporate natural scene priors like lighting direction into object representations.
Conclusion: Visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs, revealing similarities to human visual processing mechanisms.
Abstract: Multimodal large language models (MLLMs) achieve strong performance on vision-language tasks, yet their visual processing is opaque. Most black-box evaluations measure task accuracy, but reveal little about underlying mechanisms. Drawing on cognitive psychology, we adapt classic visual search paradigms – originally developed to study human perception – to test whether MLLMs exhibit the ``pop-out’’ effect, where salient visual features are detected independently of distractor set size. Using controlled experiments targeting colour, size and lighting features, we find that advanced MLLMs exhibit human-like pop-out effects in colour or size-based disjunctive (single feature) search, as well as capacity limits for conjunctive (multiple feature) search. We also find evidence to suggest that MLLMs, like humans, incorporate natural scene priors such as lighting direction into object representations. We reinforce our findings using targeted fine-tuning and mechanistic interpretability analyses. Our work shows how visual search can serve as a cognitively grounded diagnostic tool for evaluating perceptual capabilities in MLLMs.
[179] Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation
Zihao Chen, Yi Zhou, Xudong Jiang, Li Chen, Leopold Schmetterer, Bingyao Tan, Jun Cheng
Main category: cs.CV
TL;DR: CST is a framework that preserves fine curvilinear structures during unpaired image-to-image translation in medical imaging by integrating structure consistency into training.
Details
Motivation: Existing unpaired image translation methods often distort fine curvilinear structures like microvasculature, which is critical in ophthalmic and vascular imaging where subtle morphological changes have clinical significance.
Method: CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly integrated into existing methods like CycleGAN and UNSB as representative backbones.
Result: Comprehensive evaluation across three imaging modalities (optical coherence tomography angiography, color fundus, and X-ray coronary angiography) shows CST improves translation fidelity and achieves state-of-the-art performance.
Conclusion: By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
Abstract: Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities: optical coherence tomography angiography, color fundus and X-ray coronary angiography demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
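A minimal sketch of how structure consistency of this kind could be added to an unpaired translation objective, assuming a curvilinear extractor (for example, a frozen vessel segmenter) stands in for CST's extraction module; the paper's actual supervision may differ:

```python
import torch
import torch.nn.functional as F

def curvilinear_consistency_loss(src, fake, extractor):
    """Penalize changes in curvilinear (e.g. vessel) maps between the input
    image and its translation. `extractor` stands in for CST's curvilinear
    extraction module (e.g. a frozen segmentation net or vesselness filter)."""
    with torch.no_grad():
        target_map = extractor(src)    # curvilinear map of the real image
    fake_map = extractor(fake)         # map of the translated image (gradients flow)
    return F.l1_loss(fake_map, target_map)

# total_loss = gan_loss + cycle_loss + lambda_cst * curvilinear_consistency_loss(x, G(x), E)
```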
[180] Explainable Face Presentation Attack Detection via Ensemble-CAM
Rashik Shadman, M G Sarwar Murshed, Faraz Hussain
Main category: cs.CV
TL;DR: Proposes Ensemble-CAM, a novel technique for providing visual explanations in deep learning-based face presentation attack detection systems to improve transparency and trustworthiness.
Details
Motivation: Presentation attacks using fake biometric data pose security threats, and while deep learning models are effective for detection, they operate as black boxes with opaque decisions. Explainability techniques are needed to understand model behavior and identify key regions that influence decisions.
Method: Developed Ensemble-CAM, a novel technique for generating visual explanations of decisions made by deep learning-based face presentation attack detection systems.
Result: The proposed method provides visual explanations that help understand why biometric images are classified as real or fake, highlighting key regions influencing the system’s decisions.
Conclusion: Ensemble-CAM enhances the transparency and trustworthiness of DL-based face PAD systems by providing better understanding of their behavior through visual explanations.
Abstract: Presentation attacks represent a critical security threat where adversaries use fake biometric data, such as face, fingerprint, or iris images, to gain unauthorized access to protected systems. Various presentation attack detection (PAD) systems have been designed leveraging deep learning (DL) models to mitigate this type of threat. Despite their effectiveness, most of the DL models function as black boxes - their decisions are opaque to their users. The purpose of explainability techniques is to provide detailed information about the reason behind the behavior or decision of DL models. In particular, visual explanation is necessary to better understand the decisions or predictions of DL-based PAD systems and determine the key regions due to which a biometric image is considered real or fake by the system. In this work, a novel technique, Ensemble-CAM, is proposed for providing visual explanations for the decisions made by deep learning-based face PAD systems. Our goal is to improve DL-based face PAD systems by providing a better understanding of their behavior. Our provided visual explanations will enhance the transparency and trustworthiness of DL-based face PAD systems.
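The summary does not detail the aggregation rule, but the name suggests fusing class activation maps from an ensemble. A minimal, hypothetical sketch that averages normalized Grad-CAM-style heatmaps from several PAD models:

```python
import numpy as np

def ensemble_cam(cams):
    """Aggregate CAM heatmaps from several face-PAD models into one explanation.
    cams: list of (H, W) arrays (e.g. Grad-CAM outputs resized to the input size)."""
    fused = np.zeros_like(cams[0], dtype=np.float32)
    for cam in cams:
        cam = cam - cam.min()
        cam = cam / (cam.max() + 1e-8)   # normalize each map to [0, 1]
        fused += cam
    fused /= len(cams)
    return fused                          # overlay on the face image for inspection
```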
[181] LyTimeT: Towards Robust and Interpretable State-Variable Discovery
Kuai Yu, Crystal Su, Xiang Liu, Judah Goldfeder, Mingyuan Shao, Hod Lipson
Main category: cs.CV
TL;DR: LyTimeT is a two-phase framework that extracts interpretable dynamical variables from videos by combining spatio-temporal attention with stability constraints, achieving robust latent representations and accurate long-term predictions.
Details
Motivation: Extracting true dynamical variables from high-dimensional video is challenging due to distracting visual factors like background motion, occlusions, and texture changes that obscure the underlying dynamics.
Method: Two-phase approach: Phase 1 uses a TimeSformer-based autoencoder with global attention to focus on dynamically relevant regions. Phase 2 selects meaningful dimensions via linear correlation analysis and refines the transition dynamics with a Lyapunov-based stability regularizer that enforces contraction.
Result: Achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers lowest analytical mean squared error compared to CNN-based and transformer-only baselines on synthetic and real-world systems.
Conclusion: Combining spatio-temporal attention with stability constraints yields predictive models that are both accurate and physically interpretable for dynamical system analysis from video data.
Abstract: Extracting the true dynamical variables of a system from high-dimensional video is challenging due to distracting visual factors such as background motion, occlusions, and texture changes. We propose LyTimeT, a two-phase framework for interpretable variable extraction that learns robust and stable latent representations of dynamical systems. In Phase 1, LyTimeT employs a spatio-temporal TimeSformer-based autoencoder that uses global attention to focus on dynamically relevant regions while suppressing nuisance variation, enabling distraction-robust latent state learning and accurate long-horizon video prediction. In Phase 2, we probe the learned latent space, select the most physically meaningful dimensions using linear correlation analysis, and refine the transition dynamics with a Lyapunov-based stability regularizer to enforce contraction and reduce error accumulation during roll-outs. Experiments on five synthetic benchmarks and four real-world dynamical systems, including chaotic phenomena, show that LyTimeT achieves mutual information and intrinsic dimension estimates closest to ground truth, remains invariant under background perturbations, and delivers the lowest analytical mean squared error among CNN-based (TIDE) and transformer-only baselines. Our results demonstrate that combining spatio-temporal attention with stability constraints yields predictive models that are not only accurate but also physically interpretable.
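A minimal sketch of a Lyapunov-style contraction regularizer under the stated idea (penalize latent transitions that expand distances between nearby states); the exact form used in LyTimeT may differ:

```python
import torch

def contraction_regularizer(f, z, rho=0.98, noise_scale=0.1):
    """Encourage the latent transition f to be contractive (Lyapunov-style):
    nearby latent states should not drift apart after one step."""
    z_pert = z + noise_scale * torch.randn_like(z)      # a neighboring state
    dist_before = (z - z_pert).norm(dim=-1)
    dist_after = (f(z) - f(z_pert)).norm(dim=-1)
    violation = torch.relu(dist_after - rho * dist_before)
    return violation.mean()
```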
[182] Adaptive Distribution-aware Quantization for Mixed-Precision Neural Networks
Shaohang Jia, Zhiyong Huang, Zhi Yu, Mingyang Hou, Shuai Miao, Han Yang
Main category: cs.CV
TL;DR: ADQ is a mixed-precision quantization framework that addresses non-uniform activation distributions and static weight codebooks through adaptive weight quantization with quantile-based initialization, EMA-based codebook adaptation, and sensitivity-informed mixed-precision allocation.
Details
Motivation: Existing QAT methods face challenges with highly non-uniform activation distributions and static, mismatched weight quantization codebooks that limit deployment on resource-constrained devices.
Method: Proposes Adaptive Distribution-aware Quantization (ADQ) with three key innovations: quantile-based codebook initialization aligned with weight distribution, EMA-based online codebook adaptation to track distribution shifts, and sensitivity-informed mixed-precision allocation, plus hardware-friendly non-uniform-to-uniform mapping for activations.
Result: On ImageNet, ADQ achieves 71.512% Top-1 accuracy for ResNet-18 with only 2.81 bits average bit-width, outperforming state-of-the-art methods. Ablation studies on CIFAR-10 validate individual component contributions.
Conclusion: ADQ effectively addresses key quantization challenges through adaptive distribution-aware techniques, enabling high-performance neural network deployment on resource-constrained devices with significantly reduced bit-widths.
Abstract: Quantization-Aware Training (QAT) is a critical technique for deploying deep neural networks on resource-constrained devices. However, existing methods often face two major challenges: the highly non-uniform distribution of activations and the static, mismatched codebooks used in weight quantization. To address these challenges, we propose Adaptive Distribution-aware Quantization (ADQ), a mixed-precision quantization framework that employs a differentiated strategy. The core of ADQ is a novel adaptive weight quantization scheme comprising three key innovations: (1) a quantile-based initialization method that constructs a codebook closely aligned with the initial weight distribution; (2) an online codebook adaptation mechanism based on Exponential Moving Average (EMA) to dynamically track distributional shifts; and (3) a sensitivity-informed strategy for mixed-precision allocation. For activations, we integrate a hardware-friendly non-uniform-to-uniform mapping scheme. Comprehensive experiments validate the effectiveness of our method. On ImageNet, ADQ enables a ResNet-18 to achieve 71.512% Top-1 accuracy with an average bit-width of only 2.81 bits, outperforming state-of-the-art methods under comparable conditions. Furthermore, detailed ablation studies on CIFAR-10 systematically demonstrate the individual contributions of each innovative component, validating the rationale and effectiveness of our design.
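The quantile-based initialization and EMA-based adaptation can be illustrated with a short sketch (illustrative only; bit-width allocation and the activation mapping are omitted):

```python
import torch

def quantile_codebook(weights: torch.Tensor, num_levels: int) -> torch.Tensor:
    """Initialize a quantization codebook at the quantiles of the weight
    distribution, so codewords are dense where weights are dense."""
    q = torch.linspace(0, 1, num_levels)
    return torch.quantile(weights.flatten(), q)

def ema_update_codebook(codebook, weights, momentum=0.99):
    """Track distribution shift during QAT: move each codeword toward the mean
    of the weights currently assigned to it (exponential moving average)."""
    assign = torch.argmin((weights.flatten()[:, None] - codebook[None, :]).abs(), dim=1)
    new_codebook = codebook.clone()
    for k in range(codebook.numel()):
        members = weights.flatten()[assign == k]
        if members.numel() > 0:
            new_codebook[k] = momentum * codebook[k] + (1 - momentum) * members.mean()
    return new_codebook
```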
[183] OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation
Guowei Xu, Yuxuan Bian, Ailing Zeng, Mingyi Shi, Shaoli Huang, Wen Li, Lixin Duan, Qiang Xu
Main category: cs.CV
TL;DR: OmniMotion-X is a multimodal framework for whole-body human motion generation using an autoregressive diffusion transformer. It supports diverse tasks like text-to-motion, music-to-dance, and speech-to-gesture, with enhanced consistency through reference motion conditioning and progressive training strategy.
Details
Motivation: To create a unified framework that can handle diverse multimodal motion generation tasks efficiently while maintaining realistic animations with consistent content, style, and temporal dynamics.
Method: Uses autoregressive diffusion transformer in sequence-to-sequence manner; introduces reference motion as conditioning signal; employs progressive weak-to-strong mixed-condition training strategy; constructs OmniMoCap-X dataset with 28 MoCap sources standardized to SMPL-X format; uses GPT-4o for automatic hierarchical captioning.
Result: Significantly surpasses existing methods with state-of-the-art performance across multiple multimodal tasks; enables interactive generation of realistic, coherent, and controllable long-duration motions.
Conclusion: OmniMotion-X provides a versatile and effective solution for multimodal human motion generation, demonstrating superior performance through unified architecture, novel conditioning strategies, and comprehensive dataset construction.
Abstract: This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.
[184] Class-Aware Prototype Learning with Negative Contrast for Test-Time Adaptation of Vision-Language Models
Xiaozhen Qiao, Jingkai Zhao, Yuqiu Jiang, Xianda Guo, Zhe Sun, Hongyuan Zhang, Xuelong Li
Main category: cs.CV
TL;DR: CPL-NC is a lightweight test-time adaptation framework for Vision-Language Models that addresses prototype degradation in long-tailed distributions and confusion between similar classes through class-aware prototype caching and negative contrastive learning.
Details
Motivation: Vision-Language Models suffer performance drops when deployment distributions diverge from training, and existing TTA methods fail to address prototype degradation in long-tailed distributions and confusion between semantically similar classes.
Method: Proposes CPL-NC with Class-Aware Prototype Cache Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, and Negative Contrastive Learning Mechanism that identifies and constrains hard visual-textual negatives. Uses asymmetric optimization to refine only textual prototypes while anchoring on stable visual features.
Result: Experiments on 15 benchmarks show CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
Conclusion: CPL-NC effectively enhances VLMs’ generalization under distribution shifts by addressing key challenges of prototype degradation and class confusion through adaptive prototype management and negative contrastive learning.
Abstract: Vision-Language Models (VLMs) demonstrate impressive zero-shot generalization through large-scale image-text pretraining, yet their performance can drop once the deployment distribution diverges from the training distribution. To address this, Test-Time Adaptation (TTA) methods update models using unlabeled target data. However, existing approaches often ignore two key challenges: prototype degradation in long-tailed distributions and confusion between semantically similar classes. To tackle these issues, we propose \textbf{C}lass-Aware \textbf{P}rototype \textbf{L}earning with \textbf{N}egative \textbf{C}ontrast(\textbf{CPL-NC}), a lightweight TTA framework designed specifically for VLMs to enhance generalization under distribution shifts. CPL-NC introduces a \textit{Class-Aware Prototype Cache} Module that dynamically adjusts per-class capacity based on test-time frequency and activation history, with a rejuvenation mechanism for inactive classes to retain rare-category knowledge. Additionally, a \textit{Negative Contrastive Learning} Mechanism identifies and constrains hard visual-textual negatives to improve class separability. The framework employs asymmetric optimization, refining only textual prototypes while anchoring on stable visual features. Experiments on 15 benchmarks show that CPL-NC consistently outperforms prior TTA methods across both ResNet-50 and ViT-B/16 backbones.
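A hypothetical sketch of a frequency-aware, per-class prototype cache in the spirit of the Class-Aware Prototype Cache Module (the capacity schedule and rejuvenation details are placeholders, not the paper's design):

```python
from collections import defaultdict, deque
import torch

class ClassAwarePrototypeCache:
    """Keep per-class feature caches whose capacity grows with how often a
    class is seen at test time; the prototype is the mean cached feature."""
    def __init__(self, base_capacity=4, max_capacity=32):
        self.base, self.max = base_capacity, max_capacity
        self.counts = defaultdict(int)
        self.cache = {}

    def update(self, cls: int, feat: torch.Tensor):
        self.counts[cls] += 1
        cap = min(self.max, self.base + self.counts[cls] // 10)  # frequency-aware capacity
        buf = self.cache.setdefault(cls, deque(maxlen=cap))
        if buf.maxlen != cap:                  # re-allocate when capacity changes
            buf = deque(buf, maxlen=cap)
            self.cache[cls] = buf
        buf.append(feat.detach())

    def prototype(self, cls: int) -> torch.Tensor:
        return torch.stack(list(self.cache[cls])).mean(dim=0)
```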
[185] Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan
Main category: cs.CV
TL;DR: Pico-Banana-400K is a large-scale 400K-image dataset for instruction-based image editing, created using Nano-Banana to generate diverse edit pairs from real OpenImages photographs, with systematic quality control and specialized subsets for multi-turn editing, preference learning, and instruction rewriting.
Details
Motivation: The research community lacks large-scale, high-quality, open-access datasets built from real images for text-guided image editing, which constrains progress despite advances in multimodal models like GPT-4o and Nano-Banana.
Method: Leveraged Nano-Banana to generate diverse edit pairs from real photographs in OpenImages collection, using fine-grained image editing taxonomy for comprehensive coverage, and employed MLLM-based quality scoring and careful curation to ensure content preservation and instruction faithfulness.
Result: Created Pico-Banana-400K with three specialized subsets: 72K multi-turn examples for sequential editing, 56K preference examples for alignment research, and paired long-short instructions for instruction rewriting and summarization capabilities.
Conclusion: Pico-Banana-400K provides a robust foundation for training and benchmarking next-generation text-guided image editing models by offering large-scale, high-quality, and task-rich resources.
Abstract: Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community’s progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
[186] How to Evaluate Monocular Depth Estimation?
Siyang Wu, Jack Nugent, Willow Yang, Jia Deng
Main category: cs.CV
TL;DR: This paper analyzes evaluation metrics for monocular depth estimation, revealing their limitations and proposing a new metric based on relative surface normals to address under-sensitivity to curvature perturbations.
Details
Motivation: There is a lack of standardization in evaluating monocular depth estimation, with many metrics whose trade-offs and behaviors are not well understood, particularly in relation to human judgment.
Method: The authors conducted quantitative analysis of existing metrics’ sensitivity to ground truth perturbations, compared them to human judgment, and introduced a new metric based on relative surface normals along with visualization tools and composite metric methods.
Result: Analysis revealed that existing metrics are severely under-sensitive to curvature perturbations (like making flat surfaces wavy), and the proposed new metric addresses this limitation.
Conclusion: The paper provides a principled approach to creating better composite metrics that align with human judgment and offers new tools for more comprehensive depth estimation evaluation.
Abstract: Monocular depth estimation is an important task with rapid progress, but how to evaluate it remains an open question, as evidenced by a lack of standardization in existing literature and a large selection of evaluation metrics whose trade-offs and behaviors are not well understood. This paper contributes a novel, quantitative analysis of existing metrics in terms of their sensitivity to various types of perturbations of ground truth, emphasizing comparison to human judgment. Our analysis reveals that existing metrics are severely under-sensitive to curvature perturbation such as making flat surfaces wavy. To remedy this, we introduce a new metric based on relative surface normals, along with new depth visualization tools and a principled method to create composite metrics with better human alignment. Code and data are available at: https://github.com/princeton-vl/evalmde.
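The proposed metric is based on relative surface normals. As a simplified illustration of why a normal-based measure is sensitive to curvature errors such as wavy flat surfaces, one can compare normals derived from predicted and ground-truth depth (an assumption-laden sketch, not the paper's exact metric):

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor) -> torch.Tensor:
    """Approximate surface normals from a (H, W) depth map via finite differences
    (assumes a roughly orthographic projection for simplicity)."""
    dz_dx = depth[:, 2:] - depth[:, :-2]
    dz_dy = depth[2:, :] - depth[:-2, :]
    dz_dx = F.pad(dz_dx, (1, 1))                 # pad width back to W
    dz_dy = F.pad(dz_dy, (0, 0, 1, 1))           # pad height back to H
    normals = torch.stack([-dz_dx, -dz_dy, torch.ones_like(depth)], dim=-1)
    return F.normalize(normals, dim=-1)

def mean_normal_angle_error(pred_depth, gt_depth):
    """Mean angular deviation (radians) between normals of predicted and GT depth;
    sensitive to curvature errors (e.g. wavy flat surfaces) that depth metrics miss."""
    n_pred, n_gt = normals_from_depth(pred_depth), normals_from_depth(gt_depth)
    cos = (n_pred * n_gt).sum(dim=-1).clamp(-1.0, 1.0)
    return torch.acos(cos).mean()
```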
[187] olmOCR 2: Unit Test Rewards for Document OCR
Jake Poznanski, Luca Soldaini, Kyle Lo
Main category: cs.CV
TL;DR: olmOCR 2 is a state-of-the-art OCR system using a 7B vision language model trained with RLVR (reinforcement learning with verifiable rewards) on synthetic documents, achieving best performance on English OCR tasks especially for math formulas, tables, and multi-column layouts.
Details
Motivation: To create a powerful OCR system that can convert digitized print documents into clean, naturally ordered plain text with improved handling of complex document elements.
Method: Used a 7B vision language model (olmOCR-2-7B-1025) trained with reinforcement learning with verifiable rewards (RLVR) using binary unit tests as rewards. Developed a pipeline for generating synthetic documents with diverse layouts and ground-truth HTML for scalable test case creation.
Result: Achieved state-of-the-art performance on olmOCR-Bench benchmark, with largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions.
Conclusion: The RLVR training approach with synthetic document generation enables superior OCR performance for complex document elements, and the model, data, and code are released under permissive open licenses.
Abstract: We present olmOCR 2, the latest in our family of powerful OCR systems for converting digitized print documents, like PDFs, into clean, naturally ordered plain text. olmOCR 2 is powered by olmOCR-2-7B-1025, a specialized, 7B vision language model (VLM) trained using reinforcement learning with verifiable rewards (RLVR), where our rewards are a diverse set of binary unit tests. To scale unit test creation, we develop a pipeline for generating synthetic documents with diverse and challenging layouts, known ground-truth HTML source code, and extracted test cases. We show that RL training on these test cases results in state-of-the-art performance on olmOCR-Bench, our English-language OCR benchmark, with the largest improvements in math formula conversion, table parsing, and multi-column layouts compared to previous versions. We release our model, data and code under permissive open licenses.
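A toy sketch of what a binary-unit-test reward for RLVR could look like, with hypothetical test kinds (presence, absence, reading order); the paper's tests are derived from known HTML ground truth and are richer than this:

```python
def unit_test_reward(ocr_text: str, tests: list) -> float:
    """RLVR-style reward: fraction of binary unit tests the OCR output passes.
    Each test is a (kind, pattern) pair checking a verifiable property."""
    passed = 0
    for kind, pattern in tests:
        if kind == "contains":                 # required string must appear
            passed += int(pattern in ocr_text)
        elif kind == "absent":                 # hallucinated text must not appear
            passed += int(pattern not in ocr_text)
        elif kind == "order":                  # reading-order check: a before b
            a, b = pattern
            passed += int(a in ocr_text and b in ocr_text
                          and ocr_text.index(a) < ocr_text.index(b))
    return passed / max(len(tests), 1)

tests = [("contains", "Table 1"), ("absent", "lorem ipsum"),
         ("order", ("Abstract", "References"))]
print(unit_test_reward("Abstract ... Table 1 ... References", tests))  # 1.0
```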
[188] Is This Tracker On? A Benchmark Protocol for Dynamic Tracking
Ilona Demler, Saumya Chauhan, Georgia Gkioxari
Main category: cs.CV
TL;DR: ITTO is a challenging benchmark suite for evaluating point tracking methods using real-world videos with complex motion, occlusions, and object diversity, revealing that current trackers struggle particularly with re-identification after occlusion.
Details
Motivation: Current point tracking benchmarks lack the motion complexity, occlusion patterns, and object diversity found in real-world scenes, limiting their ability to evaluate tracking methods under realistic conditions.
Method: Created ITTO benchmark using videos from existing datasets and egocentric recordings with high-quality human annotations collected through a multi-stage pipeline, then conducted rigorous analysis of state-of-the-art tracking methods.
Result: Existing trackers struggle with ITTO’s challenges, particularly in re-identifying points after occlusion, revealing critical failure modes in current tracking approaches.
Conclusion: ITTO serves as a foundation testbed to advance point tracking and guide development of more robust algorithms tailored to real-world dynamics.
Abstract: We introduce ITTO, a challenging new benchmark suite for evaluating and diagnosing the capabilities and limitations of point tracking methods. Our videos are sourced from existing datasets and egocentric real-world recordings, with high-quality human annotations collected through a multi-stage pipeline. ITTO captures the motion complexity, occlusion patterns, and object diversity characteristic of real-world scenes – factors that are largely absent in current benchmarks. We conduct a rigorous analysis of state-of-the-art tracking methods on ITTO, breaking down performance along key axes of motion complexity. Our findings reveal that existing trackers struggle with these challenges, particularly in re-identifying points after occlusion, highlighting critical failure modes. These results point to the need for new modeling approaches tailored to real-world dynamics. We envision ITTO as a foundation testbed for advancing point tracking and guiding the development of more robust tracking algorithms.
[189] LBL: Logarithmic Barrier Loss Function for One-class Classification
Xiaofeng Guo, Ziyang Jiang, Tianlei Wang, Shichen Zhang, Dinghan Hu, Jiuwen Cao
Main category: cs.CV
TL;DR: Proposes two novel one-class classification loss functions: LBL using logarithmic barrier for compact hypersphere boundaries, and LBLSig with Sigmoid relaxation for stable optimization.
Details
Motivation: One-class classification lacks effective loss functions for deep learning, and existing approaches need improvements in boundary optimization stability.
Method: First proposed the LBL loss, which uses a logarithmic barrier function to assign large gradients to margin samples. Then developed the LBLSig loss with a unilateral relaxation Sigmoid function to address LBL’s optimization instability.
Result: Experiments on different networks demonstrate the effectiveness of both proposed loss functions.
Conclusion: The proposed LBL and LBLSig losses provide effective solutions for one-class classification in deep learning with improved boundary optimization.
Abstract: One-class classification (OCC) aims to train a classifier solely on target data and attracts increasing attention due to its applicability in practice. Although OCC has seen many advances, effective OCC loss functions for deep learning are still lacking. In this paper, a novel logarithmic barrier function based OCC loss (LBL), which assigns large gradients to margin samples and thus derives a more compact hypersphere, is first proposed by smoothly approximating the OCC objective. However, the optimization of LBL may be unstable, especially when samples lie on the boundary, where the loss value tends to infinity. To address this issue, a smoother LBLSig loss is further proposed by utilizing a unilateral relaxation Sigmoid function. Experiments on different networks demonstrate the effectiveness of the proposed LBL and LBLSig. The source code can be found at https://github.com/ML-HDU/LBL_LBLSig.
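As an assumption-labeled sketch of the two ideas, a logarithmic-barrier-style hypersphere loss and a Sigmoid-relaxed variant might look as follows (the paper's exact formulations may differ):

```python
import torch

def lbl_loss(feats, center, radius=1.0, eps=1e-6):
    """Logarithmic-barrier-style one-class loss (illustrative form): penalize
    embeddings as they approach the hypersphere boundary, so gradients grow
    for margin samples and the learned sphere stays compact."""
    d2 = ((feats - center) ** 2).sum(dim=1)        # squared distance to center
    slack = (radius ** 2 - d2).clamp(min=eps)      # clamp avoids log of <= 0
    return -torch.log(slack / radius ** 2).mean()

def lblsig_loss(feats, center, radius=1.0, tau=0.1):
    """Smoother variant in the spirit of LBLSig: a sigmoid relaxation keeps the
    loss finite even when samples lie on or outside the boundary."""
    d2 = ((feats - center) ** 2).sum(dim=1)
    return -torch.log(torch.sigmoid((radius ** 2 - d2) / tau)).mean()
```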
[190] Brain3D: Generating 3D Objects from fMRI
Yuankun Yang, Li Zhang, Ziyang Xie, Zhiyuan Yuan, Jianfeng Feng, Xiatian Zhu, Yu-Gang Jiang
Main category: cs.CV
TL;DR: Brain3D is a novel method that generates 3D objects from fMRI brain signals, addressing limitations of existing 2D image generation approaches and providing biologically meaningful insights into human visual perception.
Details
Motivation: Current fMRI analysis methods are limited to generating 2D images and lack biological meaningfulness. fMRI analysis is challenging, costly, and requires professional training, creating a need for more sophisticated and practical approaches.
Method: Reformulates the task as fMRI-conditioned 3D object generation. Uses a two-stage architecture with progressive high-level information integration to handle noise and semantic signals. Takes fMRI data from subjects viewing 2D images and generates corresponding 3D object images.
Result: Superior performance over state-of-the-art 3D object generation methods. Captures distinct functionalities of human vision system regions (V1-V4, MTL) and their intricate interplay, aligning with neuroscience discoveries. Successfully identifies disordered brain regions in simulated scenarios.
Conclusion: Brain3D enables more sophisticated fMRI data modeling and provides biologically meaningful 3D outputs that reveal insights into human visual perception mechanisms, with potential applications in neuroscience research and brain disorder identification.
Abstract: Understanding the hidden mechanisms behind human’s visual perception is a fundamental question in neuroscience. To that end, investigating into the neural responses of human mind activities, such as functional Magnetic Resonance Imaging (fMRI), has been a significant research vehicle. However, analyzing fMRI signals is challenging, costly, daunting, and demanding for professional training. Despite remarkable progress in fMRI analysis, existing approaches are limited to generating 2D images and far away from being biologically meaningful and practically useful. Under this insight, we propose to generate visually plausible and functionally more comprehensive 3D outputs decoded from brain signals, enabling more sophisticated modeling of fMRI data. Conceptually, we reformulate this task as a {\em fMRI conditioned 3D object generation} problem. We design a novel 3D object representation learning method, Brain3D, that takes as input the fMRI data of a subject who was presented with a 2D image, and yields as output the corresponding 3D object images. The key capabilities of this model include tackling the noises with high-level semantic signals and a two-stage architecture design for progressive high-level information integration. Extensive experiments validate the superior capability of our model over previous state-of-the-art 3D object generation methods. Importantly, we show that our model captures the distinct functionalities of each region of human vision system as well as their intricate interplay relationships, aligning remarkably with the established discoveries in neuroscience. Further, preliminary evaluations indicate that Brain3D can successfully identify the disordered brain regions in simulated scenarios, such as V1, V2, V3, V4, and the medial temporal lobe (MTL) within the human visual system. Our data and code will be available at https://brain-3d.github.io/.
[191] ComDrive: Comfort-Oriented End-to-End Autonomous Driving
Junming Wang, Xingyu Zhang, Zebin Xing, Songen Gu, Xiaoyang Guo, Yang Hu, Ziying Song, Qian Zhang, Xiaoxiao Long, Wei Yin
Main category: cs.CV
TL;DR: ComDrive is the first comfort-oriented end-to-end autonomous driving system that generates temporally consistent and comfortable trajectories using a Conditional DDPM-based motion planner and dual-stream adaptive trajectory scorer.
Details
Motivation: Current imitation learning-based planners and learning-based trajectory scorers generate temporally inconsistent and uncomfortable trajectories, despite being able to produce safe trajectories that mimic expert demonstrations.
Method: Uses sparse perception to extract 3D spatial representations as conditional inputs for a Conditional Denoising Diffusion Probabilistic Model (DDPM)-based motion planner to generate multi-modal trajectories, followed by a dual-stream adaptive trajectory scorer to select the most comfortable trajectory.
Result: Achieves state-of-the-art performance in both comfort and safety, outperforming UniAD by 17% in driving comfort and reducing collision rates by 25% compared to SparseDrive.
Conclusion: ComDrive successfully addresses the challenge of generating temporally consistent and comfortable trajectories in autonomous driving systems through its novel conditional DDPM-based approach and adaptive scoring mechanism.
Abstract: We propose ComDrive: the first comfort-oriented end-to-end autonomous driving system to generate temporally consistent and comfortable trajectories. Recent studies have demonstrated that imitation learning-based planners and learning-based trajectory scorers can effectively generate and select safety trajectories that closely mimic expert demonstrations. However, such trajectory planners and scorers face the challenge of generating temporally inconsistent and uncomfortable trajectories. To address these issues, ComDrive first extracts 3D spatial representations through sparse perception, which then serves as conditional inputs. These inputs are used by a Conditional Denoising Diffusion Probabilistic Model (DDPM)-based motion planner to generate temporally consistent multi-modal trajectories. A dual-stream adaptive trajectory scorer subsequently selects the most comfortable trajectory from these candidates to control the vehicle. Experiments demonstrate that ComDrive achieves state-of-the-art performance in both comfort and safety, outperforming UniAD by 17% in driving comfort and reducing collision rates by 25% compared to SparseDrive. More results are available on our project page: https://jmwang0117.github.io/ComDrive/.
[192] Adversarial Attacks on LiDAR-Based Tracking Across Road Users: Robustness Evaluation and Target-Aware Black-Box Method
Shengjing Tian, Xiantong Zhao, Yuhao Bian, Yinan Han, Bin Liu
Main category: cs.CV
TL;DR: This paper investigates the vulnerability of neural network-based LiDAR point cloud tracking models to adversarial attacks, proposing a unified attack framework and a novel black-box attack method called TAPG that achieves effective attacks while maintaining perturbation concealment.
Details
Motivation: Current LiDAR point cloud tracking models focus on performance enhancement but neglect robustness against adversarial attacks, domain shifts, and data corruption. The study aims to address this critical gap by systematically evaluating model vulnerability.
Method: Established a unified adversarial attack framework for 3D object tracking, extending white-box attacks (FGSM, C&W, PGD) to point clouds, and developed TAPG algorithm for black-box attacks using heuristic sparse constraints and random sub-vector factorization for better transferability.
Result: Experiments revealed significant vulnerability in advanced tracking methods to both black-box and white-box attacks. TAPG demonstrated optimal balance between attack effectiveness and perturbation concealment compared to existing methods.
Conclusion: There is an urgent need to incorporate robustness against adversarial attacks into the design of LiDAR point cloud tracking models, as current state-of-the-art methods show critical security vulnerabilities.
Abstract: In this study, we delve into the robustness of neural network-based LiDAR point cloud tracking models under adversarial attacks, a critical aspect often overlooked in favor of performance enhancement. These models, despite incorporating advanced architectures like Transformer or Bird’s Eye View (BEV), tend to neglect robustness in the face of challenges such as adversarial attacks, domain shifts, or data corruption. We instead focus on the robustness of the tracking models under the threat of adversarial attacks. We begin by establishing a unified framework for conducting adversarial attacks within the context of 3D object tracking, which allows us to thoroughly investigate both white-box and black-box attack strategies. For white-box attacks, we tailor specific loss functions to accommodate various tracking paradigms and extend existing methods such as FGSM, C&W, and PGD to the point cloud domain. In addressing black-box attack scenarios, we introduce a novel transfer-based approach, the Target-aware Perturbation Generation (TAPG) algorithm, with the dual objectives of achieving high attack performance and maintaining low perceptibility. This method employs a heuristic strategy to enforce sparse attack constraints and utilizes random sub-vector factorization to bolster transferability. Our experimental findings reveal a significant vulnerability in advanced tracking methods when subjected to both black-box and white-box attacks, underscoring the necessity for incorporating robustness against adversarial attacks into the design of LiDAR point cloud tracking models. Notably, compared to existing methods, the TAPG also strikes an optimal balance between the effectiveness of the attack and the concealment of the perturbations.
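The white-box part of the framework extends standard attacks such as PGD to point clouds. Below is a minimal sketch of PGD on LiDAR coordinates under an L∞ budget (the TAPG black-box method itself is not shown here):

```python
import torch

def pgd_point_cloud_attack(model, points, loss_fn, eps=0.05, alpha=0.01, steps=10):
    """Extend PGD to LiDAR point clouds: perturb XYZ coordinates within an
    L-infinity ball of radius eps to degrade the tracker's objective."""
    adv = points.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv))               # e.g. negative tracking score
        grad = torch.autograd.grad(loss, adv)[0]
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                      # ascend the attack objective
            adv = points + (adv - points).clamp(-eps, eps)       # project back to the eps-ball
    return adv.detach()
```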
[193] VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu
Main category: cs.CV
TL;DR: VLsI is a new family of vision-language models (2B and 7B sizes) that uses layer-wise distillation with verbalizers to achieve efficiency without sacrificing accuracy, outperforming GPT-4V by 11.0-17.4% without scaling or architectural changes.
Details
Motivation: Address computational challenges of scaling VLMs for resource-constrained devices while maintaining performance, overcoming training instability in output imitation methods.
Method: Layer-wise distillation with intermediate verbalizers that map features from each layer to natural language space, aligning small VLMs’ reasoning processes with larger ones.
Result: Achieved 11.0% improvement for 2B model and 17.4% for 7B model over GPT-4V across ten vision-language benchmarks without model scaling, merging, or architectural changes.
Conclusion: VLsI demonstrates that efficient VLMs can achieve superior performance through layer-wise distillation with verbalizers, providing a scalable solution for resource-constrained deployment.
Abstract: The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate “verbalizers” that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs’ layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
[194] PixelWorld: How Far Are We from Perceiving Everything as Pixels?
Zhiheng Lyu, Xueguang Ma, Wenhu Chen
Main category: cs.CV
TL;DR: PEAP (Perceive Everything as Pixels) explores unified pixel-based perception for multimodal AI, showing comparable semantic understanding to token-based methods but degraded reasoning performance, with PixelWorld benchmark enabling systematic evaluation.
Details
Motivation: Agentic language models need to interact with real-world environments containing intertwined visual and textual information through raw pixels, highlighting the need for a unified perception paradigm beyond separate image and text processing.
Method: Introduces PEAP (Perceive Everything as Pixels) and PixelWorld benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into shared pixel space, using vision transformers to capture semantics without explicit tokenization.
Result: PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, but reasoning-intensive tasks like mathematics and code show notable degradation. Chain-of-Thought prompting helps mitigate this gap by compensating for missing symbolic structure.
Conclusion: Pixel-based representation simplifies preprocessing and avoids cross-modal misalignment when visual/textual information are integrated, providing a practical framework for evaluating unified vision-language models and facilitating pixel-based multimodal learning exploration.
Abstract: Recent agentic language models increasingly need to interact with real-world environments that contain tightly intertwined visual and textual information, often through raw camera pixels rather than separately processed images and tokenized text. This shift highlights the need for a unified perception paradigm. To investigate this idea, we explore Perceive Everything as Pixels (PEAP) and introduce PixelWorld, a benchmark that renders natural-language, tabular, mathematical, and diagrammatic inputs into a shared pixel space. Experiments across multiple benchmarks show that PEAP achieves comparable performance to token-based approaches on semantic understanding tasks, suggesting that vision transformers can partially capture global textual semantics without explicit tokenization. In contrast, reasoning-intensive tasks such as mathematics and code show notable performance degradation, although Chain-of-Thought prompting helps mitigate this gap by compensating for missing symbolic structure. We further find that when visual and textual information are closely integrated, representing everything as pixels simplifies preprocessing and avoids cross-modal misalignment. PixelWorld thus provides a systematic and practical framework for evaluating unified vision–language models and facilitates further exploration of pixel-based multimodal learning.
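Rendering textual inputs into the shared pixel space can be approximated with a few lines of PIL; a hypothetical helper in the spirit of PEAP (layout choices are placeholders):

```python
from PIL import Image, ImageDraw, ImageFont

def text_to_pixels(text: str, width=640, height=480, font_size=16) -> Image.Image:
    """Render a text prompt into an image so a VLM perceives it as pixels."""
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()        # swap in a TTF font for higher quality
    margin, line_height = 10, font_size + 4
    y = margin
    for line in text.splitlines():
        draw.text((margin, y), line, fill="black", font=font)
        y += line_height
    return img

# text_to_pixels("What is 12 * 7?\nAnswer with a number.").save("prompt.png")
```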
[195] MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Swadhin Das, Raksha Sharma
Main category: cs.CV
TL;DR: Proposes a Multi-stream Encoder-decoder Framework (MsEdF) for remote sensing image captioning that improves performance by optimizing spatial representation and language generation through complementary image encoders and enhanced semantic modeling.
Details
Motivation: Existing single-stream architectures for remote sensing image captioning struggle with complex spatial patterns and semantic structures, limiting their ability to accurately describe images with high intraclass similarity or contextual ambiguity.
Method: Uses a multi-stream encoder that fuses information from two complementary image encoders to promote feature diversity through multiscale and structurally distinct cues. On the decoder side, employs a stacked GRU architecture with element-wise aggregation for improved semantic modeling.
Result: Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
Conclusion: The proposed multi-stream framework effectively addresses the limitations of single-stream architectures in remote sensing image captioning by improving both spatial feature extraction and semantic modeling.
Abstract: Remote sensing images contain complex spatial patterns and semantic structures, which makes them difficult for captioning models to describe accurately. Encoder-decoder architectures have become the widely used approach for RSIC by translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model’s ability to describe the image accurately. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and the language generation of the encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence’s semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
[196] Probing Perceptual Constancy in Large Vision-Language Models
Haoran Sun, Bingyang Wang, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Hokin Deng, Dezhi Luo
Main category: cs.CV
TL;DR: The paper evaluates perceptual constancy abilities in 155 Vision Language Models (VLMs) across color, size, and shape constancy domains using 236 experiments including classic cognitive tasks and real-world scenarios.
Details
Motivation: To explore whether current Vision Language Models possess perceptual constancy - the ability to maintain stable object perceptions despite changes in sensory input like distance, angle, or lighting conditions, which is crucial for visual understanding in dynamic environments.
Method: Evaluated 155 VLMs using 236 experiments across three domains: color, size, and shape constancy. Experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions.
Result: Found significant variability in VLM performance across domains, with model performance in shape constancy clearly dissociated from that of color and size constancy.
Conclusion: Current VLMs exhibit varying levels of perceptual constancy abilities, with shape constancy performance showing distinct patterns compared to color and size constancy, highlighting areas for improvement in visual understanding capabilities.
Abstract: Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for visual understanding in a dynamic world. Here, we explored such ability in current Vision Language Models (VLMs). In this study, we evaluated 155 VLMs using 236 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions. We found significant variability in VLM performance across these domains, with model performance in shape constancy clearly dissociated from that of color and size constancy.
[197] FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views
Shangzan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, Gordon Wetzstein
Main category: cs.CV
TL;DR: FLARE is a feed-forward model that estimates camera poses and reconstructs 3D geometry from sparse-view images (2-8 inputs) using a cascaded learning approach with camera pose as the central bridge between geometry and appearance learning.
Details
Motivation: Address the challenging practical problem of inferring camera poses and 3D geometry from uncalibrated sparse-view images, which is common in real-world applications but difficult due to limited input data.
Method: Uses a cascaded learning paradigm where camera pose estimation comes first and conditions subsequent geometric structure and appearance learning. Optimized through geometry reconstruction and novel-view synthesis objectives, trained on large-scale public datasets.
Result: Achieves state-of-the-art performance in pose estimation, geometry reconstruction, and novel view synthesis while maintaining high inference efficiency (less than 0.5 seconds).
Conclusion: FLARE provides an effective solution for sparse-view 3D reconstruction by leveraging camera pose as a critical bridge in a cascaded learning framework, demonstrating strong performance across multiple tasks with efficient inference.
Abstract: We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images (i.e., as few as 2-8 inputs), which is a challenging yet practical setting in real-world applications. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes. Concretely, FLARE starts with camera pose estimation, whose results condition the subsequent learning of geometric structure and appearance, optimized through the objectives of geometry reconstruction and novel-view synthesis. Utilizing large-scale public datasets for training, our method delivers state-of-the-art performance in the tasks of pose estimation, geometry reconstruction, and novel view synthesis, while maintaining the inference efficiency (i.e., less than 0.5 seconds). The project page and code can be found at: https://zhanghe3z.github.io/FLARE/
[198] Rethinking Multimodal Learning from the Perspective of Mitigating Classification Ability Disproportion
QingYuan Jiang, Longfei Huang, Yang Yang
Main category: cs.CV
TL;DR: Proposes a novel multimodal learning approach using boosting principles to dynamically balance classification ability between weak and strong modalities, addressing modality imbalance by optimizing classification and residual errors simultaneously.
Details
Motivation: Existing multimodal learning approaches overlook the inherent disproportion in model classification ability as the primary cause of modality imbalance, leading to suboptimal performance.
Method: Uses sustained boosting algorithm to simultaneously optimize classification and residual errors, with adaptive classifier assignment strategy to dynamically improve weak modality performance. Theoretically analyzes convergence of cross-modal gap function.
Result: Empirical experiments on widely used datasets show superiority over various state-of-the-art multimodal learning baselines.
Conclusion: The proposed method effectively balances classification ability between strong and weak modalities, mitigating the modality imbalance issue in multimodal learning.
Abstract: Multimodal learning (MML) is significantly constrained by modality imbalance, leading to suboptimal performance in practice. While existing approaches primarily focus on balancing the learning of different modalities to address this issue, they fundamentally overlook the inherent disproportion in model classification ability, which serves as the primary cause of this phenomenon. In this paper, we propose a novel multimodal learning approach to dynamically balance the classification ability of weak and strong modalities by incorporating the principle of boosting. Concretely, we first propose a sustained boosting algorithm in multimodal learning by simultaneously optimizing the classification and residual errors. Subsequently, we introduce an adaptive classifier assignment strategy to dynamically facilitate the classification performance of the weak modality. Furthermore, we theoretically analyze the convergence property of the cross-modal gap function, ensuring the effectiveness of the proposed boosting scheme. To this end, the classification ability of strong and weak modalities is expected to be balanced, thereby mitigating the imbalance issue. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SOTA) multimodal learning baselines. The source code is available at https://github.com/njustkmg/NeurIPS25-AUG.
[199] Towards Enhanced Image Generation Via Multi-modal Chain of Thought in Unified Generative Models
Yi Wang, Mushui Liu, Wanggui He, Hanyang Yuan, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Wenkai Fang, Haoze Jiang, Shengxuming Zhang, Dong She, Jinlong Liu, Weilong Dai, Mingli Song, Hao Jiang, Jie Song
Main category: cs.CV
TL;DR: The paper introduces FoX, a unified generative model that uses Functionality-oriented eXperts (FoXperts) and Multimodal Chain of Thought (MCoT) to improve complex image generation by emulating human artistic workflows.
Details
Motivation: Direct text-to-image generation struggles with complex compositional instructions in real-world scenarios. Existing models focus on basic image generation but fail to adequately handle complex instructions.
Method: Proposes FoX with Functionality-oriented eXperts (FoXperts) architecture and Multimodal Chain of Thought (MCoT) approach that follows planning, acting, reflection, and correction steps. Uses multi-task joint training to equip the model with all MCoT capabilities.
Result: FoX consistently outperforms existing unified models on various T2I benchmarks, showing notable improvements in complex image generation.
Conclusion: The proposed FoX model with FoXperts architecture and MCoT approach effectively addresses complex image generation challenges that direct T2I methods cannot solve.
Abstract: Unified generative models have shown remarkable performance in text and image generation. For image synthesis tasks, they adopt straightforward text-to-image (T2I) generation. However, direct T2I generation limits the models in handling complex compositional instructions, which frequently occur in real-world scenarios. Although this issue is vital, existing works mainly focus on improving the basic image generation capability of the models. While such improvements help to some extent, they still fail to adequately resolve the problem. Inspired by Chain of Thought (CoT) solving complex problems step by step, this work aims to introduce CoT into unified generative models to address the challenges of complex image generation that direct T2I generation cannot effectively solve, thereby endowing models with enhanced image generation ability. To achieve this, we first propose Functionality-oriented eXperts (FoXperts), an expert-parallel architecture in our model FoX, which assigns experts by function. FoXperts disentangles potential conflicts in mainstream modality-oriented designs and provides a solid foundation for CoT. When introducing CoT, the first question is how to design it for complex image generation. To this end, we emulate a human-like artistic workflow – planning, acting, reflection, and correction – and propose the Multimodal Chain of Thought (MCoT) approach, as the data involves both text and image. To address the subsequent challenge – designing an effective MCoT training paradigm – we develop a multi-task joint training scheme that equips the model with all capabilities required for each MCoT step in a disentangled manner. This paradigm avoids the difficulty of collecting consistent multi-step data tuples. Extensive experiments show that FoX consistently outperforms existing unified models on various T2I benchmarks, delivering notable improvements in complex image generation.
[200] DitHub: A Modular Framework for Incremental Open-Vocabulary Object Detection
Chiara Cappellino, Gianluca Mancusi, Matteo Mosconi, Angelo Porrello, Simone Calderara, Rita Cucchiara
Main category: cs.CV
TL;DR: DitHub is a modular framework for open-vocabulary object detection that manages adaptation modules like version control branches, enabling flexible composition and achieving SOTA performance on benchmarks.
Details
Motivation: To address the need for adapting open-vocabulary detectors to rare classes and specialized domains, overcoming limitations of monolithic adaptation strategies with single weight sets.
Method: Uses modular deep learning with a version control system-inspired approach, managing expert modules as branches that can be fetched and merged as needed.
Result: Achieves state-of-the-art performance on ODinW-13 benchmark and newly introduced ODinW-O benchmark for class reappearance assessment.
Conclusion: DitHub’s modular approach enables effective adaptation composition and sets new performance standards in open-vocabulary object detection.
Abstract: Open-Vocabulary object detectors can generalize to an unrestricted set of categories through simple textual prompting. However, adapting these models to rare classes or reinforcing their abilities on multiple specialized domains remains essential. While recent methods rely on monolithic adaptation strategies with a single set of weights, we embrace modular deep learning. We introduce DitHub, a framework designed to build and maintain a library of efficient adaptation modules. Inspired by Version Control Systems, DitHub manages expert modules as branches that can be fetched and merged as needed. This modular approach allows us to conduct an in-depth exploration of the compositional properties of adaptation modules, marking the first such study in Object Detection. Our method achieves state-of-the-art performance on the ODinW-13 benchmark and ODinW-O, a newly introduced benchmark designed to assess class reappearance. For more details, visit our project page: https://aimagelab.github.io/DitHub/
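As a rough illustration of the branch-like module library idea, here is a hypothetical sketch: adapters are committed under names, fetched on demand, and merged by parameter averaging. The class, the adapter layout, and the averaging merge rule are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical module library managed like version-control branches.
import torch

class ModuleHub:
    def __init__(self):
        self.branches = {}                               # branch name -> adapter state dict

    def commit(self, name, adapter_state):
        self.branches[name] = {k: v.clone() for k, v in adapter_state.items()}

    def fetch(self, name):
        return self.branches[name]

    def merge(self, names):
        # Merge several expert adapters by simple parameter averaging.
        merged = {}
        for key in self.branches[names[0]]:
            merged[key] = torch.stack([self.branches[n][key] for n in names]).mean(0)
        return merged

hub = ModuleHub()
hub.commit("zebra", {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)})
hub.commit("night_scenes", {"lora_A": torch.randn(8, 256), "lora_B": torch.randn(256, 8)})
composed_adapter = hub.merge(["zebra", "night_scenes"])  # compose experts for a new setting
```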
[201] Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, Xiangyu Yue
Main category: cs.CV
TL;DR: Video-R1 applies rule-based RL (R1 paradigm) to video reasoning in MLLMs, addressing temporal modeling and data scarcity with T-GRPO algorithm and mixed image-video datasets, achieving state-of-the-art performance.
Details
Motivation: Inspired by DeepSeek-R1's success in eliciting reasoning through RL, the authors aim to extend the R1 paradigm to video reasoning in multimodal LLMs, addressing challenges in temporal modeling and data scarcity.
Method: Proposed T-GRPO algorithm for temporal modeling in video reasoning, and used mixed image-video datasets (Video-R1-CoT-165k for SFT, Video-R1-260k for RL) to overcome data scarcity.
Result: Video-R1 achieves significant improvements on video reasoning benchmarks (VideoMMMU, VSI-Bench) and general video benchmarks (MVBench, TempCompass). Video-R1-7B attains 37.1% accuracy on VSI-bench, surpassing GPT-4o.
Conclusion: The R1 paradigm can be successfully extended to video reasoning through temporal modeling and mixed data training, achieving state-of-the-art performance with open-source release of code, models, and data.
Abstract: Inspired by DeepSeek-R1’s success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 37.1% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released in: https://github.com/tulerfeng/Video-R1.
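For context, a GRPO-style update normalizes rewards within a group of sampled responses. The sketch below adds a temporal term that rewards responses scoring better with ordered frames than with shuffled frames, which is one plausible reading of "encouraging models to utilize temporal information"; the actual T-GRPO formulation is defined in the paper, so treat this as an assumption-laden illustration.

```python
# Group-normalized advantages plus an assumed temporal-contrast bonus.
import torch

def grpo_advantages(rewards, eps=1e-6):
    # rewards: (G,) scalar rewards for G sampled responses to the same prompt
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def temporal_bonus(rewards_ordered, rewards_shuffled, alpha=0.3):
    # Bonus when the group scores higher on temporally ordered frames than on
    # shuffled frames, i.e. when answers actually depend on temporal order.
    return alpha * (rewards_ordered.mean() > rewards_shuffled.mean()).float()

rewards_ordered = torch.tensor([1.0, 0.0, 1.0, 1.0])
rewards_shuffled = torch.tensor([0.0, 0.0, 1.0, 0.0])
advantages = grpo_advantages(rewards_ordered) + temporal_bonus(rewards_ordered, rewards_shuffled)
```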
[202] WikiVideo: Article Generation from Multiple Videos
Alexander Martin, Reno Kriz, William Gantt Walden, Kate Sanders, Hannah Recknor, Eugene Yang, Francis Ferraro, Benjamin Van Durme
Main category: cs.CV
TL;DR: WikiVideo introduces a benchmark for generating Wikipedia-style articles from multiple videos, with all claims grounded in video evidence, and proposes Collaborative Article Generation (CAG) method that outperforms existing VideoLLMs.
Details
Motivation: To address the gap in video-based RAG systems that focus on low-level scene understanding rather than high-level event semantics, enabling creation of in-depth content grounded in multimodal sources.
Method: Proposes Collaborative Article Generation (CAG) - an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about events from multiple videos.
Result: CAG consistently outperforms state-of-the-art VideoLLMs in both oracle retrieval and RAG settings, achieving better article generation from video evidence.
Conclusion: The approach enables effective grounded article generation from videos and suggests promising directions for future multimodal RAG research.
Abstract: We introduce the task of grounded article generation with the goal of creating a Wikipedia-style article from multiple diverse videos about real-world events – from natural disasters to political elections – where all the information in the article is supported by video evidence. Videos are intuitive sources for retrieval-augmented generation (RAG), but most contemporary RAG workflows focus heavily on text while existing methods for video-based summarization focus on low-level scene understanding rather than high-level event semantics. To close this gap, we introduce WikiVideo, a benchmark consisting of expert-written articles and densely annotated videos that provide evidence for articles’ claims, facilitating the integration of video into RAG pipelines and enabling the creation of in-depth content that is grounded in multimodal sources. We further propose Collaborative Article Generation (CAG), a novel interactive method for article creation from multiple videos. CAG leverages an iterative interaction between an r1-style reasoning model and a VideoLLM to draw higher-level inferences about the target event than is possible with VideoLLMs alone, which fixate on low-level visual features. We benchmark state-of-the-art VideoLLMs and CAG in both oracle retrieval and RAG settings and find that CAG consistently outperforms alternative methods, while suggesting intriguing avenues for future work.
[203] MMLA: Multi-Environment, Multi-Species, Low-Altitude Drone Dataset
Jenna Kline, Samuel Stevens, Guy Maalouf, Camille Rondeau Saint-Jean, Dat Nguyen Ngoc, Majid Mirmehdi, David Guerin, Tilo Burghardt, Elzbieta Pastucha, Blair Costelloe, Matthew Watson, Thomas Richardson, Ulrik Pagh Schultz Lundquist
Main category: cs.CV
TL;DR: MMLA dataset enables robust wildlife detection in drone imagery across multiple environments and species, with fine-tuned YOLOv11m achieving 82% mAP50.
Details
Motivation: Standard detection models like YOLO fail to generalize across locations and struggle with rare species, limiting automated drone deployments for ecological monitoring.
Method: Created MMLA dataset with 811K annotations from 37 high-resolution videos across three sites, featuring six species. Fine-tuned YOLOv11m on this diverse dataset.
Result: Fine-tuned YOLOv11m achieved 82% mAP50, a 52-point improvement over baseline models that showed performance disparities across locations.
Conclusion: Diverse training data is essential for robust animal detection in autonomous drone systems to overcome location-specific and species-specific generalization challenges.
Abstract: Real-time wildlife detection in drone imagery supports critical ecological and conservation monitoring. However, standard detection models like YOLO often fail to generalize across locations and struggle with rare species, limiting their use in automated drone deployments. We present MMLA, a novel multi-environment, multi-species, low-altitude drone dataset collected across three sites (Ol Pejeta Conservancy and Mpala Research Centre in Kenya, and The Wilds in Ohio), featuring six species (zebras, giraffes, onagers, and African wild dogs). The dataset contains 811K annotations from 37 high-resolution videos. Baseline YOLO models show performance disparities across locations while fine-tuning YOLOv11m on MMLA improves mAP50 to 82%, a 52-point gain over baseline. Our results underscore the need for diverse training data to enable robust animal detection in autonomous drone systems.
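Fine-tuning a YOLOv11m detector on a dataset like this is a few lines with the Ultralytics API. The sketch below assumes the MMLA annotations have been exported to a YOLO-format dataset described by a hypothetical mmla.yaml; the epoch count and image size are illustrative, not the paper's settings.

```python
# Fine-tuning sketch with the Ultralytics API (hyperparameters illustrative).
from ultralytics import YOLO

model = YOLO("yolo11m.pt")                         # pretrained YOLOv11m weights
model.train(data="mmla.yaml", epochs=100, imgsz=640)
metrics = model.val()                              # validation metrics after training
print(metrics.box.map50)                           # mAP at IoU 0.50
```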
[204] REPA-E: Unlocking VAE for End-to-End Tuning with Latent Diffusion Transformers
Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, Liang Zheng
Main category: cs.CV
TL;DR: End-to-end training of latent diffusion models with VAE tokenizer is enabled through representation-alignment (REPA) loss, achieving 17-45x speedup and state-of-the-art performance on ImageNet.
Details
Motivation: Traditional wisdom suggests end-to-end training is preferable, but standard diffusion loss fails for joint VAE-diffusion training, causing performance degradation. The paper aims to unlock effective end-to-end training.
Method: Proposes REPA-E training recipe using representation-alignment loss instead of standard diffusion loss to jointly train VAE and diffusion model end-to-end.
Result: Achieves 17x and 45x training speedup over REPA and vanilla methods respectively. Sets new SOTA with FID 1.12 and 1.69 on ImageNet 256x256. Also improves VAE latent space structure.
Conclusion: End-to-end training of latent diffusion models with VAE is feasible and beneficial when using REPA loss, leading to significant performance improvements and training efficiency gains.
Abstract: In this paper we tackle a fundamental question: “Can we train latent diffusion models together with the variational auto-encoder (VAE) tokenizer in an end-to-end manner?” Traditional deep-learning wisdom dictates that end-to-end training is often preferable when possible. However, for latent diffusion transformers, it is observed that end-to-end training both VAE and diffusion-model using standard diffusion-loss is ineffective, even causing a degradation in final performance. We show that while diffusion loss is ineffective, end-to-end training can be unlocked through the representation-alignment (REPA) loss – allowing both VAE and diffusion model to be jointly tuned during the training process. Despite its simplicity, the proposed training recipe (REPA-E) shows remarkable performance; speeding up diffusion model training by over 17x and 45x over REPA and vanilla training recipes, respectively. Interestingly, we observe that end-to-end tuning with REPA-E also improves the VAE itself; leading to improved latent space structure and downstream generation performance. In terms of final performance, our approach sets a new state-of-the-art; achieving FID of 1.12 and 1.69 with and without classifier-free guidance on ImageNet 256 x 256. Code is available at https://end2end-diffusion.github.io.
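The objective combines a denoising term with a representation-alignment term that pulls intermediate diffusion features toward a frozen pretrained encoder's features. The sketch below is a schematic of that combination under assumed shapes and a simple projection head; consult the released code for the actual recipe.

```python
# Schematic REPA-style objective for joint VAE + diffusion tuning (illustrative).
import torch
import torch.nn.functional as F

def repa_e_loss(pred_noise, true_noise, diff_feats, frozen_feats, proj, lam=0.5):
    # diff_feats:   (B, N, D)  intermediate features from the diffusion transformer
    # frozen_feats: (B, N, D') patch features from a frozen pretrained visual encoder
    denoise = F.mse_loss(pred_noise, true_noise)
    aligned = proj(diff_feats)                                      # project to encoder dim
    align = 1.0 - F.cosine_similarity(aligned, frozen_feats, dim=-1).mean()
    return denoise + lam * align

proj = torch.nn.Linear(384, 768)
loss = repa_e_loss(torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32),
                   torch.randn(2, 256, 384), torch.randn(2, 256, 768), proj)
```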
[205] 3D Visual Illusion Depth Estimation
Chengtang Yao, Zhidan Liu, Jiaxi Zeng, Lidong Yu, Yuwei Wu, Yunde Jia
Main category: cs.CV
TL;DR: This paper reveals that machine vision systems are fooled by 3D visual illusions in depth estimation tasks, and proposes a framework that uses vision language model common sense to adaptively fuse binocular and monocular depth information, achieving state-of-the-art performance.
Details
Motivation: To investigate whether machine visual systems are susceptible to 3D visual illusions like human perception, and to develop methods that can overcome these illusions in depth estimation tasks.
Method: Collected a large dataset with 3k scenes and 200k images, trained and evaluated SOTA depth estimation methods, and proposed a 3D visual illusion depth estimation framework that adaptively fuses depth from binocular disparity and monocular depth using common sense from vision language models.
Result: Experiments showed that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while the proposed method achieves state-of-the-art performance.
Conclusion: Machine visual systems are indeed susceptible to 3D visual illusions in depth estimation, but the proposed adaptive fusion framework using vision language model common sense can effectively overcome these illusions and achieve superior performance.
Abstract: 3D visual illusion is a perceptual phenomenon where a two-dimensional plane is manipulated to simulate three-dimensional spatial relationships, making a flat artwork or object look three-dimensional in the human visual system. In this paper, we reveal that the machine visual system is also seriously fooled by 3D visual illusions, including monocular and binocular depth estimation. In order to explore and analyze the impact of 3D visual illusion on depth estimation, we collect a large dataset containing almost 3k scenes and 200k images to train and evaluate SOTA monocular and binocular depth estimation methods. We also propose a 3D visual illusion depth estimation framework that uses common sense from the vision language model to adaptively fuse depth from binocular disparity and monocular depth. Experiments show that SOTA monocular, binocular, and multi-view depth estimation approaches are all fooled by various 3D visual illusions, while our method achieves SOTA performance.
[206] Spiking Neural Networks Need High Frequency Information
Yuetong Fang, Deming Zhou, Ziqing Wang, Hongwei Ren, ZeCui Zeng, Lusong Li, Shibo Zhou, Renjing Xu
Main category: cs.CV
TL;DR: Spiking Neural Networks (SNNs) have a frequency bias that suppresses high-frequency components, degrading performance. Max-Former addresses this with frequency-enhancing operators (Max-Pool and Depth-Wise Convolution), achieving state-of-the-art results on ImageNet and CIFAR benchmarks.
Details
Motivation: Challenge the assumption that SNNs' performance lag is due to binary activations, and instead identify frequency bias as the root cause of degraded feature representation.
Method: Introduce Max-Former with two frequency-enhancing operators: extra Max-Pool in patch embedding and Depth-Wise Convolution replacing self-attention. Also apply similar principles to Max-ResNet-18.
Result: Max-Former achieves 82.39% top-1 accuracy on ImageNet with 63.99M parameters, surpassing Spikformer by +7.58%. Max-ResNet-18 achieves 97.17% on CIFAR-10 and 83.06% on CIFAR-100.
Conclusion: Frequency bias, not binary activations, is the main performance bottleneck in SNNs. Simple frequency-enhancing operators can significantly improve SNN performance across different architectures.
Abstract: Spiking Neural Networks promise brain-inspired and energy-efficient computation by transmitting information through binary (0/1) spikes. Yet, their performance still lags behind that of artificial neural networks, often assumed to result from information loss caused by sparse and binary activations. In this work, we challenge this long-standing assumption and reveal a previously overlooked frequency bias: spiking neurons inherently suppress high-frequency components and preferentially propagate low-frequency information. This frequency-domain imbalance, we argue, is the root cause of degraded feature representation in SNNs. Empirically, on Spiking Transformers, adopting Avg-Pooling (low-pass) for token mixing lowers performance to 76.73% on Cifar-100, whereas replacing it with Max-Pool (high-pass) pushes the top-1 accuracy to 79.12%. Accordingly, we introduce Max-Former that restores high-frequency signals through two frequency-enhancing operators: (1) extra Max-Pool in patch embedding, and (2) Depth-Wise Convolution in place of self-attention. Notably, Max-Former attains 82.39% top-1 accuracy on ImageNet using only 63.99M parameters, surpassing Spikformer (74.81%, 66.34M) by +7.58%. Extending our insight beyond transformers, our Max-ResNet-18 achieves state-of-the-art performance on convolution-based benchmarks: 97.17% on CIFAR-10 and 83.06% on CIFAR-100. We hope this simple yet effective solution inspires future research to explore the distinctive nature of spiking neural networks. Code is available: https://github.com/bic-L/MaxFormer.
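The two frequency-enhancing operators are simple to express as modules. Below is a toy PyTorch sketch, assuming ordinary (non-spiking) activations for brevity: an extra max-pool in the patch embedding and a depth-wise convolution token mixer in place of self-attention; channel sizes and kernel choices are illustrative.

```python
# Toy sketch of the two frequency-enhancing operators (spiking dynamics omitted).
import torch
import torch.nn as nn

class MaxPatchEmbed(nn.Module):
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2)      # max-pool keeps high-frequency peaks

    def forward(self, x):
        return self.pool(self.proj(x))

class DWConvMixer(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)  # depth-wise
        self.pw = nn.Conv2d(dim, dim, kernel_size=1)                         # point-wise

    def forward(self, x):
        return x + self.pw(self.dw(x))               # residual token mixing, no attention

feats = DWConvMixer()(MaxPatchEmbed()(torch.randn(1, 3, 32, 32)))
```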
[207] See through the Dark: Learning Illumination-affined Representations for Nighttime Occupancy Prediction
Yuan Wu, Zhiqiang Yan, Yigong Zhang, Xiang Li, Jian Yang
Main category: cs.CV
TL;DR: LIAR is a novel framework for nighttime occupancy prediction that learns illumination-affined representations through selective low-light image enhancement and illumination-aware components to handle challenging lighting conditions.
Details
Motivation: Existing vision-based occupancy prediction methods perform well in daytime but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions, creating a need for specialized nighttime solutions.
Method: LIAR introduces Selective Low-light Image Enhancement (SLLIE) to adaptively enhance nighttime images using daytime illumination priors, followed by 2D Illumination-guided Sampling (2D-IGS) to handle local underexposure and 3D Illumination-driven Projection (3D-IDP) to address overexposure through illumination intensity fields and residual queries.
Result: Extensive experiments on both real and synthetic datasets demonstrate LIAR’s superior performance in challenging nighttime scenarios compared to existing methods.
Conclusion: LIAR effectively addresses nighttime occupancy prediction challenges through illumination-aware representations and demonstrates strong performance across various datasets, with code and models publicly available.
Abstract: Occupancy prediction aims to estimate the 3D spatial distribution of occupied regions along with their corresponding semantic labels. Existing vision-based methods perform well on daytime benchmarks but struggle in nighttime scenarios due to limited visibility and challenging lighting conditions. To address these challenges, we propose LIAR, a novel framework that learns illumination-affined representations. LIAR first introduces Selective Low-light Image Enhancement (SLLIE), which leverages the illumination priors from daytime scenes to adaptively determine whether a nighttime image is genuinely dark or sufficiently well-lit, enabling more targeted global enhancement. Building on the illumination maps generated by SLLIE, LIAR further incorporates two illumination-aware components: 2D Illumination-guided Sampling (2D-IGS) and 3D Illumination-driven Projection (3D-IDP), to respectively tackle local underexposure and overexposure. Specifically, 2D-IGS modulates feature sampling positions according to illumination maps, assigning larger offsets to darker regions and smaller ones to brighter regions, thereby alleviating feature degradation in underexposed areas. Subsequently, 3D-IDP enhances semantic understanding in overexposed regions by constructing illumination intensity fields and supplying refined residual queries to the BEV context refinement process. Extensive experiments on both real and synthetic datasets demonstrate the superior performance of LIAR under challenging nighttime scenarios. The source code and pretrained models are available here.
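To make the 2D-IGS idea concrete, here is an illustrative sketch in which learned sampling offsets are scaled by darkness, so features in under-exposed regions are gathered from a wider neighborhood. The offset predictor, the linear darkness scaling, and all shapes are assumptions, not the paper's exact formulation.

```python
# Illustrative illumination-guided feature sampling (assumed formulation).
import torch
import torch.nn.functional as F

def illumination_guided_sample(feat, illum, offset):
    # feat:   (B, C, H, W) feature map
    # illum:  (B, 1, H, W) illumination map in [0, 1], 1 = bright
    # offset: (B, 2, H, W) learned offsets in normalized coordinates
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)        # identity sampling grid
    scale = (1.0 - illum).permute(0, 2, 3, 1)                      # darker -> larger offsets
    grid = base + scale * offset.permute(0, 2, 3, 1)
    return F.grid_sample(feat, grid, align_corners=True)

out = illumination_guided_sample(torch.randn(1, 8, 16, 16),
                                 torch.rand(1, 1, 16, 16),
                                 0.1 * torch.randn(1, 2, 16, 16))
```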
[208] Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition
Yu Li, Jin Jiang, Jianhua Zhu, Shuai Peng, Baole Wei, Yuxuan Zhou, Liangcai Gao
Main category: cs.CV
TL;DR: Uni-MuMER fine-tunes a vision-language model for handwritten math expression recognition, achieving state-of-the-art performance through three integrated tasks: Tree-CoT for spatial reasoning, EDL for reducing character confusion, and SC for consistency in long expressions.
Details
Motivation: HMER is challenging due to free symbol layouts and handwriting variability. Previous methods had isolated architectural modifications that were hard to integrate, while pretrained VLMs offer strong cross-task generalization for unified solutions.
Method: Fully fine-tunes a VLM without architectural changes, integrating three data-driven tasks: Tree-Aware Chain-of-Thought for structured spatial reasoning, Error-Driven Learning to reduce confusion among similar characters, and Symbol Counting for recognition consistency in long expressions.
Result: Achieves super state-of-the-art performance on CROHME and HME100K datasets, outperforming best lightweight specialized model SSAN by 16.31% and top-performing VLM Gemini2.5-flash by 24.42% under zero-shot setting.
Conclusion: Uni-MuMER demonstrates that fully fine-tuning VLMs without architectural modifications can effectively inject domain-specific knowledge and achieve superior performance in HMER tasks through integrated data-driven approaches.
Abstract: Handwritten Mathematical Expression Recognition (HMER) remains a persistent challenge in Optical Character Recognition (OCR) due to the inherent freedom of symbol layouts and variability in handwriting styles. Prior methods have faced performance bottlenecks by proposing isolated architectural modifications, making them difficult to integrate coherently into a unified framework. Meanwhile, recent advances in pretrained vision-language models (VLMs) have demonstrated strong cross-task generalization, offering a promising foundation for developing unified solutions. In this paper, we introduce Uni-MuMER, which fully fine-tunes a VLM for the HMER task without modifying its architecture, effectively injecting domain-specific knowledge into a generalist framework. Our method integrates three data-driven tasks: Tree-Aware Chain-of-Thought (Tree-CoT) for structured spatial reasoning, Error-Driven Learning (EDL) for reducing confusion among visually similar characters, and Symbol Counting (SC) for improving recognition consistency in long expressions. Experiments on the CROHME and HME100K datasets show that Uni-MuMER achieves super state-of-the-art performance, outperforming the best lightweight specialized model SSAN by 16.31% and the top-performing VLM Gemini2.5-flash by 24.42% under zero-shot setting. Our datasets, models, and code are open-sourced at: https://github.com/BFlameSwift/Uni-MuMER
[209] Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors
Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang
Main category: cs.CV
TL;DR: VG LLM enables 3D scene understanding directly from videos without explicit 3D data inputs, achieving competitive performance with state-of-the-art methods.
Details
Motivation: To enhance MLLMs' capability for 3D spatial reasoning directly from video data, eliminating the need for additional 3D inputs like point clouds or BEV maps.
Method: Proposes Video-3D Geometry LLM (VG LLM) using a 3D visual geometry encoder to extract 3D prior information from video sequences, integrated with visual tokens into the MLLM.
Result: Achieves substantial improvements in 3D scene understanding and spatial reasoning tasks; 4B model surpasses Gemini-1.5-Pro in VSI-Bench evaluations without explicit 3D data.
Conclusion: VG LLM demonstrates effective 3D reasoning directly from videos, offering a more efficient alternative to methods requiring comprehensive 3D data inputs.
Abstract: Previous research has investigated the application of Multimodal Large Language Models (MLLMs) in understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird’s-Eye View (BEV) maps. In our research, we advance this field by enhancing the capability of MLLMs to understand and reason in 3D spaces directly from video data, without the need for additional 3D input. We propose a novel and efficient method called the Video-3D Geometry Large Language Model (VG LLM). Our approach utilizes a 3D visual geometry encoder to extract 3D prior information from video sequences. This information is then integrated with visual tokens and input into the MLLM. Extensive experiments have shown that our method has achieved substantial improvements in various tasks related to 3D scene understanding and spatial reasoning, all directly learned from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves competitive results compared to existing state-of-the-art methods, and even surpasses the Gemini-1.5-Pro in the VSI-Bench evaluations.
[210] Investigating the Relationship between the Weighted Figure of Merit and Rosin’s Measure
Bimal Kumar Ray
Main category: cs.CV
TL;DR: This paper investigates whether weighted figure of merit can substitute Rosin’s measure for evaluating polygonal approximation schemes, finding they are theoretically independent and uncorrelated.
Details
Motivation: Researchers have been using weighted figure of merit as a substitute for Rosin's measure to compare suboptimal polygonal approximation schemes, but it's unclear if these measures are actually related and interchangeable.
Method: The study uses theoretical analysis (mathematical formulas and theorem proofs), experimental investigation with public datasets, and statistical analysis using Pearson's correlation coefficient and non-linear correlation measures.
Result: Theoretical analysis shows the two measures are independent, experimental graphical analysis supports this finding, and statistical analysis confirms they are uncorrelated.
Conclusion: Weighted figure of merit cannot be used as a substitute for Rosin’s measure since they are independent and uncorrelated - conclusions drawn from one measure cannot be reliably transferred to the other.
Abstract: Many studies have been conducted to solve the problem of approximating a digital boundary by piecewise straight-line segments for the further processing required in computer vision applications. The authors of these studies compared their schemes to determine the best one. The initial measure used to assess the goodness of fit of a polygonal approximation was the figure of merit. Later, it was noted that this measure was not an appropriate metric for a valid reason, which is why Rosin, through mathematical analysis, introduced a measure called merit. However, this measure involves an optimal scheme of polygonal approximation, so it is time-consuming to compute when assessing the goodness of fit of an approximation. This led many researchers to use a weighted figure of merit as a substitute for Rosin's measure to compare suboptimal schemes. An attempt is made in this communication to investigate whether the two measures (weighted figure of merit and Rosin's measure) are related, so that one can be used instead of the other; toward this end, theoretical analysis, experimental investigation, and statistical analysis are carried out. The mathematical formulas for the weighted figure of merit and Rosin's measure are analyzed, and through proof of theorems, it is found that the two measures are theoretically independent of each other. The graphical analysis of experiments carried out using a public dataset supports the results of the theoretical analysis. The statistical analysis via Pearson's correlation coefficient and a non-linear correlation measure also revealed that the two measures are uncorrelated. This analysis leads one to conclude that if a suboptimal scheme is found to be better (or worse) than some other suboptimal scheme, as indicated by Rosin's measure, then the same conclusion cannot be drawn using a weighted figure of merit, so one cannot use a weighted figure of merit instead of Rosin's measure.
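The statistical part of such a study reduces to correlating two per-scheme score vectors. The sketch below uses synthetic numbers and reports Pearson's r (linear association) plus Spearman's rho as a simple rank-based stand-in; the paper uses its own non-linear correlation measure, which this does not reproduce.

```python
# Correlating two goodness-of-fit measures over many schemes (synthetic data).
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
weighted_fom = rng.uniform(0.1, 1.0, size=50)     # per-scheme weighted figure of merit
rosin_measure = rng.uniform(0.1, 1.0, size=50)    # per-scheme Rosin's measure

r, p_r = pearsonr(weighted_fom, rosin_measure)
rho, p_rho = spearmanr(weighted_fom, rosin_measure)
print(f"Pearson r = {r:.3f} (p = {p_r:.3f}), Spearman rho = {rho:.3f} (p = {p_rho:.3f})")
```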
[211] Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback
Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm
Main category: cs.CV
TL;DR: MAGIC framework uses AI-expert collaboration to generate medically accurate skin disease images for data augmentation, improving diagnostic model performance by translating expert criteria into actionable feedback for diffusion models.
Details
Motivation: Medical data scarcity limits ML model generalizability, and existing diffusion models often produce medically inaccurate images. Expert domain knowledge is crucial for clinical accuracy, but current human feedback methods are labor-intensive.
Method: Proposes MAGIC framework that creatively translates expert-defined clinical criteria into actionable feedback for diffusion models, using multimodal LLMs as evaluators to reduce human workload while ensuring medical accuracy.
Result: Significantly improves clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Augmentation improves diagnostic accuracy by +9.02% on 20-condition classification and +13.89% in few-shot settings.
Conclusion: MAGIC framework effectively synthesizes clinically accurate medical images through AI-expert collaboration, addressing data scarcity while reducing human workload and improving diagnostic model performance.
Abstract: Paucity of medical data severely limits the generalizability of diagnostic ML models, as the full spectrum of disease variability can not be represented by a small clinical dataset. To address this, diffusion models (DMs) have been considered as a promising avenue for synthetic image generation and augmentation. However, they frequently produce medically inaccurate images, deteriorating the model performance. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce and quality outweighs quantity. Existing approaches for incorporating human feedback, such as reinforcement learning (RL) and Direct Preference Optimization (DPO), rely on robust reward functions or demand labor-intensive expert evaluations. Recent progress in Multimodal Large Language Models (MLLMs) reveals their strong visual reasoning capabilities, making them adept candidates as evaluators. In this work, we propose a novel framework, coined MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), that synthesizes clinically accurate skin disease images for data augmentation. Our method creatively translates expert-defined criteria into actionable feedback for image synthesis of DMs, significantly improving clinical accuracy while reducing the direct human workload. Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.
[212] One-Step Diffusion for Detail-Rich and Temporally Consistent Video Super-Resolution
Yujing Sun, Lingchen Sun, Shuaizheng Liu, Rongyuan Wu, Zhengqiang Zhang, Lei Zhang
Main category: cs.CV
TL;DR: DLoRAL is a novel Dual LoRA Learning paradigm for real-world video super-resolution that achieves both rich spatial details and temporal consistency using a two-phase LoRA training approach with cross-frame retrieval.
Details
Motivation: Existing SD-based Real-VSR methods often sacrifice spatial details for temporal coherence, leading to suboptimal visual quality. The challenge is to effectively extract degradation-robust temporal consistency priors from low-quality input while enhancing video details.
Method: Proposes a Dual LoRA Learning paradigm with two alternating phases: 1) Consistency-LoRA (C-LoRA) learns robust temporal representations using Cross-Frame Retrieval, 2) Detail-LoRA (D-LoRA) enhances spatial details while aligning with the temporal space. The two LoRA branches are merged during inference for efficient one-step diffusion.
Result: DLoRAL achieves strong performance in both accuracy and speed, delivering consistent and detail-rich video outputs. Experiments demonstrate superior results compared to existing methods.
Conclusion: The proposed DLoRAL framework successfully addresses the trade-off between spatial detail enhancement and temporal consistency in real-world video super-resolution, enabling high-quality video restoration in a single diffusion step.
Abstract: It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
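Merging the two LoRA branches into the frozen base weight at inference is a standard low-rank update; the sketch below shows that merge under assumed ranks and scaling factors, purely to illustrate why a single diffusion step suffices once merging is done.

```python
# Merging two LoRA branches into one base weight for single-step inference (illustrative).
import torch

def merge_dual_lora(w_base, lora_c, lora_d, s_c=1.0, s_d=1.0):
    # each lora_* is (A, B) with A: (r, in_dim) and B: (out_dim, r)
    a_c, b_c = lora_c
    a_d, b_d = lora_d
    return w_base + s_c * (b_c @ a_c) + s_d * (b_d @ a_d)

w0 = torch.randn(512, 512)                                   # frozen base weight
consistency_lora = (torch.randn(8, 512), torch.randn(512, 8))
detail_lora = (torch.randn(8, 512), torch.randn(512, 8))
w_merged = merge_dual_lora(w0, consistency_lora, detail_lora)
```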
[213] With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You
Fabian Gröger, Shuo Wen, Huyen Le, Maria Brbić
Main category: cs.CV
TL;DR: Multimodal models can be built with limited paired data by aligning pretrained unimodal models using STRUCTURE regularization and optimal layer alignment, achieving strong performance with only tens of thousands of samples.
Details
Motivation: Existing multimodal models require millions of paired samples which are expensive or infeasible to obtain in many domains. This work explores building multimodal models with limited paired data by aligning pretrained unimodal foundation models.
Method: Introduces STRUCTURE regularization to preserve neighborhood geometry of unimodal encoders' latent spaces, and aligns layers with highest representational similarity across modalities rather than just last layers.
Result: Achieves high-quality alignment with as few as tens of thousands of paired samples (less than 1% of typical data). Shows 51.6% average improvement in classification and 91.8% in retrieval across 24 benchmarks.
Conclusion: The framework is effective for limited-sample multimodal learning and offers a promising path for resource-constrained domains, demonstrating substantial gains when incorporated into existing alignment methods.
Abstract: Multimodal models have demonstrated powerful capabilities in complex tasks requiring multimodal alignment, including zero-shot classification and cross-modal retrieval. However, existing models typically rely on millions of paired multimodal samples, which are prohibitively expensive or infeasible to obtain in many domains. In this work, we explore the feasibility of building multimodal models with a limited amount of paired data by aligning pretrained unimodal foundation models. We show that high-quality alignment is possible with as few as tens of thousands of paired samples, less than 1% of the data typically used in the field. To achieve this, we introduce STRUCTURE, an effective regularization technique that preserves the neighborhood geometry of the latent space of unimodal encoders. Additionally, we show that aligning last layers is often suboptimal and demonstrate the benefits of aligning the layers with the highest representational similarity across modalities. These two components can be readily incorporated into existing alignment methods, yielding substantial gains across 24 zero-shot image classification and retrieval benchmarks, with average relative improvement of 51.6% in classification and 91.8% in retrieval tasks. Our results highlight the effectiveness and broad applicability of our framework for limited-sample multimodal learning and offer a promising path forward for resource-constrained domains.
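A neighborhood-geometry-preserving regularizer can be sketched as keeping the pairwise similarity structure of the frozen unimodal embeddings after the trainable alignment head. The snippet below is a minimal illustration in that spirit; the exact STRUCTURE formulation is defined in the paper.

```python
# Minimal geometry-preserving regularizer sketch (not the paper's exact loss).
import torch
import torch.nn.functional as F

def structure_reg(z_pretrained, z_aligned):
    # z_pretrained: (B, D)  frozen unimodal embeddings
    # z_aligned:    (B, D') embeddings after the trainable alignment head
    s_pre = F.normalize(z_pretrained, dim=-1) @ F.normalize(z_pretrained, dim=-1).T
    s_ali = F.normalize(z_aligned, dim=-1) @ F.normalize(z_aligned, dim=-1).T
    return F.mse_loss(s_ali, s_pre)          # match pairwise cosine-similarity structure

reg = structure_reg(torch.randn(32, 768), torch.randn(32, 512))
```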
[214] Chiron-o1: Igniting Multimodal Large Language Models towards Generalizable Medical Reasoning via Mentor-Intern Collaborative Search
Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, Xiaosong Wang
Main category: cs.CV
TL;DR: Proposes MICS framework for generating medical chain-of-thought data and introduces Chiron-o1 model that achieves SOTA performance on medical reasoning benchmarks.
Details
Motivation: Existing approaches lack comprehensive frameworks for searching and evaluating effective reasoning paths in medical diagnosis, limiting MLLM applications in healthcare.
Method: MICS uses mentor models to initialize reasoning steps, then prompts intern models to continue thinking, selecting optimal paths based on MICS-Score evaluation. Constructs MMRP dataset and Chiron-o1 model via curriculum learning.
Result: Chiron-o1 achieves state-of-the-art performance across medical visual question answering and reasoning benchmarks when trained on CoT data generated by MICS.
Conclusion: MICS provides an effective framework for generating rigorous medical reasoning data, enabling development of robust medical MLLMs with strong reasoning capabilities.
Abstract: Multimodal large language models (MLLMs) have begun to demonstrate robust reasoning capabilities on general tasks, yet their application in the medical domain remains in its early stages. Constructing chain-of-thought (CoT) training data is essential for bolstering the reasoning abilities of medical MLLMs. However, existing approaches exhibit a deficiency in offering a comprehensive framework for searching and evaluating effective reasoning paths towards critical diagnosis. To address this challenge, we propose Mentor-Intern Collaborative Search (MICS), a novel reasoning-path searching scheme to generate rigorous and effective medical CoT data. MICS first leverages mentor models to initialize the reasoning, one step at a time, then prompts each intern model to continue the thinking along those initiated paths, and finally selects the optimal reasoning path according to the overall reasoning performance of multiple intern models. The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths. Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy, with robust visual question-answering and generalizable reasoning capabilities. Extensive experiments demonstrate that Chiron-o1, trained on our CoT dataset constructed using MICS, achieves state-of-the-art performance across a list of medical visual question answering and reasoning benchmarks. Codes are available at https://github.com/manglu097/Chiron-o1
[215] Towards foundational LiDAR world models with efficient latent flow matching
Tianran Liu, Shengwen Zhao, Nicholas Rhinehart
Main category: cs.CV
TL;DR: LiDAR world models with strong cross-domain transferability achieve significant improvements over training from scratch, reduce data annotation needs by 95%, and introduce a latent CFM framework that doubles computational efficiency while maintaining SOTA performance.
Details
Motivation: Existing LiDAR world models are narrowly trained and lack transferability across domains. The paper aims to develop models that can generalize across outdoor-indoor environments, different LiDAR beam densities, and non-semantic to semantic tasks.
Method: Proposed a latent conditional flow matching (CFM)-based framework that improves data compression and training efficiency. Conducted systematic domain transfer studies across three scenarios: outdoor-indoor generalization, sparse-dense beam adaptation, and non-semantic to semantic transfer.
Result: Achieved up to 11% absolute improvement (83% relative) over training from scratch, outperforming scratch training in 30/36 comparisons. Reduced labeled data requirements by 95% for semantic occupancy forecasting. Achieved 6x higher compression ratio and 23x computational efficiency improvement with SOTA performance.
Conclusion: The proposed LiDAR world models demonstrate strong cross-domain transferability, significantly reducing data annotation requirements while achieving state-of-the-art performance with substantially improved computational efficiency.
Abstract: LiDAR-based world models offer more structured and geometry-aware representations than their image-based counterparts. However, existing LiDAR world models are narrowly trained; each model excels only in the domain for which it was built. Can we develop LiDAR world models that exhibit strong transferability across multiple domains? We conduct the first systematic domain transfer study across three demanding scenarios: (i) outdoor to indoor generalization, (ii) sparse-beam & dense-beam adaptation, and (iii) non-semantic to semantic transfer. Given different amounts of fine-tuning data, our experiments show that a single pre-trained model can achieve up to 11% absolute improvement (83% relative) over training from scratch and outperforms training from scratch in 30/36 of our comparisons. This transferability of dynamic learning significantly reduces the reliance on manually annotated data for semantic occupancy forecasting: our method exceeds the previous semantic occupancy forecasting models with only 5% of the labeled training data required by prior models. We also observed inefficiencies of current LiDAR world models, mainly through their under-compression of LiDAR data and inefficient training objectives. To address this, we propose a latent conditional flow matching (CFM)-based framework that achieves state-of-the-art reconstruction accuracy using only half the training data and a compression ratio 6 times higher than that of prior methods. Our model achieves SOTA performance on future-trajectory-conditioned semantic occupancy forecasting while being 23x more computationally efficient (a 28x FPS speedup), and achieves SOTA performance on semantic occupancy forecasting while being 2x more computationally efficient (a 1.1x FPS speedup).
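For readers unfamiliar with conditional flow matching, the generic training loss regresses a velocity field along straight-line interpolations between noise and data in latent space. The sketch below shows that standard formulation; the paper's latent CFM framework, latent encoder, and conditioning details will differ.

```python
# Generic conditional flow matching loss in a latent space (standard formulation).
import torch
import torch.nn.functional as F

def cfm_loss(model, z1, cond):
    # z1: (B, D) target latents (e.g., encoded future occupancy); cond: (B, C) conditioning
    z0 = torch.randn_like(z1)                      # noise source sample
    t = torch.rand(z1.shape[0], 1)                 # uniform time in [0, 1]
    zt = (1 - t) * z0 + t * z1                     # straight-line interpolation
    target_velocity = z1 - z0                      # constant velocity along the path
    pred_velocity = model(zt, t, cond)
    return F.mse_loss(pred_velocity, target_velocity)

class TinyVelocityNet(torch.nn.Module):
    def __init__(self, d=64, c=16):
        super().__init__()
        self.net = torch.nn.Sequential(torch.nn.Linear(d + 1 + c, 128),
                                       torch.nn.SiLU(), torch.nn.Linear(128, d))
    def forward(self, zt, t, cond):
        return self.net(torch.cat([zt, t, cond], dim=-1))

loss = cfm_loss(TinyVelocityNet(), torch.randn(8, 64), torch.randn(8, 16))
```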
[216] On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification
Jonas Klotz, Tom Burgert, Begüm Demir
Main category: cs.CV
TL;DR: This paper analyzes the effectiveness of explainable AI methods and evaluation metrics for remote sensing image scene classification, identifying limitations in both explanation methods and metrics while providing guidelines for their selection.
Details
Motivation: Most xAI methods and evaluation metrics in remote sensing were initially developed for natural images in computer vision, and their direct application to RS may not be suitable due to different characteristics of RS scenes.
Method: Methodologically and experimentally analyzed ten explanation metrics across five categories (faithfulness, robustness, localization, complexity, randomization) applied to five feature attribution methods (Occlusion, LIME, GradCAM, LRP, DeepLIFT) across three RS datasets.
Result: Identified key limitations: perturbation-based methods depend on baselines and spatial characteristics; gradient-based methods struggle with multiple labels; relevance propagation can distribute relevance disproportionately. Faithfulness metrics share perturbation method problems; localization and complexity metrics are unreliable for large spatial extent classes; robustness and randomization metrics show greater stability.
Conclusion: Provided guidelines for selecting explanation methods, metrics, and hyperparameters in RS image scene classification based on the identified limitations and performance characteristics.
Abstract: The development of explainable artificial intelligence (xAI) methods for scene classification problems has attracted great attention in remote sensing (RS). Most xAI methods and the related evaluation metrics in RS are initially developed for natural images considered in computer vision (CV), and their direct usage in RS may not be suitable. To address this issue, in this paper, we investigate the effectiveness of explanation methods and metrics in the context of RS image scene classification. In detail, we methodologically and experimentally analyze ten explanation metrics spanning five categories (faithfulness, robustness, localization, complexity, randomization), applied to five established feature attribution methods (Occlusion, LIME, GradCAM, LRP, and DeepLIFT) across three RS datasets. Our methodological analysis identifies key limitations in both explanation methods and metrics. The performance of perturbation-based methods, such as Occlusion and LIME, heavily depends on perturbation baselines and spatial characteristics of RS scenes. Gradient-based approaches like GradCAM struggle when multiple labels are present in the same image, while some relevance propagation methods (LRP) can distribute relevance disproportionately relative to the spatial extent of classes. Analogously, we find limitations in evaluation metrics. Faithfulness metrics share the same problems as perturbation-based methods. Localization metrics and complexity metrics are unreliable for classes with a large spatial extent. In contrast, robustness metrics and randomization metrics consistently exhibit greater stability. Our experimental results support these methodological findings. Based on our analysis, we provide guidelines for selecting explanation methods, metrics, and hyperparameters in the context of RS image scene classification.
[217] Where are we with calibration under dataset shift in image classification?
Mélanie Roschewitz, Raghav Mehta, Fabio de Sousa Ribeiro, Ben Glocker
Main category: cs.CV
TL;DR: Comprehensive study on calibration robustness under dataset shift for image classification, comparing post-hoc and in-training methods across multiple domains and shifts.
Details
Motivation: To provide practical guidelines for robust calibration under real-world dataset shift and understand interactions between different calibration techniques.
Method: Extensive comparison of post-hoc calibration methods and in-training strategies (label smoothing, entropy regularization) across 8 classification tasks with natural shifts, testing on both randomly initialized and foundation model-finetuned classifiers.
Result: Best calibration under shift achieved with entropy regularization + label smoothing; OOD-exposed post-hoc calibrators most robust; simple methods often outperform specialized shift-calibration techniques; calibration improvements under shift trade off with in-distribution performance; foundation models consistently better calibrated; ensembling before calibration more effective.
Conclusion: Ensembling combined with foundation model finetuning yields best overall calibration results, with practical guidelines for calibration strategy selection under dataset shift.
Abstract: We conduct an extensive study on the state of calibration under real-world dataset shift for image classification. Our work provides important insights on the choice of post-hoc and in-training calibration techniques, and yields practical guidelines for all practitioners interested in robust calibration under shift. We compare various post-hoc calibration methods, and their interactions with common in-training calibration strategies (e.g., label smoothing), across a wide range of natural shifts, on eight different classification tasks across several imaging domains. We find that: (i) simultaneously applying entropy regularisation and label smoothing yield the best calibrated raw probabilities under dataset shift, (ii) post-hoc calibrators exposed to a small amount of semantic out-of-distribution data (unrelated to the task) are most robust under shift, (iii) recent calibration methods specifically aimed at increasing calibration under shifts do not necessarily offer significant improvements over simpler post-hoc calibration methods, (iv) improving calibration under shifts often comes at the cost of worsening in-distribution calibration. Importantly, these findings hold for randomly initialised classifiers, as well as for those finetuned from foundation models, the latter being consistently better calibrated compared to models trained from scratch. Finally, we conduct an in-depth analysis of ensembling effects, finding that (i) applying calibration prior to ensembling (instead of after) is more effective for calibration under shifts, (ii) for ensembles, OOD exposure deteriorates the ID-shifted calibration trade-off, (iii) ensembling remains one of the most effective methods to improve calibration robustness and, combined with finetuning from foundation models, yields best calibration results overall.
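The two in-training ingredients highlighted here (label smoothing plus entropy regularisation) and a standard post-hoc temperature-scaling calibrator can be sketched in a few lines; the weights, optimiser settings, and the exact form of the entropy penalty below are illustrative, not the study's configuration.

```python
# Label smoothing + entropy regularisation, and post-hoc temperature scaling (illustrative).
import torch
import torch.nn.functional as F

def train_loss(logits, targets, smoothing=0.1, ent_weight=0.1):
    ce = F.cross_entropy(logits, targets, label_smoothing=smoothing)
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean()
    return ce - ent_weight * entropy              # reward higher predictive entropy

def fit_temperature(val_logits, val_targets, steps=200, lr=0.01):
    log_t = torch.zeros(1, requires_grad=True)    # optimise log-temperature for stability
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_targets)
        loss.backward()
        opt.step()
    return log_t.exp().item()                     # divide test logits by this temperature

loss = train_loss(torch.randn(32, 10), torch.randint(0, 10, (32,)))
t = fit_temperature(torch.randn(128, 10), torch.randint(0, 10, (128,)))
```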
[218] Latent Diffusion Models with Masked AutoEncoders
Junho Lee, Jeongwoo Shin, Hyungwook Choi, Joonseok Lee
Main category: cs.CV
TL;DR: The paper analyzes autoencoder design in Latent Diffusion Models (LDMs), identifies three key properties (latent smoothness, perceptual compression quality, reconstruction quality), and proposes Variational Masked AutoEncoders (VMAEs) that integrate with LDMs as LDMAEs.
Details
Motivation: Existing autoencoders in LDMs fail to simultaneously satisfy three key properties: latent smoothness, perceptual compression quality, and reconstruction quality, limiting their full potential.
Method: Proposed Variational Masked AutoEncoders (VMAEs) that leverage hierarchical features from Masked AutoEncoders and integrate them into the LDM framework as LDMAEs.
Result: The proposed VMAEs address the limitations of existing autoencoders by better satisfying all three key properties simultaneously.
Conclusion: LDMAEs with VMAEs provide improved autoencoder design for Latent Diffusion Models, enabling better performance across multiple desired properties.
Abstract: In spite of the remarkable potential of Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoders. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Our code is available at https://github.com/isno0907/ldmae.
[219] Towards Depth Foundation Model: Recent Trends in Vision-Based Depth Estimation
Zhen Xu, Hongyu Zhou, Sida Peng, Haotong Lin, Haoyu Guo, Jiahao Shao, Peishan Yang, Qinglin Yang, Sheng Miao, Xingyi He, Yifan Wang, Yue Wang, Ruizhen Hu, Yiyi Liao, Xiaowei Zhou, Hujun Bao
Main category: cs.CV
TL;DR: This paper surveys depth foundation models - deep neural networks trained on large datasets for robust zero-shot depth estimation, covering evolution of architectures across monocular, stereo, multi-view and video settings.
Details
Motivation: Traditional depth estimation methods using hardware sensors like LiDAR have limitations in cost, resolution and environmental sensitivity, while current vision-based methods face generalization and stability challenges due to low-capacity models or small datasets.
Method: Comprehensive survey of deep learning architectures and paradigms for depth estimation across different settings (monocular, stereo, multi-view, video), analysis of large-scale datasets, and identification of key architectures and training strategies.
Result: The paper provides a systematic overview of the evolution towards depth foundation models, highlighting their potential to address existing challenges in depth estimation through large-scale training and strong generalization capabilities.
Conclusion: Depth foundation models represent a promising direction for robust depth estimation, with the survey offering insights into future research paths and applications by identifying key architectures and training strategies.
Abstract: Depth estimation is a fundamental task in 3D computer vision, crucial for applications such as 3D reconstruction, free-viewpoint rendering, robotics, autonomous driving, and AR/VR technologies. Traditional methods relying on hardware sensors like LiDAR are often limited by high costs, low resolution, and environmental sensitivity, limiting their applicability in real-world scenarios. Recent advances in vision-based methods offer a promising alternative, yet they face challenges in generalization and stability due to either the low-capacity model architectures or the reliance on domain-specific and small-scale datasets. The emergence of scaling laws and foundation models in other domains has inspired the development of “depth foundation models”: deep neural networks trained on large datasets with strong zero-shot generalization capabilities. This paper surveys the evolution of deep learning architectures and paradigms for depth estimation across the monocular, stereo, multi-view, and monocular video settings. We explore the potential of these models to address existing challenges and provide a comprehensive overview of large-scale datasets that can facilitate their development. By identifying key architectures and training strategies, we aim to highlight the path towards robust depth foundation models, offering insights into their future research and applications.
[220] Rethinking Backbone Design for Lightweight 3D Object Detection in LiDAR
Adwait Chandorkar, Hasan Tercan, Tobias Meisen
Main category: cs.CV
TL;DR: Dense Backbone is a lightweight backbone for 3D object detection that reduces computational costs while maintaining detection accuracy, achieving 29% parameter reduction and 28% latency reduction with only 2% accuracy drop.
Details
Motivation: Current LiDAR-based 3D object detection methods rely on computationally expensive VGG/ResNet backbones, creating a need for lightweight alternatives that maintain performance while reducing complexity.
Method: Introduces Dense Backbone, a dense-layer-based backbone specifically designed for 3D object detection from point clouds, featuring plug-and-play integration with existing detectors like PillarNet.
Result: DensePillarNet achieves 29% reduction in model parameters, 28% reduction in latency, with only 2% drop in detection accuracy on nuScenes test set.
Conclusion: Dense Backbone provides an effective lightweight solution for 3D object detection that significantly reduces computational costs while preserving detection performance, with easy integration into existing architectures.
Abstract: Recent advancements in LiDAR-based 3D object detection have significantly accelerated progress toward the realization of fully autonomous driving in real-world environments. Despite achieving high detection performance, most of the approaches still rely on a VGG-based or ResNet-based backbone for feature exploration, which increases the model complexity. Lightweight backbone design is well-explored for 2D object detection, but research on 3D object detection still remains limited. In this work, we introduce Dense Backbone, a lightweight backbone that combines the benefits of high processing speed, lightweight architecture, and robust detection accuracy. We adapt multiple state-of-the-art 3D object detectors, such as PillarNet, to our backbone and show that these models retain most of their detection capability at a significantly reduced computational cost. To our knowledge, this is the first dense-layer-based backbone tailored specifically for 3D object detection from point cloud data. DensePillarNet, our adaptation of PillarNet, achieves a 29% reduction in model parameters and a 28% reduction in latency with just a 2% drop in detection accuracy on the nuScenes test set. Furthermore, Dense Backbone’s plug-and-play design allows straightforward integration into existing architectures, requiring no modifications to other network components.
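As a rough illustration of what a "dense-layer-based" backbone stage looks like, here is a DenseNet-style block sketch in PyTorch. This is an assumption-laden reconstruction for intuition only, not the authors' Dense Backbone; channel counts and layer depth are arbitrary.

```python
import torch
import torch.nn as nn

class DenseBlock2D(nn.Module):
    """DenseNet-style block: each layer sees the concatenation of all earlier features."""
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1, bias=False),
            ))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            new_feat = layer(torch.cat(features, dim=1))  # dense connectivity
            features.append(new_feat)
        return torch.cat(features, dim=1)
```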
[221] Backpropagation-Free Test-Time Adaptation via Probabilistic Gaussian Alignment
Youjia Zhang, Youngeun Kim, Young-Geun Choi, Hongyeob Kim, Huiling Liu, Sungeun Hong
Main category: cs.CV
TL;DR: ADAPT is a backpropagation-free test-time adaptation method that reframes TTA as Gaussian probabilistic inference, using class-conditional likelihoods with CLIP-guided regularization for improved robustness under distribution shifts.
Details
Motivation: Existing TTA methods rely on backpropagation/iterative optimization which limits scalability and real-time deployment, and lack explicit modeling of class-conditional feature distributions needed for reliable decision boundaries and calibrated predictions.
Method: Models class-conditional likelihoods using gradually updated class means and shared covariance matrix, enabling closed-form training-free inference. Uses lightweight CLIP-guided regularization and historical knowledge bank to correct potential likelihood bias.
Result: Achieves state-of-the-art performance across diverse benchmarks under various distribution shifts with superior scalability and robustness, requiring no source data, gradient updates, or full access to target data.
Conclusion: ADAPT provides an effective backpropagation-free solution for test-time adaptation that enables reliable distribution shift robustness through probabilistic modeling and CLIP-guided regularization.
Abstract: Test-time adaptation (TTA) enhances the zero-shot robustness under distribution shifts by leveraging unlabeled test data during inference. Despite notable advances, several challenges still limit its broader applicability. First, most methods rely on backpropagation or iterative optimization, which limits scalability and hinders real-time deployment. Second, they lack explicit modeling of class-conditional feature distributions. This modeling is crucial for producing reliable decision boundaries and calibrated predictions, but it remains underexplored due to the lack of both source data and supervision at test time. In this paper, we propose ADAPT, an Advanced Distribution-Aware and backPropagation-free Test-time adaptation method. We reframe TTA as a Gaussian probabilistic inference task by modeling class-conditional likelihoods using gradually updated class means and a shared covariance matrix. This enables closed-form, training-free inference. To correct potential likelihood bias, we introduce lightweight regularization guided by CLIP priors and a historical knowledge bank. ADAPT requires no source data, no gradient updates, and no full access to target data, supporting both online and transductive settings. Extensive experiments across diverse benchmarks demonstrate that our method achieves state-of-the-art performance under a wide range of distribution shifts with superior scalability and robustness.
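The closed-form, training-free inference step (class-conditional Gaussians with per-class means and a shared covariance) can be sketched as follows. This is a generic Gaussian-discriminant-style computation assuming PyTorch; ADAPT's running mean/covariance updates, CLIP-guided regularization, and knowledge bank are omitted.

```python
import torch

def gaussian_posterior(features, class_means, shared_cov, priors=None, eps=1e-4):
    """Class posteriors from class-conditional Gaussians with a shared covariance.

    features: [N, D] test embeddings, class_means: [C, D], shared_cov: [D, D].
    """
    d = features.shape[-1]
    cov = shared_cov + eps * torch.eye(d, device=features.device)   # regularise
    precision = torch.linalg.inv(cov)
    # Squared Mahalanobis distance to every class mean: (x - mu)^T Sigma^{-1} (x - mu)
    diff = features.unsqueeze(1) - class_means.unsqueeze(0)         # [N, C, D]
    maha = torch.einsum('ncd,de,nce->nc', diff, precision, diff)    # [N, C]
    log_likelihood = -0.5 * maha
    if priors is not None:
        log_likelihood = log_likelihood + torch.log(priors).unsqueeze(0)
    return torch.softmax(log_likelihood, dim=-1)                    # class posteriors
```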
[222] Training-Free Label Space Alignment for Universal Domain Adaptation
Dujin Lee, Sojung An, Jungmyung Wi, Kuniaki Saito, Donghyun Kim
Main category: cs.CV
TL;DR: A novel universal domain adaptation method that leverages vision-language models (CLIP) for label space alignment instead of visual space alignment, achieving significant performance improvements over existing methods.
Details
Motivation: Previous UniDA methods focused on visual space alignment but struggled with visual ambiguities due to content differences, limiting robustness and generalizability.
Method: Uses generative vision-language models to identify unknown categories, then proposes a training-free label-space alignment method that filters and refines noisy labels between domains, constructing a universal classifier that integrates shared knowledge and target-private class information.
Result: Significantly outperforms existing UniDA techniques with average improvements of +7.9% in H-score and +6.1% in H³-score across DomainBed benchmarks. Self-training further enhances performance by +1.6% in both metrics.
Conclusion: The proposed label-space alignment approach using VLMs provides more stable and generalizable domain adaptation compared to traditional visual space alignment methods, effectively handling unknown categories and semantic ambiguities.
Abstract: Universal domain adaptation (UniDA) transfers knowledge from a labeled source domain to an unlabeled target domain, where label spaces may differ and the target domain may contain private classes. Previous UniDA methods primarily focused on visual space alignment but often struggled with visual ambiguities due to content differences, which limited their robustness and generalizability. To overcome this, we introduce a novel approach that leverages the strong zero-shot capabilities of recent vision-language foundation models (VLMs) like CLIP, concentrating solely on label space alignment to enhance adaptation stability. CLIP can generate task-specific classifiers based only on label names. However, adapting CLIP to UniDA is challenging because the label space is not fully known in advance. In this study, we first utilize generative vision-language models to identify unknown categories in the target domain. Noise and semantic ambiguities in the discovered labels – such as those similar to source labels (e.g., synonyms, hypernyms, hyponyms) – complicate label alignment. To address this, we propose a training-free label-space alignment method for UniDA. Our method aligns label spaces instead of visual spaces by filtering and refining noisy labels between the domains. We then construct a universal classifier that integrates both shared knowledge and target-private class information, thereby improving generalizability under domain shifts. The results reveal that the proposed method considerably outperforms existing UniDA techniques across key DomainBed benchmarks, delivering an average improvement of +7.9% in H-score and +6.1% in H³-score. Furthermore, incorporating self-training further enhances performance and achieves an additional +1.6% increment in both H- and H³-scores.
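Since the method builds classifiers from label names alone, a zero-shot CLIP classifier over a candidate label set is the natural starting point. The sketch below uses the Hugging Face `transformers` CLIP API with illustrative label names; the paper's label discovery, filtering, and universal-classifier construction are not reproduced here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["dog", "bicycle", "backpack"]          # example shared + discovered target labels
prompts = [f"a photo of a {name}" for name in labels]

with torch.no_grad():
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def classify(image):
    """Assign an unlabeled target image to one of the label names (zero-shot)."""
    with torch.no_grad():
        img_inputs = processor(images=image, return_tensors="pt")
        img_emb = model.get_image_features(**img_inputs)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        logits = img_emb @ text_emb.T                # cosine similarity to each label name
    return labels[logits.argmax(dim=-1).item()]
```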
[223] Breaking the Discretization Barrier of Continuous Physics Simulation Learning
Fan Xu, Hao Wu, Nan Wang, Lilan Peng, Kun Wang, Wei Gong, Xibin Zhao
Main category: cs.CV
TL;DR: CoPS is a data-driven method for continuous physics simulation from partial observations, using multiplicative filter networks, geometric grids, and multi-scale graph ODEs with neural auto-correction.
Details
Motivation: Existing data-driven approaches are constrained by fixed spatial and temporal discretization, and struggle with sparsely distributed observations in nonlinear physical dynamics.
Method: Uses multiplicative filter networks to encode spatial information, custom geometric grids with message-passing, multi-scale graph ODEs for continuous-time dynamics, and Markov-based neural auto-correction module.
Result: Comprehensive experiments show CoPS advances state-of-the-art methods in space-time continuous modeling across various scenarios.
Conclusion: CoPS successfully addresses limitations of discretization in modeling continuous physics from partial observations through its novel architecture.
Abstract: The modeling of complicated time-evolving physical dynamics from partial observations is a long-standing challenge. Particularly, observations can be sparsely distributed in a seemingly random or unstructured manner, making it difficult to capture highly nonlinear features in a variety of scientific and engineering problems. However, existing data-driven approaches are often constrained by fixed spatial and temporal discretization. While some researchers attempt to achieve spatio-temporal continuity by designing novel strategies, they either overly rely on traditional numerical methods or fail to truly overcome the limitations imposed by discretization. To address these limitations, we propose CoPS, a purely data-driven method, to effectively model continuous physics simulation from partial observations. Specifically, we employ a multiplicative filter network to fuse and encode spatial information with the corresponding observations. Then we customize geometric grids and use a message-passing mechanism to map features from the original spatial domain to the customized grids. Subsequently, CoPS models continuous-time dynamics by designing multi-scale graph ODEs, while introducing a Markov-based neural auto-correction module to assist and constrain the continuous extrapolations. Comprehensive experiments demonstrate that CoPS advances the state-of-the-art methods in space-time continuous modeling across various scenarios.
[224] The Photographer Eye: Teaching Multimodal Large Language Models to Understand Image Aesthetics like Photographers
Daiqing Qi, Handong Zhao, Jing Shi, Simon Jenni, Yifei Fan, Franck Dernoncourt, Scott Cohen, Sheng Li
Main category: cs.CV
TL;DR: The paper introduces PhotoCritique dataset, PhotoEye model, and PhotoBench benchmark to enhance MLLMs’ aesthetic visual understanding beyond basic object recognition.
Details
Motivation: Current MLLMs struggle with aesthetic visual understanding, focusing mainly on factual elements rather than aesthetic components like color, lighting, and composition. There's a gap between general visual understanding and professional aesthetic analysis.
Method: Created PhotoCritique dataset from professional photographer discussions, developed PhotoEye model with language-guided multi-view vision fusion, and established PhotoBench benchmark for comprehensive aesthetic evaluation.
Result: The proposed model demonstrates clear advantages over existing models on both existing benchmarks and the new PhotoBench benchmark, showing improved aesthetic understanding capabilities.
Conclusion: The work fundamentally enhances MLLMs’ aesthetic visual understanding through specialized dataset, model architecture, and evaluation benchmark, bridging the gap between general and aesthetic visual comprehension.
Abstract: While editing directly from life, photographers have found it too difficult to see simultaneously both the blue and the sky. Photographer and curator, Szarkowski insightfully revealed one of the notable gaps between general and aesthetic visual understanding: while the former focuses on identifying the factual element in an image (sky), the latter transcends such object identification, viewing it instead as an aesthetic component–a pure color block (blue). Such fundamental distinctions between general (detection, localization, etc.) and aesthetic (color, lighting, composition, etc.) visual understanding present a significant challenge for Multimodal Large Language Models (MLLMs). Although some recent works have made initial explorations, they are often limited to general and basic aesthetic commonsense. As a result, they frequently fall short in real-world scenarios (Fig. 1), which require extensive expertise–including photographic techniques, photo pre/post-processing knowledge, and more, to provide a detailed analysis and description. To fundamentally enhance the aesthetics understanding of MLLMs, we first introduce a novel dataset, PhotoCritique, derived from extensive discussions among professional photographers and enthusiasts, and characterized by its large scale, expertise, and diversity. Then, to better learn visual aesthetics from PhotoCritique, we further propose a novel model, PhotoEye, featuring a language-guided multi-view vision fusion mechanism to understand image aesthetics from multiple perspectives. Finally, we present a novel benchmark, PhotoBench, a comprehensive and professional benchmark for aesthetic visual understanding. On existing benchmarks and PhotoBench, our model demonstrates clear advantages over existing models.
[225] Weakly Supervised Food Image Segmentation using Vision Transformers and Segment Anything Model
Ioannis Sarafis, Alexandros Papadopoulos, Anastasios Delopoulos
Main category: cs.CV
TL;DR: A weakly supervised semantic segmentation method for food images using ViT-generated CAMs as prompts for SAM, achieving mIoU of 0.54 on FoodSeg103 dataset without pixel-level annotations.
Details
Motivation: To develop a food image segmentation approach that eliminates the need for expensive pixel-level annotations by leveraging zero-shot capabilities of SAM and attention mechanisms of ViTs.
Method: Uses class activation maps (CAMs) from Swin Transformer ViT trained with image-level annotations as prompts for SAM, combined with image preprocessing and single/multi-mask generation strategies.
Result: Achieved mIoU of 0.54 for multi-mask scenario on FoodSeg103 dataset, generating average 2.4 masks per image (excluding background).
Conclusion: The approach can accelerate food image annotation and be integrated into food/nutrition tracking applications as a practical weakly supervised segmentation tool.
Abstract: In this paper, we propose a weakly supervised semantic segmentation approach for food images which takes advantage of the zero-shot capabilities and promptability of the Segment Anything Model (SAM) along with the attention mechanisms of Vision Transformers (ViTs). Specifically, we use class activation maps (CAMs) from ViTs to generate prompts for SAM, resulting in masks suitable for food image segmentation. The ViT model, a Swin Transformer, is trained exclusively using image-level annotations, eliminating the need for pixel-level annotations during training. Additionally, to enhance the quality of the SAM-generated masks, we examine the use of image preprocessing techniques in combination with single-mask and multi-mask SAM generation strategies. The methodology is evaluated on the FoodSeg103 dataset, generating an average of 2.4 masks per image (excluding background), and achieving an mIoU of 0.54 for the multi-mask scenario. We envision the proposed approach as a tool to accelerate food image annotation tasks or as an integrated component in food and nutrition tracking applications.
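A minimal sketch of the CAM-as-prompt idea: take peak locations from a class activation map (computed separately from a classifier trained with image-level labels) and pass them to SAM as foreground point prompts. The checkpoint path, activation threshold, and number of points below are assumptions, not the paper's settings.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint path
predictor = SamPredictor(sam)

def cam_to_sam_masks(image_rgb, cam, threshold=0.6, max_points=3):
    """image_rgb: HxWx3 uint8 image; cam: HxW class activation map for one food class."""
    predictor.set_image(image_rgb)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)    # normalise CAM to [0, 1]
    ys, xs = np.where(cam >= threshold)                         # high-activation pixels
    if len(xs) == 0:
        return None
    order = np.argsort(cam[ys, xs])[::-1][:max_points]          # strongest activations first
    points = np.stack([xs[order], ys[order]], axis=1)           # (x, y) prompt coordinates
    labels = np.ones(len(points), dtype=int)                    # all foreground prompts
    masks, scores, _ = predictor.predict(
        point_coords=points, point_labels=labels, multimask_output=True
    )
    return masks[np.argmax(scores)]                             # keep the highest-scoring mask
```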
[226] kabr-tools: Automated Framework for Multi-Species Behavioral Monitoring
Jenna Kline, Maksim Kholiavchenko, Samuel Stevens, Nina van Tiel, Alison Zhong, Namrata Banerji, Alec Sheets, Sowbaranika Balasubramaniam, Isla Duporge, Matthew Thompson, Elizabeth Campolongo, Jackson Miliko, Neil Rosser, Tanya Berger-Wolf, Charles V. Stewart, Daniel I. Rubenstein
Main category: cs.CV
TL;DR: kabr-tools is an open-source package for automated multi-species behavioral monitoring using drone-based video and machine learning to extract behavioral, social, and spatial metrics from wildlife footage.
Details
Motivation: Traditional field observations are limited in scope, time-consuming, and labor-intensive, hindering comprehensive assessment of behavioral responses across landscapes.
Method: Integration of drone-based video with machine learning systems including object detection, tracking, and behavioral classification to generate metrics like time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics.
Result: Drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. Analysis of 969 behavioral sequences revealed species-specific behavioral patterns and spatial segregation in mixed-species herds.
Conclusion: kabr-tools enables automated behavioral monitoring at scale, offering a powerful tool for ecosystem-wide studies that advances conservation, biodiversity research, and ecological monitoring.
Abstract: A comprehensive understanding of animal behavior ecology depends on scalable approaches to quantify and interpret complex, multidimensional behavioral patterns. Traditional field observations are often limited in scope, time-consuming, and labor-intensive, hindering the assessment of behavioral responses across landscapes. To address this, we present kabr-tools (Kenyan Animal Behavior Recognition Tools), an open-source package for automated multi-species behavioral monitoring. This framework integrates drone-based video with machine learning systems to extract behavioral, social, and spatial metrics from wildlife footage. Our pipeline leverages object detection, tracking, and behavioral classification systems to generate key metrics, including time budgets, behavioral transitions, social interactions, habitat associations, and group composition dynamics. Compared to ground-based methods, drone-based observations significantly improved behavioral granularity, reducing visibility loss by 15% and capturing more transitions with higher accuracy and continuity. We validate kabr-tools through three case studies, analyzing 969 behavioral sequences, surpassing the capacity of traditional methods for data capture and annotation. We found that, like Plains zebras, vigilance in Grevy’s zebras decreases with herd size, but, unlike Plains zebras, habitat has a negligible impact. Plains and Grevy’s zebras exhibit strong behavioral inertia, with rare transitions to alert behaviors and observed spatial segregation between Grevy’s zebras, Plains zebras, and giraffes in mixed-species herds. By enabling automated behavioral monitoring at scale, kabr-tools offers a powerful tool for ecosystem-wide studies, advancing conservation, biodiversity research, and ecological monitoring.
[227] How many samples to label for an application given a foundation model? Chest X-ray classification study
Nikolay Nechaev, Evgeniia Przhezdzetskaia, Viktor Gombolevskiy, Dmitry Umerenkov, Dmitry Dylov
Main category: cs.CV
TL;DR: Chest X-ray classification requires fewer labeled samples when using foundation models like XrayCLIP and XraySigLIP compared to ResNet-50 baseline, with learning curves from just 50 cases accurately predicting final performance.
Details
Motivation: Chest X-ray classification is resource-intensive and typically requires extensive annotated data. Foundation models can reduce this dependency, but the exact number of labeled samples needed remains unclear.
Method: Systematically evaluate power-law fits to predict training size needed for specific ROC-AUC thresholds. Test multiple pathologies and foundation models including XrayCLIP and XraySigLIP against ResNet-50 baseline.
Result: XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than ResNet-50. Learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus.
Conclusion: Practitioners can minimize annotation costs by labeling only essential samples for targeted performance, using foundation models that require fewer labeled cases than traditional approaches.
Abstract: Chest X-ray classification is vital yet resource-intensive, typically demanding extensive annotated data for accurate diagnosis. Foundation models mitigate this reliance, but how many labeled samples are required remains unclear. We systematically evaluate the use of power-law fits to predict the training size necessary for specific ROC-AUC thresholds. Testing multiple pathologies and foundation models, we find XrayCLIP and XraySigLIP achieve strong performance with significantly fewer labeled examples than a ResNet-50 baseline. Importantly, learning curve slopes from just 50 labeled cases accurately forecast final performance plateaus. Our results enable practitioners to minimize annotation costs by labeling only the essential samples for targeted performance.
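The power-law extrapolation idea (fit performance against training-set size on small pilot subsets, then solve for the size that reaches a target ROC-AUC) can be sketched with `scipy.optimize.curve_fit`. The pilot numbers below are purely illustrative, not results from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    # Saturating power law: performance approaches the plateau a as n grows.
    return a - b * np.power(n, -c)

n_labeled = np.array([10, 20, 30, 40, 50])            # pilot label budgets (illustrative)
auc = np.array([0.71, 0.76, 0.79, 0.81, 0.82])        # measured ROC-AUC on each subset

params, _ = curve_fit(power_law, n_labeled, auc, p0=[0.9, 1.0, 0.5], maxfev=10000)
a, b, c = params

target_auc = 0.85
if target_auc < a:
    n_needed = (b / (a - target_auc)) ** (1.0 / c)    # invert the fitted curve
    print(f"Estimated labels needed for AUC {target_auc}: ~{int(np.ceil(n_needed))}")
else:
    print("Target AUC exceeds the fitted plateau; more labels alone may not reach it.")
```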
[228] ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang
Main category: cs.CV
TL;DR: ImagerySearch is a prompt-guided adaptive test-time search strategy that dynamically adjusts inference search space and reward functions for better video generation in imaginative scenarios with long-distance semantic relationships.
Details
Motivation: Video generation models perform poorly in imaginative scenarios with rarely co-occurring concepts and long-distance semantic relationships that fall outside training distributions. Existing test-time scaling methods have fixed search spaces and static rewards that limit adaptability.
Method: Proposes ImagerySearch - a prompt-guided adaptive test-time search strategy that dynamically adjusts both inference search space and reward function based on semantic relationships in the prompt. Also introduces LDT-Bench, the first benchmark for long-distance semantic prompts with 2,839 diverse concept pairs.
Result: ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating effectiveness across diverse prompt types.
Conclusion: The proposed method enables more coherent and visually plausible videos in challenging imaginative settings, and the LDT-Bench benchmark will facilitate future research on imaginative video generation.
Abstract: Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling for improving video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.
[229] SparseWorld: A Flexible, Adaptive, and Efficient 4D Occupancy World Model Powered by Sparse and Dynamic Queries
Chenxu Dang, Haiyan Liu, Guangjun Bao, Pei An, Xinyue Tang, An Pan, Jie Ma, Bingchuan Sun, Yan Wang
Main category: cs.CV
TL;DR: SparseWorld is a novel 4D occupancy world model that uses sparse dynamic queries for flexible and adaptive perception, achieving state-of-the-art performance in autonomous driving tasks.
Details
Motivation: Existing occupancy world models rely on static embeddings/grids that limit perception flexibility and misalign with the dynamic nature of real scenarios.
Method: Proposes Range-Adaptive Perception with ego-modulated queries, State-Conditioned Forecasting using regression instead of classification, and Temporal-Aware Self-Scheduling training.
Result: Achieves state-of-the-art performance across perception, forecasting, and planning tasks, with advantages in flexibility, adaptability, and efficiency.
Conclusion: SparseWorld demonstrates superior performance through its sparse dynamic query approach, effectively addressing limitations of static occupancy models in dynamic environments.
Abstract: Semantic occupancy has emerged as a powerful representation in world models for its ability to capture rich spatial semantics. However, most existing occupancy world models rely on static and fixed embeddings or grids, which inherently limit the flexibility of perception. Moreover, their “in-place classification” over grids exhibits a potential misalignment with the dynamic and continuous nature of real scenarios. In this paper, we propose SparseWorld, a novel 4D occupancy world model that is flexible, adaptive, and efficient, powered by sparse and dynamic queries. We propose a Range-Adaptive Perception module, in which learnable queries are modulated by the ego vehicle states and enriched with temporal-spatial associations to enable extended-range perception. To effectively capture the dynamics of the scene, we design a State-Conditioned Forecasting module, which replaces classification-based forecasting with a regression-guided formulation, precisely aligning the dynamic queries with the continuity of the 4D environment. In addition, we specifically devise a Temporal-Aware Self-Scheduling training strategy to enable smooth and efficient training. Extensive experiments demonstrate that SparseWorld achieves state-of-the-art performance across perception, forecasting, and planning tasks. Comprehensive visualizations and ablation studies further validate the advantages of SparseWorld in terms of flexibility, adaptability, and efficiency. The code is available at https://github.com/MSunDYY/SparseWorld.
[230] Context-Aware Pseudo-Label Scoring for Zero-Shot Video Summarization
Yuanli Wu, Long Zhang, Yue Du, Bin Li
Main category: cs.CV
TL;DR: A zero-shot video summarization framework using rubric-guided pseudo labeling and prompt-driven LLM scoring that achieves competitive results without training.
Details
Motivation: To bridge large language models with structured semantic reasoning for video summarization without requiring parameter tuning or extensive human annotations.
Method: Convert human annotations into pseudo labels organized as dataset-adaptive rubrics, then use LLM scoring with boundary scenes evaluated independently and intermediate scenes incorporating adjacent segment summaries for narrative continuity.
Result: Achieved F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37 respectively.
Conclusion: Rubric-guided pseudo labeling with contextual prompting effectively stabilizes LLM-based scoring and establishes a general, interpretable, training-free paradigm for video summarization.
Abstract: We propose a rubric-guided, pseudo-labeled, and prompt-driven zero-shot video summarization framework that bridges large language models with structured semantic reasoning. A small subset of human annotations is converted into high-confidence pseudo labels and organized into dataset-adaptive rubrics defining clear evaluation dimensions such as thematic relevance, action detail, and narrative progression. During inference, boundary scenes, including the opening and closing segments, are scored independently based on their own descriptions, while intermediate scenes incorporate concise summaries of adjacent segments to assess narrative continuity and redundancy. This design enables the language model to balance local salience with global coherence without any parameter tuning. Across three benchmarks, the proposed method achieves stable and competitive results, with F1 scores of 57.58 on SumMe, 63.05 on TVSum, and 53.79 on QFVS, surpassing zero-shot baselines by +0.85, +0.84, and +0.37, respectively. These outcomes demonstrate that rubric-guided pseudo labeling combined with contextual prompting effectively stabilizes LLM-based scoring and establishes a general, interpretable, and training-free paradigm for both generic and query-focused video summarization.
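A minimal sketch of the context-aware prompting scheme: boundary scenes are scored from their own description, while intermediate scenes also see concise summaries of the adjacent segments. The rubric wording below is illustrative; only the evaluation dimensions are taken from the summary above.

```python
RUBRIC = (
    "Score the scene from 0-10 on: (1) thematic relevance, "
    "(2) action detail, (3) narrative progression. Return a single overall score."
)

def build_scoring_prompt(scene_desc, prev_summary=None, next_summary=None):
    """Boundary scenes pass no neighbour summaries; intermediate scenes pass both."""
    parts = [RUBRIC, f"Scene description: {scene_desc}"]
    if prev_summary or next_summary:  # intermediate scene: add context for continuity/redundancy
        parts.append(f"Previous segment summary: {prev_summary or 'N/A'}")
        parts.append(f"Next segment summary: {next_summary or 'N/A'}")
        parts.append("Penalise redundancy with neighbouring segments; reward narrative continuity.")
    return "\n".join(parts)
```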
[231] MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models
Yongshun Zhang, Zhongyi Fan, Yonghang Zhang, Zhangzikang Li, Weifeng Chen, Zhongwei Feng, Chaoyue Wang, Peng Hou, Anxiang Zeng
Main category: cs.CV
TL;DR: A training framework for large-scale video generation models that optimizes data processing, model architecture, training strategy, and infrastructure, resulting in MUG-V 10B model that matches state-of-the-art performance and surpasses baselines on e-commerce tasks.
Details
Motivation: Training large-scale video generation models is challenging due to cross-modal text-video alignment, long sequences, and complex spatiotemporal dependencies, requiring resource-intensive approaches.
Method: Optimized four pillars: data processing, model architecture, training strategy, and infrastructure. Used techniques including data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training.
Result: MUG-V 10B model matches state-of-the-art video generators overall and surpasses leading open-source baselines on e-commerce-oriented video generation tasks in human evaluations.
Conclusion: The complete stack including model weights, Megatron-Core-based training code, and inference pipelines is open-sourced, representing the first public release of large-scale video generation training code using Megatron-Core for high efficiency and near-linear multi-node scaling.
Abstract: In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling; details are available at https://github.com/Shopee-MUG/MUG-V.
[232] PAGE-4D: Disentangled Pose and Geometry Estimation for 4D Perception
Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang
Main category: cs.CV
TL;DR: PAGE-4D extends VGGT to handle dynamic scenes by introducing a dynamics-aware aggregator that disentangles static and dynamic information, enabling improved camera pose estimation, depth prediction, and point cloud reconstruction in dynamic scenarios.
Details
Motivation: Existing 3D feed-forward models like VGGT struggle with dynamic elements in real-world scenarios because they are trained on static datasets, limiting their performance with moving humans or deformable objects.
Method: Proposes a dynamics-aware aggregator that predicts a dynamics-aware mask to disentangle static and dynamic information - suppressing motion cues for pose estimation while amplifying them for geometry reconstruction.
Result: PAGE-4D consistently outperforms VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular/video depth estimation, and dense point map reconstruction.
Conclusion: The proposed dynamics-aware approach effectively resolves the inherent conflict between camera pose estimation and geometry reconstruction in dynamic scenes, enabling robust 4D reconstruction without post-processing.
Abstract: Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feedforward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction – all without post-processing. A central challenge in multi-task 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask – suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction.
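The disentangling idea (predict a dynamics-aware mask, suppress motion cues for the pose branch, amplify them for the geometry branch) can be sketched as a small gating module. This is a hypothetical reconstruction for intuition; the actual PAGE-4D aggregator is a richer architecture.

```python
import torch
import torch.nn as nn

class DynamicsAwareGate(nn.Module):
    """Predicts a per-token dynamics mask and gates features differently for pose and geometry."""
    def __init__(self, dim):
        super().__init__()
        self.mask_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, tokens):                      # tokens: [B, N, D]
        m = torch.sigmoid(self.mask_head(tokens))   # [B, N, 1], close to 1 for dynamic tokens
        pose_tokens = tokens * (1.0 - m)            # motion cues suppressed for pose estimation
        geom_tokens = tokens * (1.0 + m)            # motion cues amplified for geometry
        return pose_tokens, geom_tokens, m
```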
[233] Chimera: Compositional Image Generation using Part-based Concepting
Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, Chitta Baral
Main category: cs.CV
TL;DR: Chimera is a personalized image generation model that creates novel objects by combining specified parts from different source images using textual instructions, without requiring user-specified masks or annotations.
Details
Motivation: Existing personalized image generative models lack explicit control for composing objects from specific parts of multiple source images without user-specified masks or annotations.
Method: Built a dataset from 464 unique (part, subject) pairs (semantic atoms), generated 37k prompts, trained a custom diffusion prior model with part-conditional guidance to enforce semantic identity and spatial layout.
Result: Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality based on human evaluations and the proposed PartEval metric.
Conclusion: Chimera successfully enables personalized image generation with explicit control over object part composition from multiple source images using textual instructions.
Abstract: Personalized image generative models are highly proficient at synthesizing images from text or a single image, yet they lack explicit control for composing objects from specific parts of multiple source images without user specified masks or annotations. To address this, we introduce Chimera, a personalized image generation model that generates novel objects by combining specified parts from different source images according to textual instructions. To train our model, we first construct a dataset from a taxonomy built on 464 unique (part, subject) pairs, which we term semantic atoms. From this, we generate 37k prompts and synthesize the corresponding images with a high-fidelity text-to-image model. We train a custom diffusion prior model with part-conditional guidance, which steers the image-conditioning features to enforce both semantic identity and spatial layout. We also introduce an objective metric PartEval to assess the fidelity and compositional accuracy of generation pipelines. Human evaluations and our proposed metric show that Chimera outperforms other baselines by 14% in part alignment and compositional accuracy and 21% in visual quality.
[234] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
Main category: cs.CV
TL;DR: GAR is a region-level MLLM that addresses limitations in dense scene understanding by leveraging global contexts and modeling interactions between multiple regions, achieving advanced compositional reasoning.
Details
Motivation: Current MLLMs struggle with fine-grained analysis of complex scenes and object inter-relationships, while existing region-level approaches understand regions in isolation without considering global contexts.
Method: Uses RoI-aligned feature replay technique to support precise perception with global contexts, modeling interactions between multiple prompts, and achieving compositional reasoning through active dialogue.
Result: GAR-1B outperforms DAM-3B by +4.5 on DLC-Bench, surpasses InternVL3-78B on GAR-Bench-VQA, and GAR-8B outperforms VideoRefer-7B on VideoRefer-BenchQ in zero-shot settings.
Conclusion: GAR enables comprehensive region-level visual understanding with strong capabilities that transfer well to video domains, shifting from passive description to active dialogue paradigm.
Abstract: While Multimodal Large Language Models (MLLMs) excel at holistic understanding, they struggle in capturing the dense world with complex scenes, requiring fine-grained analysis of intricate details and object inter-relationships. Region-level MLLMs have been a promising step. However, previous attempts are generally optimized to understand given regions in isolation, neglecting crucial global contexts. To address this, we introduce Grasp Any Region (GAR) for comprehensive region-level visual understanding. Empowered by an effective RoI-aligned feature replay technique, GAR supports (1) precise perception by leveraging necessary global contexts, and (2) modeling interactions between multiple prompts. Together, it then naturally achieves (3) advanced compositional reasoning to answer specific free-form questions about any region, shifting the paradigm from passive description to active dialogue. Moreover, we construct GAR-Bench, which not only provides a more accurate evaluation of single-region comprehension, but also, more importantly, measures interactions and complex reasoning across multiple regions. Extensive experiments have demonstrated that GAR-1B not only maintains the state-of-the-art captioning capabilities, e.g., outperforming DAM-3B by +4.5 on DLC-Bench, but also excels at modeling relationships between multiple prompts with advanced comprehension capabilities, even surpassing InternVL3-78B on GAR-Bench-VQA. More importantly, our zero-shot GAR-8B even outperforms in-domain VideoRefer-7B on VideoRefer-BenchQ, indicating its strong capabilities can be easily transferred to videos.
[235] OmniNWM: Omniscient Driving Navigation World Models
Bohan Li, Zhuang Ma, Dalong Du, Baorui Peng, Zhujin Liang, Zhenqiang Liu, Chao Ma, Yueming Jin, Hao Zhao, Wenjun Zeng, Xin Jin
Main category: cs.CV
TL;DR: OmniNWM is a panoramic navigation world model that generates multi-modal panoramic videos with precise action control and occupancy-based rewards for autonomous driving.
Details
Motivation: Existing autonomous driving world models are limited in state modalities, sequence length, action precision, and reward awareness, which OmniNWM aims to overcome.
Method: Uses panoramic Plucker ray-map representation for precise action control, generates RGB, semantics, depth, and 3D occupancy jointly, and employs occupancy-based rule rewards.
Result: Achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability with reliable closed-loop evaluation.
Conclusion: OmniNWM provides a unified framework addressing state, action, and reward dimensions effectively for autonomous driving world modeling.
Abstract: Autonomous driving world models are expected to work effectively across three core dimensions: state, action, and reward. Existing models, however, are typically restricted to limited state modalities, short video sequences, imprecise action control, and a lack of reward awareness. In this paper, we introduce OmniNWM, an omniscient panoramic navigation world model that addresses all three dimensions within a unified framework. For state, OmniNWM jointly generates panoramic videos of RGB, semantics, metric depth, and 3D occupancy. A flexible forcing strategy enables high-quality long-horizon auto-regressive generation. For action, we introduce a normalized panoramic Plucker ray-map representation that encodes input trajectories into pixel-level signals, enabling highly precise and generalizable control over panoramic video generation. Regarding reward, we move beyond learning reward functions with external image-based models: instead, we leverage the generated 3D occupancy to directly define rule-based dense rewards for driving compliance and safety. Extensive experiments demonstrate that OmniNWM achieves state-of-the-art performance in video generation, control accuracy, and long-horizon stability, while providing a reliable closed-loop evaluation framework through occupancy-grounded rewards. Project page is available at https://github.com/Arlo0o/OmniNWM.
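A Plucker ray-map encodes, per pixel, the ray direction and its moment about the origin. The sketch below computes a generic pinhole version from intrinsics and a camera-to-world pose; OmniNWM's normalized panoramic variant and exact conventions are not reproduced here.

```python
import torch

def plucker_ray_map(K, cam_to_world, height, width):
    """Per-pixel Plucker ray map (direction, moment) for a pinhole camera.

    K: [3, 3] intrinsics, cam_to_world: [4, 4] pose. Returns a [6, H, W] tensor.
    """
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # [H, W, 3]
    dirs_cam = pix @ torch.linalg.inv(K).T                                 # back-project pixels
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T                                                  # rotate to world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = t.expand_as(dirs)                                             # camera centre per pixel
    moment = torch.cross(origin, dirs, dim=-1)                             # m = o x d
    return torch.cat([dirs, moment], dim=-1).permute(2, 0, 1)              # [6, H, W]
```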
[236] FeatureFool: Zero-Query Fooling of Video Models via Feature Map
Duoxun Tang, Xi Xiao, Guangwu Hu, Kangkang Sun, Xiao Yang, Dongyang Chen, Qing Li, Yongjie Yin, Jiyao Wang
Main category: cs.CV
TL;DR: FeatureFool is a zero-query black-box attack that uses DNN-extracted feature maps to alter video feature spaces, achieving high success rates against video classifiers and Video-LLMs without iterative queries.
Details
Motivation: Existing black-box attacks require multiple queries and interactions, which are impractical for real-world applications and don't scale well to Video-LLMs. No video-domain attacks directly leverage feature maps to manipulate clean-video feature spaces.
Method: FeatureFool performs zero-query attacks by directly exploiting information extracted from DNNs to alter the feature space of clean videos, using feature map transferability to craft adversarial content.
Result: Achieves >70% attack success rate against traditional video classifiers without queries, successfully bypasses Video-LLM recognition, and generates high-quality adversarial videos with excellent SSIM, PSNR, and Temporal-Inconsistency metrics.
Conclusion: FeatureFool demonstrates an efficient, stealthy zero-query attack approach that is unprecedented in the video domain, showing strong transferability and practical applicability while maintaining high perceptual quality.
Abstract: The vulnerability of deep neural networks (DNNs) has been preliminarily verified. Existing black-box adversarial attacks usually require multi-round interaction with the model and consume numerous queries, which is impractical in the real-world and hard to scale to recently emerged Video-LLMs. Moreover, no attack in the video domain directly leverages feature maps to shift the clean-video feature space. We therefore propose FeatureFool, a stealthy, video-domain, zero-query black-box attack that utilizes information extracted from a DNN to alter the feature space of clean videos. Unlike query-based methods that rely on iterative interaction, FeatureFool performs a zero-query attack by directly exploiting DNN-extracted information. This efficient approach is unprecedented in the video domain. Experiments show that FeatureFool achieves an attack success rate above 70% against traditional video classifiers without any queries. Benefiting from the transferability of the feature map, it can also craft harmful content and bypass Video-LLM recognition. Additionally, adversarial videos generated by FeatureFool exhibit high quality in terms of SSIM, PSNR, and Temporal-Inconsistency, making the attack barely perceptible. This paper may contain violent or explicit content.
[237] ScaleNet: Scaling up Pretrained Neural Networks with Incremental Parameters
Zhiwei Hao, Jianyuan Guo, Li Shen, Kai Han, Yehui Tang, Han Hu, Yunhe Wang
Main category: cs.CV
TL;DR: ScaleNet enables efficient scaling of vision transformers by inserting additional layers with weight sharing and adapter modules, achieving better performance with fewer training epochs compared to training from scratch.
Details
Motivation: Training larger vision transformer models is computationally intensive and costly. ScaleNet addresses this by providing a cost-effective way to scale up existing pretrained ViT models without the need for full retraining.
Method: ScaleNet inserts additional layers into pretrained ViTs using layer-wise weight sharing. To maintain parameter efficiency and avoid performance degradation, it introduces small adjustment parameters through parallel adapter modules for each shared layer.
Result: On ImageNet-1K, ScaleNet achieves 7.42% accuracy improvement over training from scratch with a 2× depth-scaled DeiT-Base model, while requiring only one-third of the training epochs. The method also shows promise in downstream tasks like object detection.
Conclusion: ScaleNet provides an efficient and cost-effective approach for scaling vision transformers, demonstrating significant performance improvements with reduced computational requirements, making it suitable for both image classification and downstream vision tasks.
Abstract: Recent advancements in vision transformers (ViTs) have demonstrated that larger models often achieve superior performance. However, training these models remains computationally intensive and costly. To address this challenge, we introduce ScaleNet, an efficient approach for scaling ViT models. Unlike conventional training from scratch, ScaleNet facilitates rapid model expansion with negligible increases in parameters, building on existing pretrained models. This offers a cost-effective solution for scaling up ViTs. Specifically, ScaleNet achieves model expansion by inserting additional layers into pretrained ViTs, utilizing layer-wise weight sharing to maintain parameter efficiency. Each added layer shares its parameter tensor with a corresponding layer from the pretrained model. To mitigate potential performance degradation due to shared weights, ScaleNet introduces a small set of adjustment parameters for each layer. These adjustment parameters are implemented through parallel adapter modules, ensuring that each instance of the shared parameter tensor remains distinct and optimized for its specific function. Experiments on the ImageNet-1K dataset demonstrate that ScaleNet enables efficient expansion of ViT models. With a 2× depth-scaled DeiT-Base model, ScaleNet achieves a 7.42% accuracy improvement over training from scratch while requiring only one-third of the training epochs, highlighting its efficiency in scaling ViTs. Beyond image classification, our method shows significant potential for application in downstream vision areas, as evidenced by the validation in the object detection task.
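The core mechanism (an inserted layer that reuses the parameter tensor of a pretrained layer and adds a small parallel adapter of its own) can be sketched as below. This is an illustrative reconstruction assuming PyTorch; the actual insertion schedule, adapter shape, and initialization in ScaleNet may differ.

```python
import torch
import torch.nn as nn

class SharedLayerWithAdapter(nn.Module):
    """An inserted layer that shares weights with a pretrained block plus a parallel adapter."""
    def __init__(self, pretrained_block, dim, bottleneck=64):
        super().__init__()
        self.shared_block = pretrained_block          # parameters shared with the original layer
        self.adapter = nn.Sequential(                 # small, layer-specific adjustment parameters
            nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim)
        )

    def forward(self, x):
        return self.shared_block(x) + self.adapter(x)  # parallel adapter added to shared output
```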
[238] ProCLIP: Progressive Vision-Language Alignment via LLM-based Embedder
Xiaoxing Hu, Kaicheng Yang, Ziyang Gong, Qi Ming, Zonghao Guo, Xiang An, Ziyong Feng, Junchi Yan, Xue Yang
Main category: cs.CV
TL;DR: ProCLIP is a curriculum learning framework that progressively aligns CLIP’s image encoder with LLM-based text embedders to handle long texts and multilingual inputs while preserving CLIP’s original vision-language alignment.
Details
Motivation: CLIP's text encoder has limitations: 77-token input limit, no multilingual support, and poor fine-grained semantic understanding. Direct alignment with LLMs disrupts CLIP's pre-trained vision-language alignment.
Method: ProCLIP uses curriculum learning: 1) Knowledge distillation from CLIP text encoder to LLM embedder, 2) Image-text contrastive tuning with self-distillation regularization, 3) Instance semantic alignment and embedding structure alignment losses.
Result: The method effectively aligns CLIP image encoder with LLM-based embedders while preserving CLIP’s original vision-language alignment knowledge.
Conclusion: ProCLIP successfully enables CLIP to handle long texts and multilingual inputs through progressive alignment without disrupting pre-trained vision-language knowledge.
Abstract: The original CLIP text encoder is limited by a maximum input length of 77 tokens, which hampers its ability to effectively process long texts and perform fine-grained semantic understanding. In addition, the CLIP text encoder lacks support for multilingual inputs. All these limitations significantly restrict its applicability across a broader range of tasks. Recent studies have attempted to replace the CLIP text encoder with an LLM-based embedder to enhance its ability in processing long texts, multilingual understanding, and fine-grained semantic comprehension. However, because the representation spaces of LLMs and the vision-language space of CLIP are pretrained independently without alignment priors, direct alignment using contrastive learning can disrupt the intrinsic vision-language alignment in the CLIP image encoder, leading to an underutilization of the knowledge acquired during pre-training. To address this challenge, we propose ProCLIP, a curriculum learning-based progressive vision-language alignment framework to effectively align the CLIP image encoder with an LLM-based embedder. Specifically, ProCLIP first distills knowledge from CLIP’s text encoder into the LLM-based embedder to leverage CLIP’s rich pretrained knowledge while establishing initial alignment between the LLM embedder and CLIP image encoder. Subsequently, ProCLIP further aligns the CLIP image encoder with the LLM-based embedder through image-text contrastive tuning, employing self-distillation regularization to avoid overfitting. To achieve a more effective alignment, instance semantic alignment loss and embedding structure alignment loss are employed during representation inheritance and contrastive tuning. The Code is available at https://github.com/VisionXLab/ProCLIP.
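A rough sketch of the first alignment stage: distill frozen CLIP text embeddings into the LLM-based embedder with an instance-level loss plus a pairwise-structure loss. `llm_embedder`, `clip_text_encoder`, and `proj` are assumed callables (hypothetical), and the two terms below only approximate the paper's instance semantic and embedding structure alignment objectives.

```python
import torch
import torch.nn.functional as F

def distillation_step(llm_embedder, clip_text_encoder, texts, proj):
    """One distillation step: pull projected LLM text embeddings toward frozen CLIP embeddings."""
    with torch.no_grad():
        teacher = clip_text_encoder(texts)                     # frozen CLIP text features [B, D]
        teacher = F.normalize(teacher, dim=-1)
    student = F.normalize(proj(llm_embedder(texts)), dim=-1)   # project LLM embeddings to CLIP dim
    # Instance-level alignment: cosine distance to the matching teacher embedding.
    instance_loss = (1.0 - (student * teacher).sum(dim=-1)).mean()
    # Structure alignment: match pairwise similarity matrices of the two embedding spaces.
    structure_loss = F.mse_loss(student @ student.T, teacher @ teacher.T)
    return instance_loss + structure_loss
```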
[239] SAM 2++: Tracking Anything at Any Granularity
Jiaming Zhang, Cheng Liang, Yichun Yang, Chenkai Zeng, Yutao Cui, Xinwen Zhang, Xin Zhou, Kai Ma, Gangshan Wu, Limin Wang
Main category: cs.CV
TL;DR: SAM 2++ is a unified video tracking model that handles tracking at any granularity (masks, boxes, points) using task-specific prompts, unified decoder, and task-adaptive memory mechanism, achieving state-of-the-art performance across diverse tracking tasks.
Details
Motivation: Existing trackers are tailored to single tasks with custom-designed modules, limiting generalization and causing redundancy in model design and parameters. There's a need for a unified approach to handle tracking at different granularities.
Method: 1) Task-specific prompts to encode various inputs into general prompt embeddings, 2) Unified decoder to unify diverse task results, 3) Task-adaptive memory mechanism for cross-granularity memory matching, 4) Customized data engine producing Tracking-Any-Granularity dataset with rich annotations.
Result: Comprehensive experiments show SAM 2++ sets new state-of-the-art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
Conclusion: SAM 2++ successfully unifies video tracking tasks across different granularities, overcoming limitations of task-specific trackers and providing a comprehensive solution for tracking at any granularity.
Abstract: Video tracking aims at finding the specific target in subsequent frames given its initial state. Due to the varying granularity of target states across different tasks, most existing trackers are tailored to a single task and heavily rely on custom-designed modules within the individual task, which limits their generalization and leads to redundancy in both model design and parameters. To unify video tracking tasks, we present SAM 2++, a unified model towards tracking at any granularity, including masks, boxes, and points. First, to extend target granularity, we design task-specific prompts to encode various task inputs into general prompt embeddings, and a unified decoder to unify diverse task results into a unified form pre-output. Next, to satisfy memory matching, the core operation of tracking, we introduce a task-adaptive memory mechanism that unifies memory across different granularities. Finally, we introduce a customized data engine to support tracking training at any granularity, producing a large and diverse video tracking dataset with rich annotations at three granularities, termed Tracking-Any-Granularity, which represents a comprehensive resource for training and benchmarking on unified tracking. Comprehensive experiments on multiple benchmarks confirm that SAM 2++ sets a new state of the art across diverse tracking tasks at different granularities, establishing a unified and robust tracking framework.
cs.AI
[240] Test-time Verification via Optimal Transport: Coverage, ROC, & Sub-optimality
Arpan Mukherjee, Marcello Bullo, Debabrota Basu, Deniz Gündüz
Main category: cs.AI
TL;DR: The paper analyzes test-time scaling with verification in LLMs, revealing three regimes in the sub-optimality-coverage relationship and proposing transport-based framework to understand verifier-generator interactions.
Details
Motivation: To understand the underexplored role of verifiers and their imperfections in test-time scaling, and provide a unified framework for quantifying the interplay between generator coverage, verifier's region of convergence, and sampling sub-optimality.Method: Framed verifiable test-time scaling as a transport problem, characterized interactions between coverage, ROC, and sub-optimality, and proposed/analyzed sequential and batched sampling algorithms.
Result: Identified three regimes in sub-optimality-coverage curve: transport regime (sub-optimality increases with coverage), policy improvement regime (sub-optimality may decrease with coverage depending on ROC), and saturation regime (sub-optimality plateaus). Empirical results with Qwen, Llama, and Gemma models support findings.
Conclusion: The transport framework successfully characterizes the geometric interplay between verifier imperfections, generator coverage, and sampling algorithms, revealing distinct operational regimes that inform practical implementation of test-time scaling with verification.
Abstract: While test-time scaling with verification has shown promise in improving the performance of large language models (LLMs), the role of the verifier and its imperfections remain underexplored. The effect of verification manifests through interactions of three quantities: (i) the generator’s coverage, (ii) the verifier’s region of convergence (ROC), and (iii) the sampling algorithm’s sub-optimality. Though recent studies capture subsets of these factors, a unified framework quantifying the geometry of their interplay is missing. We frame verifiable test-time scaling as a transport problem. This characterizes the interaction of coverage, ROC, and sub-optimality, and uncovers that the sub-optimality–coverage curve exhibits three regimes. A transport regime – where sub-optimality increases with coverage, a policy improvement regime – where sub-optimality may decrease with coverage, depending on the verifier’s ROC, and a saturation regime – where sub-optimality plateaus, unaffected by coverage. We further propose and analyze two classes of sampling algorithms – sequential and batched, and examine how their computational complexities shape these trade-offs. Empirical results with Qwen, Llama, and Gemma models corroborate our theoretical findings.
[241] Timely Clinical Diagnosis through Active Test Selection
Silas Ruhrberg Estévez, Nicolás Astorga, Mihaela van der Schaar
Main category: cs.AI
TL;DR: ACTMED is a diagnostic framework that combines Bayesian Experimental Design with LLMs to optimize clinical test selection, reducing diagnostic uncertainty while maintaining clinician oversight.
Details
Motivation: Current ML approaches for clinical diagnosis fail to capture the sequential, resource-aware reasoning used by clinicians in practice, especially in high-pressure or resource-limited settings.Method: Integrates Bayesian Experimental Design with large language models to select tests that maximize reduction in diagnostic uncertainty. LLMs act as flexible simulators for patient state distributions and belief updates without requiring structured training data.
Result: ACTMED optimizes test selection to improve diagnostic accuracy, interpretability, and resource use on real-world datasets.
Conclusion: The framework represents progress toward transparent, adaptive, clinician-aligned diagnostic systems that generalize across settings with reduced need for domain-specific data.
Abstract: There is growing interest in using machine learning (ML) to support clinical diagnosis, but most approaches rely on static, fully observed datasets and fail to reflect the sequential, resource-aware reasoning clinicians use in practice. Diagnosis remains complex and error prone, especially in high-pressure or resource-limited settings, underscoring the need for frameworks that help clinicians make timely and cost-effective decisions. We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design), a diagnostic framework that integrates Bayesian Experimental Design (BED) with large language models (LLMs) to better emulate real-world diagnostic reasoning. At each step, ACTMED selects the test expected to yield the greatest reduction in diagnostic uncertainty for a given patient. LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data. Clinicians can remain in the loop, reviewing test suggestions, interpreting intermediate outputs, and applying clinical judgment throughout. We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use. This represents a step toward transparent, adaptive, and clinician-aligned diagnostic systems that generalize across settings with reduced reliance on domain-specific data.
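The core step, picking the test expected to most reduce diagnostic uncertainty, can be illustrated as a greedy expected-information-gain computation. The sketch below assumes hypothetical `simulate_outcomes` and `update_belief` callables standing in for the LLM-based simulator and belief update; it is not the paper's implementation.

```python
import math

def entropy(belief):
    """Shannon entropy of a diagnosis belief {diagnosis: probability}."""
    return -sum(p * math.log(p) for p in belief.values() if p > 0)

def expected_information_gain(belief, test, simulate_outcomes, update_belief, n_samples=32):
    """Estimate how much a candidate test is expected to reduce diagnostic entropy.
    `simulate_outcomes` and `update_belief` are hypothetical interfaces standing in
    for the LLM-based patient-state simulator and belief update."""
    prior_h = entropy(belief)
    sampled = [simulate_outcomes(belief, test) for _ in range(n_samples)]
    posterior_h = sum(entropy(update_belief(belief, test, o)) for o in sampled) / n_samples
    return prior_h - posterior_h

def select_next_test(belief, candidate_tests, simulate_outcomes, update_belief):
    """Greedy Bayesian-experimental-design step: pick the test with the highest gain."""
    return max(candidate_tests,
               key=lambda t: expected_information_gain(belief, t, simulate_outcomes, update_belief))
```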
[242] The MUSE Benchmark: Probing Music Perception and Auditory Relational Reasoning in Audio LLMS
Brandon James Carone, Iran R. Roman, Pablo Ripollés
Main category: cs.AI
TL;DR: The MUSE Benchmark evaluates multimodal LLMs on music perception skills, revealing significant gaps between SOTA models and human experts, with inconsistent performance across models and detrimental effects from Chain-of-Thought prompting.
Details
Motivation: Current evaluations of MLLMs' audio understanding capabilities may obscure fundamental weaknesses in relational reasoning, particularly in music perception.Method: Developed the MUSE Benchmark with 10 tasks to probe fundamental music perception skills, evaluated four SOTA models against a large human baseline (N=200).
Result: Wide variance in SOTA capabilities with persistent gap from human experts. Gemini Pro succeeds on basic perception, while Qwen and Audio Flamingo 3 perform at/near chance, exposing severe perceptual deficits. Chain-of-Thought prompting provides inconsistent, often detrimental results.
Conclusion: The MUSE Benchmark provides a critical tool for evaluating invariant musical representations and driving development of more robust AI systems.
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated capabilities in audio understanding, but current evaluations may obscure fundamental weaknesses in relational reasoning. We introduce the Music Understanding and Structural Evaluation (MUSE) Benchmark, an open-source resource with 10 tasks designed to probe fundamental music perception skills. We evaluate four SOTA models (Gemini Pro and Flash, Qwen2.5-Omni, and Audio-Flamingo 3) against a large human baseline (N=200). Our results reveal a wide variance in SOTA capabilities and a persistent gap with human experts. While Gemini Pro succeeds on basic perception, Qwen and Audio Flamingo 3 perform at or near chance, exposing severe perceptual deficits. Furthermore, we find Chain-of-Thought (CoT) prompting provides inconsistent, often detrimental results. Our work provides a critical tool for evaluating invariant musical representations and driving development of more robust AI systems.
[243] Rectifying Shortcut Behaviors in Preference-based Reward Learning
Wenqian Ye, Guangtao Zheng, Aidong Zhang
Main category: cs.AI
TL;DR: PRISM addresses reward hacking in preference-based reward models by learning group-invariant kernels to mitigate shortcut behaviors, improving generalization and reducing dependency on spurious features.
Details
Motivation: Preference-based reward models in RLHF are prone to reward hacking and poor generalization due to over-optimization, exploiting spurious features like response verbosity or sycophancy rather than genuine human preferences.Method: Proposed PRISM (Preference-based Reward Invariance for Shortcut Mitigation), which learns group-invariant kernels with feature maps using a closed-form learning objective inspired by invariant theory in kernel perspective.
Result: Experimental results show PRISM consistently improves reward model accuracy on diverse out-of-distribution tasks and reduces shortcut dependency in downstream policy models.
Conclusion: PRISM establishes a robust framework for preference-based alignment by effectively mitigating shortcut behaviors in reward learning.
Abstract: In reinforcement learning from human feedback, preference-based reward models play a central role in aligning large language models to human-aligned behavior. However, recent studies show that these models are prone to reward hacking and often fail to generalize well due to over-optimization. They achieve high reward scores by exploiting shortcuts, that is, exploiting spurious features (e.g., response verbosity, agreeable tone, or sycophancy) that correlate with human preference labels in the training data rather than genuinely reflecting the intended objectives. In this paper, instead of probing these issues one at a time, we take a broader view of the reward hacking problem as shortcut behaviors and introduce a principled yet flexible approach to mitigate shortcut behaviors in preference-based reward learning. Inspired by the invariant theory in the kernel perspective, we propose Preference-based Reward Invariance for Shortcut Mitigation (PRISM), which learns group-invariant kernels with feature maps in a closed-form learning objective. Experimental results in several benchmarks show that our method consistently improves the accuracy of the reward model on diverse out-of-distribution tasks and reduces the dependency on shortcuts in downstream policy models, establishing a robust framework for preference-based alignment.
[244] A Multi-faceted Analysis of Cognitive Abilities: Evaluating Prompt Methods with Large Language Models on the CONSORT Checklist
Sohyeon Jeon, Hyung-Chul Lee
Main category: cs.AI
TL;DR: LLMs show varying cognitive strategies and limitations in assessing clinical trial reporting against CONSORT standards, with differences in reasoning style and uncertainty expression based on prompt conditions.
Details
Motivation: To evaluate LLMs' ability to assess clinical trial reporting according to CONSORT standards and understand their cognitive and reasoning strategies in healthcare applications.Method: Applied behavioral and metacognitive analysis with expert-validated data, systematically comparing two representative LLMs under three different prompt conditions.
Result: Clear differences emerged in how models approached CONSORT items and prompt types, showing variations in reasoning style, explicit uncertainty expression, and alternative interpretations that shaped response patterns.
Conclusion: Current LLMs have limitations in clinical compliance automation, highlighting the need to understand their cognitive adaptations and strategic behavior for developing more explainable and reliable medical AI.
Abstract: Despite the rapid expansion of Large Language Models (LLMs) in healthcare, the ability of these systems to assess clinical trial reporting according to CONSORT standards remains unclear, particularly with respect to their cognitive and reasoning strategies. This study applies a behavioral and metacognitive analytic approach with expert-validated data, systematically comparing two representative LLMs under three prompt conditions. Clear differences emerged in how the models approached various CONSORT items, and prompt types, including shifts in reasoning style, explicit uncertainty, and alternative interpretations shaped response patterns. Our results highlight the current limitations of these systems in clinical compliance automation and underscore the importance of understanding their cognitive adaptations and strategic behavior in developing more explainable and reliable medical AI.
[245] The Zero-Step Thinking: An Empirical Study of Mode Selection as Harder Early Exit in Reasoning Models
Yuqiao Tan, Shizhu He, Kang Liu, Jun Zhao
Main category: cs.AI
TL;DR: Mode Selection aims to reduce computational overhead in reasoning models by choosing between Long-CoT (Thinking) or Short-CoT (No-Thinking) modes at the start of reasoning, using zero-step thinking without explicit reasoning process.
Details
Motivation: Reasoning models often overthink during step-by-step reasoning, causing unnecessary computational overhead. Mode Selection addresses this by automatically deciding the optimal reasoning approach upfront.Method: Identifies Mode Selection as a challenging variant of Early Exit problem. Evaluates nine baselines including prompt-based approaches and methods leveraging internal model information. Uses zero-step thinking with pre-defined fake thoughts for decision making.
Result: Prompt-based approaches fail due to limited classification capabilities with minimal hand-crafted information. Internal information approaches perform better but suffer from stability issues. Existing methods are insufficient for effective Mode Selection in limited information scenarios.
Conclusion: Mode Selection remains challenging as it requires making optimal reasoning decisions at the beginning without explicit reasoning process. Current approaches relying solely on model information are inadequate, highlighting ongoing research challenges.
Abstract: Reasoning models have demonstrated exceptional performance in tasks such as mathematics and logical reasoning, primarily due to their ability to engage in step-by-step thinking during the reasoning process. However, this often leads to overthinking, resulting in unnecessary computational overhead. To address this issue, Mode Selection aims to automatically decide between Long-CoT (Chain-of-Thought) or Short-CoT by utilizing either a Thinking or NoThinking mode. Simultaneously, Early Exit determines the optimal stopping point during the iterative reasoning process. Both methods seek to reduce the computational burden. In this paper, we first identify Mode Selection as a more challenging variant of the Early Exit problem, as they share similar objectives but differ in decision timing. While Early Exit focuses on determining the best stopping point for concise reasoning at inference time, Mode Selection must make this decision at the beginning of the reasoning process, relying on pre-defined fake thoughts without engaging in an explicit reasoning process, referred to as zero-step thinking. Through empirical studies on nine baselines, we observe that prompt-based approaches often fail due to their limited classification capabilities when provided with minimal hand-crafted information. In contrast, approaches that leverage internal information generally perform better across most scenarios but still exhibit issues with stability. Our findings indicate that existing methods relying solely on the information provided by models are insufficient for effectively addressing Mode Selection in scenarios with limited information, highlighting the ongoing challenges of this task. Our code is available at https://github.com/Trae1ounG/Zero_Step_Thinking.
[246] WebGraphEval: Multi-Turn Trajectory Evaluation for Web Agents using Graph Representation
Yaoyao Qian, Yuanli Wang, Jinda Zhang, Yun Zong, Meixu Chen, Hanhan Zhou, Jindan Huang, Yifan Zeng, Xinyu Hu, Chan Hee Song, Danqing Zhang
Main category: cs.AI
TL;DR: WebGraphEval is a framework that abstracts web agent trajectories into weighted action graphs for comprehensive evaluation, capturing structural diversity and efficiency beyond binary success metrics.
Details
Motivation: Current web agent evaluation uses binary success metrics or single reference trajectories, ignoring the structural diversity in benchmark datasets and failing to capture cross-model regularities and efficiency.Method: Abstracts trajectories from multiple agents into unified weighted action graphs, canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics.
Result: Evaluations across thousands of trajectories from six web agents show the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics.
Conclusion: WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents by framing web interaction as graph-structured data.
Abstract: Current evaluation of web agents largely reduces to binary success metrics or conformity to a single reference trajectory, ignoring the structural diversity present in benchmark datasets. We present WebGraphEval, a framework that abstracts trajectories from multiple agents into a unified, weighted action graph. This representation is directly compatible with benchmarks such as WebArena, leveraging leaderboard runs and newly collected trajectories without modifying environments. The framework canonically encodes actions, merges recurring behaviors, and applies structural analyses including reward propagation and success-weighted edge statistics. Evaluations across thousands of trajectories from six web agents show that the graph abstraction captures cross-model regularities, highlights redundancy and inefficiency, and identifies critical decision points overlooked by outcome-based metrics. By framing web interaction as graph-structured data, WebGraphEval establishes a general methodology for multi-path, cross-agent, and efficiency-aware evaluation of web agents.
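One plausible reading of the graph abstraction step: canonically encode each action, merge recurring transitions across trajectories, and keep success-weighted edge statistics. The sketch below assumes a toy trajectory format (`actions`, `success`) and a simplified canonical encoding; both are illustrative stand-ins, not the paper's exact scheme.

```python
from collections import defaultdict

def canonical(action):
    """Canonically encode an action so recurring behaviors merge into one node
    (here just the action type and target; a stand-in for the paper's encoding)."""
    return (action["type"], action.get("target", ""))

def build_action_graph(trajectories):
    """Merge trajectories from many agents into one weighted action graph.
    Edge weights count traversals; success counts support success-weighted statistics."""
    edges = defaultdict(lambda: {"count": 0, "successes": 0})
    for traj in trajectories:                       # traj: {"actions": [...], "success": bool}
        nodes = [canonical(a) for a in traj["actions"]]
        for src, dst in zip(nodes, nodes[1:]):
            e = edges[(src, dst)]
            e["count"] += 1
            e["successes"] += int(traj["success"])
    return {k: {**v, "success_rate": v["successes"] / v["count"]} for k, v in edges.items()}
```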
[247] ChatGPT Unveils Its Limits: Principles of Law Deliver Checkmate
Marianna Molinari, Ilaria Angela Amantea, Marinella Quaranta, Guido Governatori
Main category: cs.AI
TL;DR: ChatGPT underperforms in legal document analysis compared to regex baselines, revealing limitations in assembling knowledge and reasoning for comprehensive solutions.
Details
Motivation: To assess ChatGPT's performance in legal domain tasks, specifically extracting principles of law from legal decisions, and compare against regex baselines rather than just human performance.Method: Conducted experiments comparing ChatGPT’s performance against regular expressions (Regex) in extracting key legal passages and principles of law from legal decisions.
Result: ChatGPT failed to assemble necessary knowledge and reasoning capabilities to provide exhaustive results, performing worse than regex baselines in legal document analysis tasks.
Conclusion: ChatGPT has major limitations in legal reasoning and comprehensive understanding, suggesting genuine intelligence remains uniquely human in the legal domain.
Abstract: This study examines the performance of ChatGPT with an experiment in the legal domain. We compare the outcome against a baseline using regular expressions (Regex), rather than focusing solely on the assessment against human performance. The study reveals that even if ChatGPT has access to the necessary knowledge and competencies, it is unable to assemble them and reason through them in a way that leads to an exhaustive result. This unveils a major limitation of ChatGPT. Intelligence encompasses the ability to break down complex issues and address them according to multiple required competencies, providing a unified and comprehensive solution. In the legal domain, one of the most crucial tasks is reading legal decisions and extracting key passages condensed from principles of law (PoLs), which are then incorporated into subsequent rulings by judges or defense documents by lawyers. In performing this task, artificial intelligence lacks an all-encompassing understanding and reasoning, which makes it inherently limited. Genuine intelligence remains a uniquely human trait, at least in this particular field.
[248] An Argumentative Explanation Framework for Generalized Reason Model with Inconsistent Precedents
Wachara Fungwacharakorn, Gauvain Bourgne, Ken Satoh
Main category: cs.AI
TL;DR: Extends derivation state argumentation framework to explain reasoning with inconsistent precedents in case-based reasoning.
Details
Motivation: No argumentative explanation methods exist for generalized reason models that handle inconsistent precedents, unlike traditional consistent models.Method: Extends the derivation state argumentation framework (DSA-framework) to accommodate the generalized reason model.
Result: Develops an argumentative explanation approach for reasoning with inconsistent precedents.
Conclusion: Successfully addresses the gap in argumentative explanation methods for generalized reason models dealing with inconsistent precedents.
Abstract: Precedential constraint is one foundation of case-based reasoning in AI and Law. It generally assumes that the underlying set of precedents must be consistent. To relax this assumption, a generalized notion of the reason model has been introduced. While several argumentative explanation approaches exist for reasoning with precedents based on the traditional consistent reason model, there has been no corresponding argumentative explanation method developed for this generalized reasoning framework accommodating inconsistent precedents. To address this question, this paper examines an extension of the derivation state argumentation framework (DSA-framework) to explain the reasoning according to the generalized notion of the reason model.
[249] Learning to Make Friends: Coaching LLM Agents toward Emergent Social Ties
Philipp J. Schneider, Lin Tian, Marian-Andrei Rizoiu
Main category: cs.AI
TL;DR: LLM agents can reproduce human social dynamics through behavioral reward functions and in-context learning, forming emergent network structures similar to real online communities.
Details
Motivation: To understand if LLM agents can replicate complex human social dynamics like homophily, reciprocity, and social validation, and what mechanisms enable such dynamics to emerge.Method: Multi-agent LLM simulation framework with repeated interactions, behavioral reward functions capturing core online engagement drivers (social interaction, information seeking, self-presentation, coordination, emotional support), and in-context learning accelerated by coaching signals.
Result: Coached LLM agents develop stable interaction patterns and form emergent social ties, creating network structures that mirror properties of real online communities.
Conclusion: The framework provides a principled testbed for studying collective dynamics in LLM populations and reveals how artificial agents can approximate or diverge from human-like social behavior.
Abstract: Can large language model (LLM) agents reproduce the complex social dynamics that characterize human online behavior – shaped by homophily, reciprocity, and social validation – and what memory and learning mechanisms enable such dynamics to emerge? We present a multi-agent LLM simulation framework in which agents repeatedly interact, evaluate one another, and adapt their behavior through in-context learning accelerated by a coaching signal. To model human social behavior, we design behavioral reward functions that capture core drivers of online engagement, including social interaction, information seeking, self-presentation, coordination, and emotional support. These rewards align agent objectives with empirically observed user motivations, enabling the study of how network structures and group formations emerge from individual decision-making. Our experiments show that coached LLM agents develop stable interaction patterns and form emergent social ties, yielding network structures that mirror properties of real online communities. By combining behavioral rewards with in-context adaptation, our framework establishes a principled testbed for investigating collective dynamics in LLM populations and reveals how artificial agents may approximate or diverge from human-like social behavior.
[250] Continual Knowledge Adaptation for Reinforcement Learning
Jinwu Hu, Zihao Lian, Zhiquan Wen, Chenghao Li, Guohao Chen, Xutao Wen, Bin Xiao, Mingkui Tan
Main category: cs.AI
TL;DR: CKA-RL is a Continual Reinforcement Learning method that uses task-specific knowledge vectors and adaptive merging to prevent catastrophic forgetting and enable efficient knowledge transfer across non-stationary environments.
Details
Motivation: Real-world environments are non-stationary, requiring continuous adaptation. Existing Continual RL methods suffer from catastrophic forgetting and inefficient knowledge utilization.Method: Proposes Continual Knowledge Adaptation with task-specific knowledge vector pool and Adaptive Knowledge Merging mechanism to combine similar vectors for scalability.
Result: Outperforms state-of-the-art methods with 4.20% overall performance improvement and 8.02% forward transfer improvement on three benchmarks.
Conclusion: CKA-RL effectively addresses catastrophic forgetting and enables efficient knowledge transfer in continual reinforcement learning through knowledge adaptation and merging strategies.
Abstract: Reinforcement Learning enables agents to learn optimal behaviors through interactions with environments. However, real-world environments are typically non-stationary, requiring agents to continuously adapt to new tasks and changing conditions. Although Continual Reinforcement Learning facilitates learning across multiple tasks, existing methods often suffer from catastrophic forgetting and inefficient knowledge utilization. To address these challenges, we propose Continual Knowledge Adaptation for Reinforcement Learning (CKA-RL), which enables the accumulation and effective utilization of historical knowledge. Specifically, we introduce a Continual Knowledge Adaptation strategy, which involves maintaining a task-specific knowledge vector pool and dynamically using historical knowledge to adapt the agent to new tasks. This process mitigates catastrophic forgetting and enables efficient knowledge transfer across tasks by preserving and adapting critical model parameters. Additionally, we propose an Adaptive Knowledge Merging mechanism that combines similar knowledge vectors to address scalability challenges, reducing memory requirements while ensuring the retention of essential knowledge. Experiments on three benchmarks demonstrate that the proposed CKA-RL outperforms state-of-the-art methods, achieving an improvement of 4.20% in overall performance and 8.02% in forward transfer. The source code is available at https://github.com/Fhujinwu/CKA-RL.
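A minimal sketch of how a task-specific knowledge vector pool with adaptive merging could work: a new vector that is highly similar (by cosine similarity) to an existing entry is averaged into it, otherwise it is stored separately. The threshold and averaging rule are assumptions for illustration, not the paper's mechanism.

```python
import torch
import torch.nn.functional as F

class KnowledgeVectorPool:
    """Toy knowledge vector pool with adaptive merging of similar vectors."""

    def __init__(self, merge_threshold=0.95):
        self.vectors = []                    # list of (unit vector, merge count) pairs
        self.merge_threshold = merge_threshold

    def add(self, vec):
        vec = F.normalize(vec, dim=0)
        for i, (v, n) in enumerate(self.vectors):
            if torch.dot(v, vec).item() >= self.merge_threshold:
                merged = F.normalize((v * n + vec) / (n + 1), dim=0)
                self.vectors[i] = (merged, n + 1)   # merge similar knowledge, cap memory
                return
        self.vectors.append((vec, 1))               # otherwise keep a new task vector
```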
[251] MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai
Main category: cs.AI
TL;DR: MSC-Bench is a benchmark for evaluating multi-hop tool orchestration by LLM agents in hierarchical MCP ecosystems, addressing gaps in existing benchmarks through equal function sets and systematic curriculum testing.
Details
Motivation: Existing benchmarks evaluate tools in isolation, ignoring functional overlap and cross-server orchestration challenges, leading to overly optimistic assessments.Method: Constructs ground truth through 'equal function sets' and organizes evaluation as a five-level curriculum testing capabilities from single-tool orchestration to complex cross-server planning and robustness to out-of-scope requests.
Result: Experiments show rigid hierarchies hinder performance without co-designed strategies, and state-of-the-art agents exhibit systemic weaknesses in robustness.
Conclusion: MSC-Bench provides a diagnostic framework to expose limitations and guide development of more capable and efficient tool-using agents, with publicly available benchmark resources.
Abstract: We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents in a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents. The benchmark and resources are publicly available at https://github.com/snooow1029/MSC_Bench.
[252] NeSyPr: Neurosymbolic Proceduralization For Efficient Embodied Reasoning
Wonje Choi, Jooyoung Kim, Honguk Woo
Main category: cs.AI
TL;DR: NeSyPr is a neurosymbolic proceduralization framework that compiles symbolic plans into composable procedures for language models, enabling efficient embodied reasoning without external symbolic tools at test time.
Details
Motivation: To address latency, connectivity, and resource limitations when using language models for embodied tasks in dynamic environments where online access to large-scale inference engines or symbolic planners is constrained.Method: First generates task-specific plans using symbolic tools with declarative knowledge, then transforms these plans into composable procedural representations that encode implicit production rules, allowing seamless integration into LM inference.
Result: Evaluated on PDDLGym, VirtualHome, and ALFWorld benchmarks, demonstrating efficient reasoning capabilities over large-scale reasoning models and symbolic planners while using more compact LMs.
Conclusion: NeSyPr enables efficient test-time inference without relying on external symbolic guidance, making it suitable for deployment in latency-sensitive and resource-constrained physical systems.
Abstract: We address the challenge of adopting language models (LMs) for embodied tasks in dynamic environments, where online access to large-scale inference engines or symbolic planners is constrained due to latency, connectivity, and resource limitations. To this end, we present NeSyPr, a novel embodied reasoning framework that compiles knowledge via neurosymbolic proceduralization, thereby equipping LM-based agents with structured, adaptive, and timely reasoning capabilities. In NeSyPr, task-specific plans are first explicitly generated by a symbolic tool leveraging its declarative knowledge. These plans are then transformed into composable procedural representations that encode the plans’ implicit production rules, enabling the resulting composed procedures to be seamlessly integrated into the LM’s inference process. This neurosymbolic proceduralization abstracts and generalizes multi-step symbolic structured path-finding and reasoning into single-step LM inference, akin to human knowledge compilation. It supports efficient test-time inference without relying on external symbolic guidance, making it well suited for deployment in latency-sensitive and resource-constrained physical systems. We evaluate NeSyPr on the embodied benchmarks PDDLGym, VirtualHome, and ALFWorld, demonstrating its efficient reasoning capabilities over large-scale reasoning models and a symbolic planner, while using more compact LMs.
[253] HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application
Yiqian Yang, Tian Lan, Qianghuai Jia, Li Zhu, Hui Jiang, Hang Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang
Main category: cs.AI
TL;DR: HSCodeComp is a new benchmark for evaluating deep search agents’ ability to apply hierarchical rules like tariff codes, showing current agents perform poorly (46.8% accuracy) compared to humans (95.0%).
Details
Motivation: Current agent benchmarks overlook the critical capability of applying complex rules with vague boundaries and implicit logic, which is essential for real-world applications like legal, medical, and customs domains.Method: Created HSCodeComp benchmark using real-world e-commerce data with 632 product entries, requiring agents to predict 10-digit Harmonized System Codes from noisy product descriptions using hierarchical rule application.
Result: Best agent achieved only 46.8% 10-digit accuracy, far below human expert performance of 95.0%. Test-time scaling failed to improve performance, highlighting the challenge of hierarchical rule application.
Conclusion: There is a significant performance gap between current deep search agents and human experts in hierarchical rule application, indicating the need for improved agent capabilities in handling complex, implicit rule systems.
Abstract: Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules, such as legal clauses, medical manuals, and tariff rules. These rules often feature vague boundaries and implicit logic relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict the 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from large-scale e-commerce platforms, our proposed HSCodeComp comprises 632 product entries spanning diverse product categories, with these HSCodes annotated by several human experts. Extensive experimental results on several state-of-the-art LLMs, open-source, and closed-source agents reveal a huge performance gap: the best agent achieves only 46.8% 10-digit accuracy, far below human experts at 95.0%. Besides, detailed analysis demonstrates the challenges of hierarchical rule application, and test-time scaling fails to improve performance further.
[254] DAIL: Beyond Task Ambiguity for Language-Conditioned Reinforcement Learning
Runpeng Xie, Quanwei Wang, Hao Hu, Zherui Zhou, Ni Mu, Xiyun Li, Yiqin Yang, Shuang Xu, Qianchuan Zhao, Bo XU
Main category: cs.AI
TL;DR: DAIL (Distributional Aligned Learning) addresses linguistic instruction ambiguity in intelligent agents through distributional policy and semantic alignment, improving task performance.
Details
Motivation: Natural language instructions are flexible but ambiguous, which degrades performance in language-conditioned tasks for intelligent agents.Method: DAIL uses distributional policy for value distribution estimation and semantic alignment to connect trajectories with linguistic instructions.
Result: Extensive experiments on structured and visual benchmarks show DAIL effectively resolves instruction ambiguities and outperforms baseline methods.
Conclusion: DAIL successfully addresses linguistic instruction ambiguity through distributional alignment, achieving superior performance in language-conditioned tasks.
Abstract: Comprehending natural language and following human instructions are critical capabilities for intelligent agents. However, the flexibility of linguistic instructions induces substantial ambiguity across language-conditioned tasks, severely degrading algorithmic performance. To address these limitations, we present a novel method named DAIL (Distributional Aligned Learning), featuring two key components: distributional policy and semantic alignment. Specifically, we provide theoretical results that the value distribution estimation mechanism enhances task differentiability. Meanwhile, the semantic alignment module captures the correspondence between trajectories and linguistic instructions. Extensive experimental results on both structured and visual observation benchmarks demonstrate that DAIL effectively resolves instruction ambiguities, achieving superior performance to baseline methods. Our implementation is available at https://github.com/RunpengXie/Distributional-Aligned-Learning.
[255] AgentSense: LLMs Empower Generalizable and Explainable Web-Based Participatory Urban Sensing
Xusen Guo, Mingxing Peng, Xixuan Hao, Xingchen Zou, Qiongyan Wang, Sijie Ruan, Yuxuan Liang
Main category: cs.AI
TL;DR: AgentSense is a training-free framework that integrates LLMs into participatory urban sensing using a multi-agent evolution system to adapt sensing tasks to dynamic urban conditions and worker preferences while providing natural language explanations.
Details
Motivation: Existing urban sensing systems have limited generalization across diverse urban scenarios and poor interpretability in decision-making, which hinders their effectiveness in modern urban management.Method: A hybrid framework that initially uses classical planners to generate baseline solutions, then iteratively refines them through a multi-agent evolution system to adapt to dynamic urban conditions and heterogeneous worker preferences while producing natural language explanations.
Result: Extensive experiments show AgentSense outperforms traditional methods in adaptivity and explainability, and beats single-agent LLM baselines in both performance and robustness while providing more reasonable and transparent explanations.
Conclusion: AgentSense represents a significant advancement towards deploying adaptive and explainable urban sensing systems on the web, addressing key limitations of existing approaches.
Abstract: Web-based participatory urban sensing has emerged as a vital approach for modern urban management by leveraging mobile individuals as distributed sensors. However, existing urban sensing systems struggle with limited generalization across diverse urban scenarios and poor interpretability in decision-making. In this work, we introduce AgentSense, a hybrid, training-free framework that integrates large language models (LLMs) into participatory urban sensing through a multi-agent evolution system. AgentSense initially employs classical planner to generate baseline solutions and then iteratively refines them to adapt sensing task assignments to dynamic urban conditions and heterogeneous worker preferences, while producing natural language explanations that enhance transparency and trust. Extensive experiments across two large-scale mobility datasets and seven types of dynamic disturbances demonstrate that AgentSense offers distinct advantages in adaptivity and explainability over traditional methods. Furthermore, compared to single-agent LLM baselines, our approach outperforms in both performance and robustness, while delivering more reasonable and transparent explanations. These results position AgentSense as a significant advancement towards deploying adaptive and explainable urban sensing systems on the web.
[256] A Graph Engine for Guitar Chord-Tone Soloing Education
Matthew Keating, Michael Casey
Main category: cs.AI
TL;DR: A graph-based engine for generating chord tone soloing suggestions for guitar students, using weighted graphs to find optimal transitions between chord arpeggios.
Details
Motivation: Chord tone soloing is fundamental for jazz guitar improvisation but difficult to learn and practice, requiring systematic guidance.Method: Generate chord-tone arpeggios, construct weighted graph with nodes as chord arpeggios, calculate edge weights based on optimal transition tones, find shortest path, and reconstruct soloing line.
Result: Developed a computational engine that generates chord tone soloing suggestions through graph-based path optimization.
Conclusion: The system provides a user-friendly tool for guitar students to practice chord tone soloing effectively.
Abstract: We present a graph-based engine for computing chord tone soloing suggestions for guitar students. Chord tone soloing is a fundamental practice for improvising over a chord progression, where the instrumentalist uses only the notes contained in the current chord. This practice is a building block for all advanced jazz guitar theory but is difficult to learn and practice. First, we discuss methods for generating chord-tone arpeggios. Next, we construct a weighted graph where each node represents a chord tone arpeggio for a chord in the progression. Then, we calculate the edge weight between each consecutive chord’s nodes in terms of optimal transition tones. We then find the shortest path through this graph and reconstruct a chord-tone soloing line. Finally, we discuss a user-friendly system to handle input and output to this engine for guitar students to practice chord tone soloing.
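Because each chord in the progression contributes a layer of candidate arpeggios and edges only connect consecutive chords, the shortest-path step reduces to dynamic programming over a layered graph. Below is a minimal sketch of that step, with `transition_cost` standing in for the paper's optimal-transition-tone weighting and arpeggio shapes assumed to be hashable values such as strings.

```python
def best_soloing_line(layers, transition_cost):
    """`layers[i]` lists the candidate arpeggio shapes for chord i in the progression.
    Keep, for every shape, the cheapest line that reaches it, then read off the
    overall cheapest line (equivalent to a shortest path through the layered graph)."""
    costs = {shape: 0.0 for shape in layers[0]}
    paths = {shape: [shape] for shape in layers[0]}
    for layer in layers[1:]:
        new_costs, new_paths = {}, {}
        for nxt in layer:
            prev = min(costs, key=lambda p: costs[p] + transition_cost(p, nxt))
            new_costs[nxt] = costs[prev] + transition_cost(prev, nxt)
            new_paths[nxt] = paths[prev] + [nxt]
        costs, paths = new_costs, new_paths
    end = min(costs, key=costs.get)
    return paths[end], costs[end]
```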
[257] Explainable e-sports win prediction through Machine Learning classification in streaming
Silvia García-Méndez, Francisco de Arriba-Pérez
Main category: cs.AI
TL;DR: This paper presents an explainable win prediction system for e-sports that operates in streaming mode using sliding windows, achieving over 90% accuracy and surpassing existing solutions.
Details
Motivation: The growth of e-sports and cloud computing has created demand for better analytics. Traditional AI solutions focus on batch classification and lack visualization, while streaming data analysis with explainability is needed for real-time decision-making.Method: An explainable win prediction classification solution that processes input data through multiple sliding windows to capture relevant game changes in streaming mode.
Result: Experimental results achieved accuracy higher than 90%, outperforming competing solutions in the literature.
Conclusion: The system can be used by ranking and recommender systems for informed decision-making, with the explainability module building trust in prediction outcomes.
Abstract: The increasing number of spectators and players in e-sports, along with the development of optimized communication solutions and cloud computing technology, has motivated the constant growth of the online game industry. Even though Artificial Intelligence-based solutions for e-sports analytics are traditionally defined as extracting meaningful patterns from related data and visualizing them to enhance decision-making, most of the effort in professional winning prediction has been focused on the classification aspect from a batch perspective, also leaving aside the visualization techniques. Consequently, this work contributes to an explainable win prediction classification solution in streaming in which input data is controlled over several sliding windows to reflect relevant game changes. Experimental results attained an accuracy higher than 90 %, surpassing the performance of competing solutions in the literature. Ultimately, our system can be leveraged by ranking and recommender systems for informed decision-making, thanks to the explainability module, which fosters trust in the outcome predictions.
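A minimal sketch of the streaming setup, assuming a few sliding windows over incoming match events that feed an incrementally trained classifier; the window sizes and the two tracked signals (gold lead, kill lead) are hypothetical placeholders, not the paper's feature set.

```python
from collections import deque
import numpy as np
from sklearn.linear_model import SGDClassifier

class StreamingWinPredictor:
    """Keep several sliding windows over the event stream and learn incrementally."""

    def __init__(self, window_sizes=(30, 120, 300)):
        self.windows = {w: deque(maxlen=w) for w in window_sizes}
        self.clf = SGDClassifier(loss="log_loss")
        self._fitted = False

    def _features(self):
        # One aggregate per window: the mean of each tracked signal inside it.
        feats = []
        for win in self.windows.values():
            arr = np.array(win) if win else np.zeros((1, 2))
            feats.extend(arr.mean(axis=0))
        return np.array(feats).reshape(1, -1)

    def update(self, event, outcome=None):
        """`event` is a (gold_lead, kill_lead) pair; `outcome` is 1/0 when the result is known."""
        for win in self.windows.values():
            win.append(event)
        x = self._features()
        if outcome is not None:
            self.clf.partial_fit(x, [outcome], classes=[0, 1])
            self._fitted = True
        return self.clf.predict_proba(x)[0, 1] if self._fitted else 0.5
```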
[258] RLIE: Rule Generation with Logistic Regression, Iterative Refinement, and Evaluation for Large Language Models
Yang Yang, Hua XU, Zhangyi Hu, Yutao Yue
Main category: cs.AI
TL;DR: RLIE integrates LLMs with probabilistic modeling to learn weighted rules through generation, logistic regression, iterative refinement, and evaluation stages, showing that direct rule application outperforms LLM-based probabilistic integration.
Details
Motivation: To address the limitations of LLM-based rule learning approaches that ignore rule interactions and probabilistic inference, by combining LLMs' semantic generation capabilities with robust probabilistic rule learning methods.Method: Four-stage framework: 1) LLM generates and filters rule candidates, 2) Logistic regression learns probabilistic weights, 3) Iterative refinement updates rules using prediction errors, 4) Evaluation compares direct rule application vs. LLM-based inference.
Result: Direct application of weighted rules outperforms prompting LLMs with rules and weights, showing LLMs excel at semantic tasks but struggle with precise probabilistic integration.
Conclusion: RLIE demonstrates that coupling LLMs with probabilistic rule learning enables more reliable neuro-symbolic reasoning, clarifying LLMs’ strengths in generation vs. limitations in probabilistic reasoning.
Abstract: Large Language Models (LLMs) can propose rules in natural language, sidestepping the need for a predefined predicate space in traditional rule learning. Yet many LLM-based approaches ignore interactions among rules, and the opportunity to couple LLMs with probabilistic rule learning for robust inference remains underexplored. We present RLIE, a unified framework that integrates LLMs with probabilistic modeling to learn a set of weighted rules. RLIE has four stages: (1) Rule generation, where an LLM proposes and filters candidates; (2) Logistic regression, which learns probabilistic weights for global selection and calibration; (3) Iterative refinement, which updates the rule set using prediction errors; and (4) Evaluation, which compares the weighted rule set as a direct classifier with methods that inject rules into an LLM. We evaluate multiple inference strategies on real-world datasets. Applying rules directly with their learned weights yields superior performance, whereas prompting LLMs with the rules, weights, and logistic-model outputs surprisingly degrades accuracy. This supports the view that LLMs excel at semantic generation and interpretation but are less reliable for precise probabilistic integration. RLIE clarifies the potential and limitations of LLMs for inductive reasoning and couples them with classic probabilistic rule combination methods to enable more reliable neuro-symbolic reasoning.
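Stage 2 of the pipeline, learning probabilistic weights over LLM-proposed rules, can be sketched as logistic regression over a binary rule-firing matrix. The `rule_applies` predicate below is a hypothetical stand-in for however rule firing is judged (e.g. an LLM call), and the last helper simply collects the prediction errors that stage 3 (iterative refinement) would feed back to the rule generator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rule_feature_matrix(examples, rules, rule_applies):
    """Binary design matrix: entry (i, j) is 1 when rule j fires on example i."""
    return np.array([[int(rule_applies(rule, ex)) for rule in rules] for ex in examples])

def fit_rule_weights(examples, labels, rules, rule_applies):
    """Learn probabilistic weights that globally select and calibrate the rules."""
    X = rule_feature_matrix(examples, rules, rule_applies)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, labels)
    return model                     # model.coef_ holds one weight per rule

def misclassified(model, examples, labels, rules, rule_applies):
    """Prediction errors to hand back to the LLM for rule refinement."""
    X = rule_feature_matrix(examples, rules, rule_applies)
    preds = model.predict(X)
    return [ex for ex, y, p in zip(examples, labels, preds) if y != p]
```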
[259] Memo: Training Memory-Efficient Embodied Agents with Reinforcement Learning
Gunshi Gupta, Karmesh Yadav, Zsolt Kira, Yarin Gal, Rahaf Aljundi
Main category: cs.AI
TL;DR: Memo is a transformer-based architecture for RL that creates and retrieves memory using periodic summarization tokens, enabling efficient handling of long-horizon tasks while being more compute-efficient than naive long-context transformers.
Details
Motivation: Current transformer-based policies for embodied agents struggle with visual inputs overwhelming context limits, while humans effectively compress lifetime experiences into memories. Existing approaches use either fixed-size recurrent models or full-context transformers, lacking efficient memory compression.Method: Memo interleaves periodic summarization tokens with model inputs during training to create and retrieve memory. It’s designed for reinforcement learning on memory-intensive, long-horizon tasks.
Result: Memo outperforms naive long-context transformer baselines on gridworld meta-RL and multi-object navigation tasks while being more compute and storage efficient. It also generalizes better to longer contexts and remains robust in streaming settings.
Conclusion: Memo successfully addresses the memory bottleneck in embodied AI by enabling efficient memory creation and retrieval through summarization tokens, providing a scalable solution for long-horizon tasks with computational efficiency.
Abstract: To enable embodied agents to operate effectively over extended timeframes, it is crucial to develop models that form and access memories to stay contextualized in their environment. In the current paradigm of training transformer-based policies for embodied sequential decision-making tasks, visual inputs often overwhelm the context limits of transformers, while humans can maintain and utilize a lifetime of experience compressed as memories. Significant compression is possible in principle, as much of the input is irrelevant and can be abstracted. However, existing approaches predominantly focus on either recurrent models with fixed-size memory or transformers with full-context reliance. In this work, we propose Memo, a transformer-based architecture and training recipe for reinforcement learning (RL) on memory-intensive, long-horizon tasks. Memo incorporates the creation and retrieval of memory by interleaving periodic summarization tokens with the inputs of a model during training. We demonstrate Memo’s effectiveness on a gridworld meta-RL benchmark and a multi-object navigation task in photo-realistic indoor settings. Memo outperforms naive long-context transformer baselines while being more compute and storage efficient. Additionally, Memo generalizes better to longer contexts at inference time and remains robust in streaming settings, where historical context must be truncated to fit inference constraints.
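The interleaving of periodic summarization tokens can be pictured as inserting a learned summary slot after every fixed-size chunk of timestep embeddings. The sketch below uses an arbitrary period and dimensionality and is purely illustrative of the token layout, not Memo's training recipe.

```python
import torch

def interleave_summary_tokens(step_embeddings, summary_token, period=8):
    """Insert a summarization token after every `period` timestep embeddings,
    giving the model a slot into which the preceding chunk can be compressed.
    Shapes: step_embeddings is (T, d); summary_token is (d,)."""
    chunks = []
    for start in range(0, step_embeddings.shape[0], period):
        chunks.append(step_embeddings[start:start + period])
        chunks.append(summary_token.unsqueeze(0))       # periodic summary slot
    return torch.cat(chunks, dim=0)

# Usage sketch: a 32-step episode with 64-dim embeddings gains a summary slot every 8 steps.
seq = interleave_summary_tokens(torch.randn(32, 64), torch.zeros(64), period=8)
print(seq.shape)   # torch.Size([36, 64])
```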
[260] Misalignment Bounty: Crowdsourcing AI Agent Misbehavior
Rustem Turtayev, Natalia Fedorova, Oleg Serikov, Sergey Koldyba, Lev Avagyan, Dmitrii Volkov
Main category: cs.AI
TL;DR: The Misalignment Bounty project collected 295 submissions of AI systems acting against human intent, with 9 winning cases demonstrating unsafe goal pursuit.
Details
Motivation: To gather clear, reproducible examples of AI systems acting in ways that differ from human intent and pursuing unintended or unsafe goals.Method: Ran a crowdsourced project called Misalignment Bounty that collected cases from participants, receiving 295 submissions which were evaluated against specific criteria.
Result: 295 submissions were received, with 9 winning cases selected that demonstrated clear examples of AI misalignment and unsafe goal pursuit.
Conclusion: The project successfully identified and documented concrete examples of AI misalignment through crowdsourced submissions, providing valuable case studies of how advanced AI systems can diverge from human intent.
Abstract: Advanced AI systems sometimes act in ways that differ from human intent. To gather clear, reproducible examples, we ran the Misalignment Bounty: a crowdsourced project that collected cases of agents pursuing unintended or unsafe goals. The bounty received 295 submissions, of which nine were awarded. This report explains the program’s motivation and evaluation criteria, and walks through the nine winning submissions step by step.
[261] ROTATE: Regret-driven Open-ended Training for Ad Hoc Teamwork
Caroline Wang, Arrasy Rahman, Jiaxun Cui, Yoonchang Sung, Peter Stone
Main category: cs.AI
TL;DR: ROTATE is a unified framework for Ad Hoc Teamwork that combines teammate generation and agent training through an adversarial, regret-driven process to improve generalization to unseen partners.
Details
Motivation: Existing AHT approaches use separate stages for teammate generation and agent training, leading to limited behavior coverage and ignoring whether generated teammates are easy to learn from. Current methods treat training teammates as static, limiting generalization.Method: ROTATE reformulates AHT as an open-ended learning process between an AHT agent and adversarial teammate generator. It alternates between improving the AHT agent and generating teammates that probe its deficiencies using a regret-driven approach.
Result: Experiments across diverse two-player environments show ROTATE significantly outperforms baselines at generalizing to unseen evaluation teammates, establishing a new standard for robust teamwork.
Conclusion: The unified adversarial framework enables better coverage of teammate behaviors and more effective learning, leading to superior generalization in ad hoc teamwork scenarios.
Abstract: Learning to collaborate with previously unseen partners is a fundamental generalization challenge in multi-agent learning, known as Ad Hoc Teamwork (AHT). Existing AHT approaches often adopt a two-stage pipeline, where first, a fixed population of teammates is generated with the idea that they should be representative of the teammates that will be seen at deployment time, and second, an AHT agent is trained to collaborate well with agents in the population. To date, the research community has focused on designing separate algorithms for each stage. This separation has led to algorithms that generate teammates with limited coverage of possible behaviors, and that ignore whether the generated teammates are easy to learn from for the AHT agent. Furthermore, algorithms for training AHT agents typically treat the set of training teammates as static, thus attempting to generalize to previously unseen partner agents without assuming any control over the set of training teammates. This paper presents a unified framework for AHT by reformulating the problem as an open-ended learning process between an AHT agent and an adversarial teammate generator. We introduce ROTATE, a regret-driven, open-ended training algorithm that alternates between improving the AHT agent and generating teammates that probe its deficiencies. Experiments across diverse two-player environments demonstrate that ROTATE significantly outperforms baselines at generalizing to an unseen set of evaluation teammates, thus establishing a new standard for robust and generalizable teamwork.
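The open-ended alternation the abstract describes, improving the AHT agent and then generating a teammate that probes its weaknesses, has roughly the loop structure sketched below; every object and method here is a hypothetical placeholder rather than one of the paper's actual components.

```python
def rotate_training_loop(aht_agent, teammate_generator, env, iterations=100):
    """Alternate between (i) training the AHT agent against the current teammate pool and
    (ii) generating a teammate that maximizes the agent's regret, i.e. probes behaviors
    the agent currently handles poorly (regret-driven, open-ended training)."""
    teammate_pool = [teammate_generator.sample()]
    for _ in range(iterations):
        # (i) improve the AHT agent on the current pool
        aht_agent.train(env, teammates=teammate_pool)
        # (ii) add a teammate the agent is weakest against
        probe = teammate_generator.maximize_regret(aht_agent, env)
        teammate_pool.append(probe)
    return aht_agent, teammate_pool
```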
[262] Beyond Reactivity: Measuring Proactive Problem Solving in LLM Agents
Gil Pasternak, Dheeraj Rajagopal, Julia White, Dhruv Atreja, Matthew Thomas, George Hurn-Maloney, Ash Lewis
Main category: cs.AI
TL;DR: PROBE is a new benchmark for evaluating proactive LLM agents that decomposes proactivity into three capabilities: searching for unspecified issues, identifying bottlenecks, and executing resolutions. It shows current state-of-the-art models struggle with proactive reasoning.
Details
Motivation: Current benchmarks for evaluating proactive LLM agents are limited to localized context and cannot test reasoning across multiple sources and longer time horizons, making it challenging to properly assess proactivity.Method: PROBE decomposes proactivity into a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. This framework is applied to evaluate leading LLMs and agentic frameworks.
Result: Even state-of-the-art models struggle with the PROBE benchmark, with the best end-to-end performance of 40% achieved by both GPT-5 and Claude Opus-4.1. The study provides consistent measurements across frontier LLMs and analyzes their relative capabilities and failure modes.
Conclusion: The results highlight current limitations in autonomous action for agentic systems and expose promising future research directions for improving proactive reasoning capabilities in LLM-based agents.
Abstract: LLM-based agents are increasingly moving towards proactivity: rather than awaiting instruction, they exercise agency to anticipate user needs and solve them autonomously. However, evaluating proactivity is challenging; current benchmarks are constrained to localized context, limiting their ability to test reasoning across sources and longer time horizons. To address this gap, we present PROBE (Proactive Resolution Of BottlEnecks). PROBE decomposes proactivity as a pipeline of three core capabilities: (1) searching for unspecified issues, (2) identifying specific bottlenecks, and (3) executing appropriate resolutions. We apply PROBE to evaluate leading LLMs and popular agentic frameworks, showing that even state-of-the-art models struggle to solve this benchmark. Computing our consistent measurements across frontier LLMs and agents, we find that the best end-to-end performance of 40% is achieved by both GPT-5 and Claude Opus-4.1. Additionally, we demonstrate the relative capabilities of each model and analyze mutual failure modes. Our results highlight the current limitations of autonomous action in agentic systems, and expose promising future research directions.
[263] IM-Chat: A Multi-agent LLM Framework Integrating Tool-Calling and Diffusion Modeling for Knowledge Transfer in Injection Molding Industry
Junhyeong Lee, Joon-Young Kim, Heekyu Kim, Inhyo Lee, Seunghwa Ryu
Main category: cs.AI
TL;DR: IM-Chat is a multi-agent LLM framework for injection molding knowledge transfer that integrates documented knowledge and field data through a data-driven process condition generator, using RAG and tool-calling agents without fine-tuning.
Details
Motivation: Address challenges in preserving and transferring field knowledge in injection molding due to retiring experienced workers and multilingual communication barriers.
Method: Multi-agent framework using LLMs with RAG strategy and tool-calling agents, integrating documented knowledge and field data via data-driven process condition generator that infers optimal settings from environmental inputs.
Result: Evaluation across 160 tasks showed more capable models achieve higher accuracy, especially in complex scenarios. IM-Chat outperformed fine-tuned single-agent LLMs in accuracy (particularly quantitative reasoning) and scalability with multiple information sources.
Conclusion: Multi-agent LLM systems are viable for industrial knowledge workflows, with IM-Chat establishing a scalable and generalizable approach to AI-assisted decision support in manufacturing.
Abstract: The injection molding industry faces critical challenges in preserving and transferring field knowledge, particularly as experienced workers retire and multilingual barriers hinder effective communication. This study introduces IM-Chat, a multi-agent framework based on large language models (LLMs), designed to facilitate knowledge transfer in injection molding. IM-Chat integrates both limited documented knowledge (e.g., troubleshooting tables, manuals) and extensive field data modeled through a data-driven process condition generator that infers optimal manufacturing settings from environmental inputs such as temperature and humidity, enabling robust and context-aware task resolution. By adopting a retrieval-augmented generation (RAG) strategy and tool-calling agents within a modular architecture, IM-Chat ensures adaptability without the need for fine-tuning. Performance was assessed across 100 single-tool and 60 hybrid tasks for GPT-4o, GPT-4o-mini, and GPT-3.5-turbo by domain experts using a 10-point rubric focused on relevance and correctness, and was further supplemented by automated evaluation using GPT-4o guided by a domain-adapted instruction prompt. The evaluation results indicate that more capable models tend to achieve higher accuracy, particularly in complex, tool-integrated scenarios. In addition, compared with the fine-tuned single-agent LLM, IM-Chat demonstrated superior accuracy, particularly in quantitative reasoning, and greater scalability in handling multiple information sources. Overall, these findings demonstrate the viability of multi-agent LLM systems for industrial knowledge workflows and establish IM-Chat as a scalable and generalizable approach to AI-assisted decision support in manufacturing.
[264] Benchmarking World-Model Learning
Archana Warrier, Dat Nyugen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, Zenna Tavares
Main category: cs.AI
TL;DR: WorldTest is a new evaluation protocol for model-learning agents that separates reward-free interaction from testing in different environments, with AutumnBench as a concrete implementation showing humans outperform current models.
Details
Motivation: Current world model evaluation methods are limited by being anchored to next-frame prediction and same-environment reward maximization, failing to assess models' ability to support diverse downstream tasks.
Method: Proposed WorldTest protocol with reward-free exploration phase followed by scored testing in related but different environments, implemented as AutumnBench - 43 grid-world environments with 129 tasks across masked-frame prediction, planning, and causal dynamics prediction.
Result: Humans significantly outperform three frontier models on AutumnBench, and scaling compute only improves performance in some environments but not others, showing current models have substantial room for improvement.
Conclusion: WorldTest provides a novel evaluation template that better assesses what agents learn about environment dynamics, and AutumnBench reveals significant headroom in world-model learning capabilities.
Abstract: Model-learning agents should gather information to learn world models that support many downstream tasks and inferences, such as predicting unobserved states, estimating near- and far-term consequences of actions, planning action sequences, and detecting changes in dynamics. Current methods for learning and evaluating world models diverge from this goal: training and evaluation are anchored to next-frame prediction, and success is scored by reward maximization in the same environment. We propose WorldTest, a protocol to evaluate model-learning agents that separates reward-free interaction from a scored test phase in a different but related environment. WorldTest is open-ended – models should support many different tasks unknown ahead of time – and agnostic to model representation, allowing comparison across approaches. We instantiated WorldTest with AutumnBench, a suite of 43 interactive grid-world environments and 129 tasks across three families: masked-frame prediction, planning, and predicting changes to the causal dynamics. We compared 517 human participants and three frontier models on AutumnBench. We found that humans outperform the models, and scaling compute improves performance only in some environments but not others. WorldTest provides a novel template – reward-free exploration, derived tests, and behavior-based scoring – to evaluate what agents learn about environment dynamics, and AutumnBench exposes significant headroom in world-model learning.
[265] RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration
Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li
Main category: cs.AI
TL;DR: RADAR is a multi-agent collaborative framework for LLM safety evaluation that decomposes risk into explicit, implicit, and non-risk subspaces, using debate mechanisms to improve accuracy and reduce bias.
Details
Motivation: Existing LLM safety evaluation methods suffer from evaluator bias and detection failures due to model homogeneity, undermining risk evaluation robustness.
Method: Decompose latent risk concept space into explicit, implicit, and non-risk subspaces. Use RADAR framework with multi-agent collaboration, multi-round debates through four specialized roles, and dynamic update mechanisms for self-evolution.
Result: RADAR achieves 28.87% improvement in risk identification accuracy over strongest baseline, outperforms baselines on accuracy, stability, and self-evaluation risk sensitivity on 800 challenging cases and public benchmarks.
Conclusion: RADAR provides a robust safety evaluation paradigm that comprehensively covers explicit and implicit risks while mitigating evaluator bias through multi-agent collaboration and dynamic concept evolution.
Abstract: Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging testset and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.
[266] Hummer: Towards Limited Competitive Preference Dataset
Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Yichuan Ding, Qingpei Guo, Zujie Wen, Jun Zhou, Xiaotie Deng
Main category: cs.AI
TL;DR: The paper introduces Alignment Dimension Conflict metric to quantify conflicts in preference datasets, presents Hummer datasets with reduced-conflict alignment objectives, and develops reward models using hybrid sampling to balance alignment objectives effectively.
Details
Motivation: Preference datasets often have conflicting alignment objectives that increase vulnerability to jailbreak attacks and make it difficult to prioritize specific alignment objectives without negatively impacting others.
Method: Developed Alignment Dimension Conflict metric to quantify conflicts, created Hummer and Hummer-F datasets based on UltraFeedback with AI feedback from GPT-4, and built reward models using hybrid sampling approach.
Result: Created the first preference dataset aimed at reducing competition between alignment objectives, developed HummerRM and HummerRM-F reward models that effectively balance diverse alignment objectives.
Conclusion: The proposed approach enables better domain-specific fine-tuning and reduces vulnerabilities to attacks by addressing conflicts in alignment objectives within preference datasets.
Abstract: Preference datasets are essential for incorporating human preferences into pre-trained language models, playing a key role in the success of Reinforcement Learning from Human Feedback. However, these datasets often demonstrate conflicting alignment objectives, leading to increased vulnerability to jailbreak attacks and challenges in adapting downstream tasks to prioritize specific alignment objectives without negatively impacting others. In this work, we introduce a novel statistical metric, Alignment Dimension Conflict, to quantify the degree of conflict within preference datasets. We then present Hummer and its fine-grained variant, Hummer-F, as innovative pairwise preference datasets with reduced-conflict alignment objectives. Hummer is built on UltraFeedback and is enhanced by AI feedback from GPT-4, making it the first preference dataset aimed at reducing the competition between alignment objectives. Furthermore, we develop reward models, HummerRM and HummerRM-F, which employ a hybrid sampling approach to balance diverse alignment objectives effectively. This sampling method positions HummerRM as an ideal model for domain-specific further fine-tuning and reducing vulnerabilities to attacks.
[267] Reasoning Models Better Express Their Confidence
Dongkeun Yoon, Seungone Kim, Sohee Yang, Sunkyoung Kim, Soyeon Kim, Yongil Kim, Eunbi Choi, Yireun Kim, Minjoon Seo
Main category: cs.AI
TL;DR: Reasoning models with chain-of-thought (CoT) reasoning show better confidence calibration than non-reasoning models, with gains attributed to slow thinking behaviors like exploring alternatives and backtracking.
Details
Motivation: Large language models often fail to accurately communicate their confidence, limiting their reliability. This work aims to understand if reasoning models can improve confidence calibration.
Method: Benchmarked six reasoning models across six datasets, analyzed slow thinking behaviors in CoT reasoning, and tested calibration improvements through in-context learning guidance.
Result: Reasoning models achieved strictly better confidence calibration than non-reasoning counterparts in 33 out of 36 settings. Calibration improves progressively as CoT unfolds, and removing slow thinking behaviors reduces calibration.
Conclusion: Slow thinking behaviors in reasoning models are the key source of improved confidence calibration, and even non-reasoning models can achieve better calibration when guided to slow think.
Abstract: Despite their strengths, large language models (LLMs) often fail to communicate their confidence accurately, making it difficult to assess when they might be wrong and limiting their reliability. In this work, we demonstrate that reasoning models that engage in extended chain-of-thought (CoT) reasoning exhibit superior performance not only in problem-solving but also in accurately expressing their confidence. Specifically, we benchmark six reasoning models across six datasets and find that they achieve strictly better confidence calibration than their non-reasoning counterparts in 33 out of the 36 settings. Our detailed analysis reveals that these gains in calibration stem from the slow thinking behaviors of reasoning models (e.g., exploring alternative approaches and backtracking) which enable them to adjust their confidence dynamically throughout their CoT, making it progressively more accurate. In particular, we find that reasoning models become increasingly better calibrated as their CoT unfolds, a trend not observed in non-reasoning models. Moreover, removing slow thinking behaviors from the CoT leads to a significant drop in calibration. Lastly, we show that non-reasoning models also demonstrate enhanced calibration when simply guided to slow think via in-context learning, fully isolating slow thinking as the source of the calibration gains.
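Confidence calibration of this kind is commonly summarized with Expected Calibration Error (ECE); the sketch below shows the standard binned computation, which may differ from the exact metrics used in the paper.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by stated confidence, then compare average confidence
    # with average accuracy in each bin, weighted by bin occupancy.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Toy data: stated confidences and whether each answer was correct.
print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]))
```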
[268] Follow the STARs: Dynamic ω-Regular Shielding of Learned Policies
Ashwani Anand, Satya Prakash Nayak, Ritam Raha, Anne-Kathrin Schmuck
Main category: cs.AI
TL;DR: STARs is a dynamic post-shielding framework that enforces ω-regular properties on probabilistic policies, shifting from safety-only to safety+liveness shielding with tunable interference control.
Details
Motivation: To move beyond traditional safety-shielding (preventing bad events) to enforce both safety and liveness properties (ensuring good events eventually happen) while minimizing interference with pre-computed policies.
Method: Uses Strategy-Template-based Adaptive Runtime Shields (STARs) with permissive strategy templates and dynamic interference control mechanism that balances formal obligations with task-specific behavior at runtime.
Result: STARs successfully enforce ω-regular properties on learned probabilistic policies with controllable interference, support runtime adaptation to changing specifications or failures, and demonstrate effectiveness on mobile robot benchmarks.
Conclusion: STARs represent a paradigm shift in shielding by enabling comprehensive ω-regular property enforcement with tunable interference, making them particularly suitable for cyber-physical systems requiring runtime adaptation.
Abstract: This paper presents a novel dynamic post-shielding framework that enforces the full class of $\omega$-regular correctness properties over pre-computed probabilistic policies. This constitutes a paradigm shift from the predominant setting of safety-shielding – i.e., ensuring that nothing bad ever happens – to a shielding process that additionally enforces liveness – i.e., ensures that something good eventually happens. At the core, our method uses Strategy-Template-based Adaptive Runtime Shields (STARs), which leverage permissive strategy templates to enable post-shielding with minimal interference. As its main feature, STARs introduce a mechanism to dynamically control interference, allowing a tunable enforcement parameter to balance formal obligations and task-specific behavior at runtime. This allows more aggressive enforcement to be triggered when needed, while allowing for optimized policy choices otherwise. In addition, STARs support runtime adaptation to changing specifications or actuator failures, making them especially suited for cyber-physical applications. We evaluate STARs on a mobile robot benchmark to demonstrate their controllable interference when enforcing (incrementally updated) $\omega$-regular correctness properties over learned probabilistic policies.
[269] Can Agents Fix Agent Issues?
Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, Yiling Lou
Main category: cs.AI
TL;DR: This paper introduces AGENTISSUE-BENCH, a benchmark for evaluating software engineering agents’ ability to resolve real-world issues in LLM-based agent systems, revealing their limited effectiveness (3.33%-12.67% success rates).
Details
Motivation: LLM-based agent systems are widely used but prone to bugs and evolving requirements, yet existing SE agents designed for traditional software may not effectively handle agent-specific issues.
Method: Manually analyzed 201 real agent issues, identified common categories, and constructed AGENTISSUE-BENCH with 50 reproducible agent issue resolution tasks including executable environments and failure-triggering tests.
Result: Evaluation of state-of-the-art SE agents showed very limited effectiveness with resolution rates ranging from only 3.33% to 12.67%.
Conclusion: Agent systems present unique maintenance challenges distinct from traditional software, highlighting the need for developing more advanced SE agents specifically designed for resolving agent issues.
Abstract: LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/ .
[270] Measuring Data Science Automation: A Survey of Evaluation Tools for AI Assistants and Agents
Irene Testini, José Hernández-Orallo, Lorenzo Pacchiardi
Main category: cs.AI
TL;DR: Survey of LLM assistants and agents for data science evaluation, identifying gaps in coverage of data management, human-AI collaboration, and task transformation.
Details
Motivation: To understand how Large Language Models (LLMs) are being evaluated as assistants and agents in data science workflows, and identify research gaps in current evaluation practices.
Method: Conducted a comprehensive survey of existing literature on LLM assistants and agents for data science, analyzing evaluation methodologies and focus areas.
Result: Found three main gaps: (1) narrow focus on goal-oriented activities while ignoring data management and exploratory work, (2) concentration on pure assistance or full autonomy without intermediate collaboration levels, (3) emphasis on human substitution rather than task transformation for higher automation.
Conclusion: Current evaluation of LLMs in data science is limited and needs broader coverage of data science activities, more nuanced human-AI collaboration models, and consideration of task transformation for true automation benefits.
Abstract: Data science aims to extract insights from data to support decision-making processes. Recently, Large Language Models (LLMs) have been increasingly used as assistants for data science, by suggesting ideas, techniques and small code snippets, or for the interpretation of results and reporting. Proper automation of some data-science activities is now promised by the rise of LLM agents, i.e., AI systems powered by an LLM equipped with additional affordances–such as code execution and knowledge bases–that can perform self-directed actions and interact with digital environments. In this paper, we survey the evaluation of LLM assistants and agents for data science. We find (1) a dominant focus on a small subset of goal-oriented activities, largely ignoring data management and exploratory activities; (2) a concentration on pure assistance or fully autonomous agents, without considering intermediate levels of human-AI collaboration; and (3) an emphasis on human substitution, therefore neglecting the possibility of higher levels of automation thanks to task transformation.
[271] The Endless Tuning. An Artificial Intelligence Design To Avoid Human Replacement and Trace Back Responsibilities
Elio Grande
Main category: cs.AI
TL;DR: The Endless Tuning is a design method for reliable AI deployment using a double mirroring process to avoid human replacement and fill the responsibility gap, tested in three applications with domain experts.
Details
Motivation: To address the responsibility gap in AI systems and prevent human replacement, while building a bridge between accountability and liability in case of damage.
Method: A double mirroring process protocol implemented in three prototypical applications (loan granting, pneumonia diagnosis, art style recognition) using deep learning models with reversed and hermeneutic deployment of XAI algorithms, tested with domain experts.
Result: Full control was perceived by interviewees in decision-making settings, with focus on user experience rather than statistical accuracy. A bridge between accountability and liability was demonstrated.
Conclusion: The Endless Tuning method successfully provides a different voice in AI ethics, enabling reliable deployment while maintaining human control and addressing responsibility gaps through philosophical-technical integration.
Abstract: The Endless Tuning is a design method for a reliable deployment of artificial intelligence based on a double mirroring process, which pursues both the goals of avoiding human replacement and filling the so-called responsibility gap (Matthias 2004). Originally depicted in (Fabris et al. 2024) and ensuing the relational approach urged therein, it was then actualized in a protocol, implemented in three prototypical applications regarding decision-making processes (respectively: loan granting, pneumonia diagnosis, and art style recognition) and tested with as many domain experts. Step by step illustrating the protocol, giving insights concretely showing a different voice (Gilligan 1993) in the ethics of artificial intelligence, a philosophical account of technical choices (e.g., a reversed and hermeneutic deployment of XAI algorithms) will be provided in the present study together with the results of the experiments, focusing on user experience rather than statistical accuracy. Even thoroughly employing deep learning models, full control was perceived by the interviewees in the decision-making setting, while it appeared that a bridge can be built between accountability and liability in case of damage.
[272] AgentTTS: Large Language Model Agent for Test-time Compute-optimal Scaling Strategy in Complex Tasks
Fali Wang, Hui Liu, Zhenwei Dai, Jingying Zeng, Zhiwei Zhang, Zongyu Wu, Chen Luo, Zhen Li, Xianfeng Tang, Qi He, Suhang Wang
Main category: cs.AI
TL;DR: AgentTTS is a framework that uses LLM agents to autonomously find compute-optimal model and budget allocations for multi-stage complex tasks, outperforming traditional methods in efficiency and robustness.
Details
Motivation: Existing test-time scaling research focuses on single-stage tasks, but real-world problems are often multi-stage complex tasks with heterogeneous subtasks requiring specific LLM capabilities, creating a gap in compute-optimal scaling for such scenarios.
Method: Proposed AgentTTS framework that uses LLM agents to autonomously search for compute-optimal allocations through iterative feedback-driven interactions with the execution environment, based on empirical insights from pilot experiments.
Result: AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, shows improved robustness to varying training set sizes, and provides enhanced interpretability.
Conclusion: The proposed AgentTTS framework effectively addresses the challenges of test-time compute-optimal scaling in multi-stage complex tasks through autonomous LLM-agent-based search.
Abstract: Test-time scaling (TTS) enhances the performance of large language models (LLMs) by allocating additional compute resources during inference. However, existing research primarily investigates TTS in single-stage tasks, while many real-world problems are multi-stage complex tasks composed of a sequence of heterogeneous subtasks, each requiring an LLM with specific capabilities. Therefore, we study a novel problem: test-time compute-optimal scaling in multi-stage complex tasks, aiming to select suitable models and allocate budgets per subtask to maximize overall performance. TTS in multi-stage tasks introduces two fundamental challenges: (i) The combinatorial search space of model and budget allocations, combined with the high cost of inference, makes brute-force search impractical. (ii) The optimal model and budget allocations across subtasks are interdependent, increasing the complexity of the compute-optimal search. To address this gap, we conduct extensive pilot experiments on four tasks across six datasets, deriving three empirical insights characterizing the behavior of LLMs in multi-stage complex tasks. Informed by these insights, we propose AgentTTS, an LLM-agent-based framework that autonomously searches for compute-optimal allocations through iterative feedback-driven interactions with the execution environment. Experimental results demonstrate that AgentTTS significantly outperforms traditional and other LLM-based baselines in search efficiency, and shows improved robustness to varying training set sizes and enhanced interpretability.
[273] Traffic-R1: Reinforced LLMs Bring Human-Like Reasoning to Traffic Signal Control Systems
Xingchen Zou, Yuhao Yang, Zheng Chen, Xixuan Hao, Yiqi Chen, Chao Huang, Yuxuan Liang
Main category: cs.AI
TL;DR: Traffic-R1 is a 3B-parameter foundation model for traffic signal control that achieves human-like reasoning through self-exploration and iterative reinforcement learning, offering zero-shot generalization, real-time edge deployment, and explainable multi-intersection coordination.
Details
Motivation: To overcome limitations of traditional reinforcement learning and recent LLM-based methods in traffic signal control by developing a model that can generalize to new scenarios without retraining, support real-time deployment, and provide explainable decision-making.
Method: Developed via self-exploration and iterative reinforcement of LLM with expert guidance in simulated traffic environments, leveraging internal traffic-control policies and reasoning capabilities.
Result: Outperforms strong baselines and training-intensive RL controllers, manages signals affecting 55,000+ drivers daily, reduces average queue lengths by >5%, and halves operator workload in production deployment.
Conclusion: Traffic-R1 demonstrates that compact foundation models with human-like reasoning can achieve superior traffic signal control performance with zero-shot generalization, real-time capability, and explainable coordination, making them practical for real-world deployment.
Abstract: We introduce Traffic-R1, a 3B-parameter foundation model with human-like reasoning for Traffic signal control (TSC), developed via self-exploration and iterative reinforcement of LLM with expert guidance in a simulated traffic environment. Compared with traditional reinforcement learning and recent LLM-based methods, Traffic-R1 offers three main advantages: zero-shot generalization, transferring unchanged to new road networks and out-of-distribution incidents by leveraging internal traffic-control policies and reasoning; a compact 3B-parameter design that supports real-time inference on mobile-class chips for edge deployment; and an explainable TSC process that enables multi-intersection coordination through communication and an asynchronous communication network. Extensive benchmarks show Traffic-R1 outperforms strong baselines and training-intensive RL controllers. In production, the model now manages signals affecting over 55,000 drivers daily, reduces average queue lengths by more than 5%, and halves operator workload. Our model is available at https://huggingface.co/Season998/Traffic-R1.
[274] Test-time Prompt Intervention
Chenxu Yang, Qingyi Si, Mz Dai, Dingyu Yao, Mingyu Zheng, Minghui Chen, Zheng Lin, Weiping Wang
Main category: cs.AI
TL;DR: PI is a test-time prompt intervention framework that dynamically guides LLM reasoning paths during inference to reduce redundancy and improve reliability in chain-of-thought generation.
Details
Motivation: Current LLMs with test-time compute produce redundant chain-of-thought reasoning due to post-training that relies on outcome rewards rather than process rewards, which are hard to construct at scale.
Method: PI framework with three modules: When (timely intervention), How (proper intervention), and Which (post-intervention sampling) to integrate human expertise and cognitive principles into reasoning.
Result: Extensive experiments show PI significantly shortens CoTs while reducing hallucination, producing more concise and reliable reasoning across multiple models and datasets.
Conclusion: PI provides an effective interface for dynamically regulating LLM reasoning paths, enhancing controllability and interpretability while maintaining reasoning quality.
Abstract: Test-time compute has led to remarkable success in the large language model (LLM) community, particularly for complex tasks, where longer chains of thought (CoTs) are generated to enhance reasoning capabilities. However, growing evidence reveals that such reasoning models often produce CoTs plagued by excessive redundancy, including unnecessary verification steps and repetitive reasoning shifts. The root cause lies in their post-training, which relies heavily on outcome reward paradigms, since data for process reward paradigms, which regulate intermediate reasoning steps, is difficult to construct at scale. To address this, we propose PI, a novel framework for Test-time Prompt Intervention. PI provides an interface to dynamically guide and regulate reasoning paths during inference through timely (When module) and proper (How module) interventions and post-intervention sampling (Which module). This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs’ reasoning processes, enhancing controllability and interpretability. Extensive experiments across multiple models and datasets demonstrate that PI significantly shortens CoTs while reducing hallucination, yielding more concise and reliable reasoning.
[275] Benchmarking Large Language Models for Personalized Guidance in AI-Enhanced Learning
Bo Yuan, Jiazi Hu
Main category: cs.AI
TL;DR: Empirical comparison of three LLMs (GPT-4o, DeepSeek-V3, GLM-4.5) as tutoring assistants shows GPT-4o generally preferred for generating more informative and structured feedback in personalized learning scenarios.
Details
Motivation: While LLMs are increasingly envisioned as intelligent assistants for personalized learning, systematic head-to-head evaluations in authentic learning scenarios remain scarce.
Method: Used dataset with student responses to mixed-format questions, asked models to analyze quiz, infer student mastery, and generate guidance. Employed Gemini as virtual judge for pairwise comparisons across accuracy, clarity, actionability, and appropriateness dimensions.
Result: GPT-4o is generally preferred, producing more informative and better structured feedback than counterparts. DeepSeek-V3 and GLM-4.5 show intermittent strengths but lower consistency.
Conclusion: Findings highlight feasibility of deploying LLMs as advanced teaching assistants for individualized support and provide methodological insights for LLM-driven personalized learning research.
Abstract: While Large Language Models (LLMs) are increasingly envisioned as intelligent assistants for personalized learning, systematic head-to-head evaluations in authentic learning scenarios remain scarce. This study presents an empirical comparison of three state-of-the-art LLMs on a tutoring task simulating a realistic learning setting. Using a dataset containing a student’s responses to ten mixed-format questions with correctness labels, each model was asked to (i) analyze the quiz to identify underlying knowledge components, (ii) infer the student’s mastery profile, and (iii) generate targeted guidance for improvement. To mitigate subjectivity and evaluator bias, Gemini was employed as a virtual judge to perform pairwise comparisons across multiple dimensions: accuracy, clarity, actionability, and appropriateness. Results analyzed via the Bradley-Terry model reveal that GPT-4o is generally preferred, producing feedback that is more informative and better structured than its counterparts, whereas DeepSeek-V3 and GLM-4.5 demonstrate intermittent strengths but lower consistency. These findings highlight the feasibility of deploying LLMs as advanced teaching assistants for individualized support and provide methodological insights for subsequent empirical research on LLM-driven personalized learning.
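For reference, the Bradley-Terry analysis mentioned above can be fit from pairwise judge outcomes with the classic iterative (Zermelo/MM) update; the win counts below are invented for illustration, not the paper's data.

```python
from collections import defaultdict

def bradley_terry(wins, iters=200):
    # wins: {(winner, loser): count}. Strength update:
    # strength_i <- wins_i / sum_j n_ij / (strength_i + strength_j), then normalize.
    models = sorted({m for pair in wins for m in pair})
    strength = {m: 1.0 for m in models}
    n = defaultdict(float)  # total comparisons per unordered pair
    for (i, j), w in wins.items():
        n[frozenset((i, j))] += w
    for _ in range(iters):
        new = {}
        for i in models:
            w_i = sum(w for (a, b), w in wins.items() if a == i)
            denom = sum(
                n[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models if j != i and n[frozenset((i, j))] > 0
            )
            new[i] = w_i / denom if denom > 0 else strength[i]
        total = sum(new.values())
        strength = {m: v / total for m, v in new.items()}
    return strength

# Hypothetical pairwise judge outcomes: (winner, loser) -> count.
wins = {("GPT-4o", "DeepSeek-V3"): 7, ("DeepSeek-V3", "GPT-4o"): 3,
        ("GPT-4o", "GLM-4.5"): 8, ("GLM-4.5", "GPT-4o"): 2,
        ("DeepSeek-V3", "GLM-4.5"): 5, ("GLM-4.5", "DeepSeek-V3"): 5}
print(bradley_terry(wins))
```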
[276] Data Efficient Adaptation in Large Language Models via Continuous Low-Rank Fine-Tuning
Xiao Han, Zimo Zhao, Wanyu Wang, Maolin Wang, Zitao Liu, Yi Chang, Xiangyu Zhao
Main category: cs.AI
TL;DR: DEAL is a novel framework that combines Low-Rank Adaptation (LoRA) with continuous fine-tuning to address catastrophic forgetting and improve data efficiency in LLM fine-tuning.
Details
Motivation: Conventional fine-tuning approaches suffer from catastrophic forgetting and suboptimal data efficiency, limiting their real-world applicability for adapting LLMs to specific tasks.
Method: Integrates Low-Rank Adaptation (LoRA) with continuous fine-tuning strategy, incorporating knowledge retention and adaptive parameter update modules.
Result: Experiments on 15 diverse datasets show DEAL consistently outperforms baseline methods with substantial gains in task accuracy and resource efficiency.
Conclusion: DEAL demonstrates potential to advance continual adaptation in LLMs by enhancing task performance while improving resource efficiency.
Abstract: Recent advancements in Large Language Models (LLMs) have emphasized the critical role of fine-tuning (FT) techniques in adapting LLMs to specific tasks, especially when retraining from scratch is computationally infeasible. Fine-tuning enables LLMs to leverage task- or domain-specific data, producing models that more effectively meet the requirements of targeted applications. However, conventional FT approaches often suffer from catastrophic forgetting and suboptimal data efficiency, limiting their real-world applicability. To address these challenges, this paper proposes \textbf{DEAL}, a novel framework that integrates Low-Rank Adaptation (LoRA) with a continuous fine-tuning strategy. By incorporating knowledge retention and adaptive parameter update modules, the framework mitigates the limitations of existing FT methods while maintaining efficiency. Experiments on 15 diverse datasets show that DEAL consistently outperforms baseline methods, yielding substantial gains in task accuracy and resource efficiency. These findings demonstrate the potential of our approach to advance continual adaptation in LLMs by enhancing task performance while improving resource efficiency. The source code is publicly available at https://github.com/zzm-black/DEAL-Continuous-Low-Rank-Fine-Tuning.
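The low-rank adaptation idea DEAL builds on can be sketched in a few lines: a frozen weight is augmented with a trainable low-rank product, so only a small number of parameters are updated. This is a generic LoRA sketch, not the DEAL implementation, and the knowledge-retention and adaptive-update modules described above are not shown.

```python
import numpy as np

class LoRALinear:
    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
        self.A = rng.normal(size=(r, d_in)) * 0.01   # trainable low-rank factor
        self.B = np.zeros((d_out, r))                # trainable, zero-initialized
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus scaled low-rank update: x W^T + scale * x A^T B^T
        return x @ self.W.T + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=16, d_out=8)
print(layer.forward(np.ones((2, 16))).shape)  # (2, 8)
```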
[277] EEG-Based Consumer Behaviour Prediction: An Exploration from Classical Machine Learning to Graph Neural Networks
Mohammad Parsa Afshar, Aryan Azimi
Main category: cs.AI
TL;DR: This research compares classical machine learning models and Graph Neural Networks (GNNs) for predicting consumer behavior using EEG data from the NeuMa dataset, finding that GNNs generally performed better than classical models.
Details
Motivation: To predict consumer behavior using EEG data, which provides detailed information about brain neural activity, for applications in marketing, cognitive neuroscience, and human-computer interaction.
Method: Extracted and cleaned EEG features from NeuMa dataset, created brain connectivity features for GNN models, and compared various classical machine learning models (including ensemble models and SVM) with different GNN architectures.
Result: No significant overall performance difference between models, but GNN models generally performed better in some basic criteria where classical models were unsatisfactory.
Conclusion: Combining EEG signal analysis with machine learning provides deeper understanding of consumer behavior, and GNNs show promise as an alternative to traditional models in EEG-based neuromarketing.
Abstract: Prediction of consumer behavior is one of the important purposes in marketing, cognitive neuroscience, and human-computer interaction. The electroencephalography (EEG) data can help analyze the decision process by providing detailed information about the brain’s neural activity. In this research, a comparative approach is utilized for predicting consumer behavior by EEG data. In the first step, the features of the EEG data from the NeuMa dataset were extracted and cleaned. For the Graph Neural Network (GNN) models, the brain connectivity features were created. Different machine learning models, such as classical models and Graph Neural Networks, are used and compared. The GNN models with different architectures are implemented to have a comprehensive comparison; furthermore, a wide range of classical models, such as ensemble models, are applied, which can be very helpful to show the difference and performance of each model on the dataset. Although the results did not show a significant difference overall, the GNN models generally performed better in some basic criteria where classical models were not satisfactory. This study not only shows that combining EEG signal analysis and machine learning models can provide an approach to deeper understanding of consumer behavior, but also provides a comprehensive comparison between the machine learning models that have been widely used in previous studies in the EEG-based neuromarketing such as Support Vector Machine (SVM), and the models which are not used or rarely used in the field, like Graph Neural Networks.
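One common way to derive the brain connectivity features mentioned above is to treat electrodes as graph nodes and thresholded channel correlations as edge weights; the sketch below uses that generic construction, which may differ from the paper's exact connectivity measure.

```python
import numpy as np

def connectivity_adjacency(eeg, threshold=0.5):
    # eeg: array of shape (n_channels, n_samples); each channel becomes a graph node.
    corr = np.corrcoef(eeg)                       # pairwise channel correlations
    adj = (np.abs(corr) >= threshold).astype(float)
    np.fill_diagonal(adj, 0.0)                    # drop self-loops
    return corr, adj

# Toy data: 8 channels of random signal standing in for real EEG recordings.
eeg = np.random.default_rng(0).normal(size=(8, 1024))
corr, adj = connectivity_adjacency(eeg)
print(adj.sum(axis=1))  # node degrees that a GNN could consume alongside node features
```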
[278] Democratizing AI scientists using ToolUniverse
Shanghua Gao, Richard Zhu, Pengwei Sui, Zhenglun Kong, Sufian Aldogom, Yepeng Huang, Ayush Noori, Reza Shamji, Krishna Parvataneni, Theodoros Tsiligkaridis, Marinka Zitnik
Main category: cs.AI
TL;DR: ToolUniverse is an ecosystem for building AI scientists that standardizes tool identification and calling across 600+ ML models, datasets, APIs, and scientific packages, with automatic interface refinement and workflow composition.
Details
Motivation: Current AI scientist systems are bespoke, tied to rigid workflows, and lack shared environments that unify tools, data, and analyses into a common ecosystem, unlike the unified ecosystems that have transformed genomics research.
Method: ToolUniverse standardizes how AI scientists identify and call tools, automatically refines tool interfaces for correct use, generates new tools from natural language descriptions, iteratively optimizes tool specifications, and composes tools into agentic workflows.
Result: In a hypercholesterolemia case study, ToolUniverse created an AI scientist that identified a potent analog of a drug with favorable predicted properties.
Conclusion: ToolUniverse provides the necessary infrastructure for building AI scientists comparable to the unified ecosystems that have transformed genomics research, enabling interoperability, reuse, and community-driven development.
Abstract: AI scientists are emerging computational systems that serve as collaborative partners in discovery. These systems remain difficult to build because they are bespoke, tied to rigid workflows, and lack shared environments that unify tools, data, and analyses into a common ecosystem. In genomics, unified ecosystems have transformed research by enabling interoperability, reuse, and community-driven development; AI scientists require comparable infrastructure. We present ToolUniverse, an ecosystem for building AI scientists from any language or reasoning model across open- and closed-weight models. ToolUniverse standardizes how AI scientists identify and call tools by providing more than 600 machine learning models, datasets, APIs, and scientific packages for data analysis, knowledge retrieval, and experimental design. It automatically refines tool interfaces for correct use by AI scientists, generates new tools from natural language descriptions, iteratively optimizes tool specifications, and composes tools into agentic workflows. In a case study of hypercholesterolemia, ToolUniverse was used to create an AI scientist to identify a potent analog of a drug with favorable predicted properties. The open-source ToolUniverse is available at https://aiscientist.tools.
[279] The Open Syndrome Definition
Ana Paula Gomes Ferreira, Aleksandar Anžel, Izabel Oliva Marcilio de Souza, Helen Hughes, Alex J Elliot, Jude Dzevela Kong, Madlen Schranz, Alexander Ullrich, Georges Hattab
Main category: cs.AI
TL;DR: The paper proposes the first open, machine-readable format for representing case and syndrome definitions to address interoperability challenges in public health data exchange and AI applications.
Details
Motivation: Current case definitions lack standardized machine-readable formats, causing challenges in interoperability, epidemiological research, data exchange, and computational analysis including AI applications across organizations and regions.
Method: Developed an open machine-readable format for case definitions, created a comprehensive dataset of standardized definitions, built tools to convert human-readable definitions to machine-readable format, and established an online platform for browsing and contributing definitions.
Result: Created the Open Syndrome Definition format and accessible online platform (https://opensyndrome.org) with tools for conversion and analysis, enabling consistent and scalable use of case definitions across systems.
Conclusion: The Open Syndrome Definition format unlocks AI’s potential to strengthen public health preparedness and response by enabling machine-readable case definitions that facilitate interoperability and technological innovation.
Abstract: Case definitions are essential for effectively communicating public health threats. However, the absence of a standardized, machine-readable format poses significant challenges to interoperability, epidemiological research, the exchange of qualitative data, and the effective application of computational analysis methods, including artificial intelligence (AI). This complicates comparisons and collaborations across organizations and regions, limits data integration, and hinders technological innovation in public health. To address these issues, we propose the first open, machine-readable format for representing case and syndrome definitions. Additionally, we introduce the first comprehensive dataset of standardized case definitions and tools to convert existing human-readable definitions into machine-readable formats. We also provide an accessible online platform for browsing, analyzing, and contributing new definitions, available at https://opensyndrome.org. The Open Syndrome Definition format enables consistent, scalable use of case definitions across systems, unlocking AI’s potential to strengthen public health preparedness and response. The source code for the format can be found at https://github.com/OpenSyndrome/schema under the MIT license.
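To give a flavor of what a machine-readable case definition enables, here is a purely hypothetical example and matcher in Python; it is not the actual Open Syndrome Definition schema, which is documented in the linked repository.

```python
# Hypothetical structured case definition: criteria that a reported case either meets or not.
influenza_like_illness = {
    "name": "influenza-like illness (illustrative example)",
    "version": "0.1-example",
    "criteria": {
        "all_of": [{"observation": "body_temperature", "min_value": 38.0, "unit": "Cel"}],
        "any_of": [{"observation": "cough"}, {"observation": "sore_throat"}],
    },
}

def matches(case, definition):
    # A case matches when every "all_of" threshold is met and at least one "any_of" symptom is present.
    crit = definition["criteria"]
    all_ok = all(
        case.get(c["observation"]) is not None and case[c["observation"]] >= c["min_value"]
        for c in crit["all_of"]
    )
    any_ok = any(case.get(c["observation"]) for c in crit["any_of"])
    return all_ok and any_ok

print(matches({"body_temperature": 38.5, "cough": True}, influenza_like_illness))  # True
```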
[280] Base Models Know How to Reason, Thinking Models Learn When
Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
Main category: cs.AI
TL;DR: Thinking models like DeepSeek R1 outperform base models by efficiently deploying pre-existing reasoning capabilities at the right time rather than learning entirely new capabilities.
Details
Motivation: To understand whether thinking models learn new reasoning capabilities or simply repurpose existing base model capabilities through better deployment timing.
Method: Proposed a hybrid model that activates reasoning mechanisms in base models at appropriate times, using an unsupervised bottom-up approach to discover human-interpretable reasoning behaviors without manual assumptions.
Result: The hybrid model recovered up to 91% of the performance gap to thinking models without weight updates while steering only 12% of tokens, across three base and four thinking models on GSM8K and MATH500 benchmarks.
Conclusion: Pre-training provides most reasoning mechanisms, while post-training teaches efficient deployment timing, enabling thinking models to better utilize their inference-time compute.
Abstract: Why do thinking language models like DeepSeek R1 outperform their base counterparts? Despite consistent performance gains, it remains unclear to what extent thinking models learn entirely new reasoning capabilities or repurpose pre-existing base model ones. In this work, we propose a hybrid model where we activate reasoning mechanisms in base models at the right time to elicit thinking-model-level reasoning chains, implying that thinking models exploit already existing capabilities. To ground our analysis, we introduce an unsupervised, bottom-up approach for uncovering human-interpretable reasoning behaviors in thinking models. This approach provides an unbiased method to discover reasoning behaviors without imposing manual or LLM-derived assumptions. Across three base and four thinking models, using GSM8K and MATH500, our hybrid model recovers up to 91% of the performance gap to thinking models without any weight updates while steering only 12% of tokens. Concretely, our empirical setup provides a simple, causal way to test the effectiveness of existing reasoning mechanisms in base models by invoking them directly and measuring the resulting task performance. More broadly, these results reframe our understanding of how thinking models are trained: pre-training is when models acquire most of their reasoning mechanisms, and post-training teaches efficient deployment of these mechanisms at the right time, enabling efficient use of their inference-time compute.
[281] ARM-FM: Automated Reward Machines via Foundation Models for Compositional Reinforcement Learning
Roger Creus Castanyer, Faisal Mohamed, Pablo Samuel Castro, Cyrus Neary, Glen Berseth
Main category: cs.AI
TL;DR: ARM-FM automates reward design in RL using foundation models to generate reward machines from natural language, enabling compositional task decomposition and zero-shot generalization.
Details
Motivation: RL algorithms are highly sensitive to reward function specification, which limits their broad applicability. Manual reward design is challenging and time-consuming.
Method: Use foundation models to automatically generate reward machines from natural language specifications, associate language embeddings with each automata-state for generalization, and leverage the structured formalism of RMs for effective task decomposition.
Result: Empirical evidence shows ARM-FM’s effectiveness in diverse challenging environments, including zero-shot generalization capabilities.
Conclusion: ARM-FM provides an automated framework for compositional reward design that bridges natural language specifications with formal RL objectives, addressing the central challenge of reward function sensitivity in RL.
Abstract: Reinforcement learning (RL) algorithms are highly sensitive to reward function specification, which remains a central challenge limiting their broad applicability. We present ARM-FM: Automated Reward Machines via Foundation Models, a framework for automated, compositional reward design in RL that leverages the high-level reasoning capabilities of foundation models (FMs). Reward machines (RMs) – an automata-based formalism for reward specification – are used as the mechanism for RL objective specification, and are automatically constructed via the use of FMs. The structured formalism of RMs yields effective task decompositions, while the use of FMs enables objective specifications in natural language. Concretely, we (i) use FMs to automatically generate RMs from natural language specifications; (ii) associate language embeddings with each RM automata-state to enable generalization across tasks; and (iii) provide empirical evidence of ARM-FM’s effectiveness in a diverse suite of challenging environments, including evidence of zero-shot generalization.
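A reward machine itself is a small, easily coded object: a finite automaton whose transitions fire on propositional events and emit rewards. The sketch below shows that structure on an invented "get key, then open door" task; the FM-based generation from natural language that ARM-FM contributes is not shown.

```python
class RewardMachine:
    def __init__(self, transitions, initial, terminal):
        # transitions: {(state, event): (next_state, reward)}
        self.transitions, self.state, self.terminal = transitions, initial, terminal

    def step(self, events):
        # Advance on the first event (true proposition this env step) with a defined transition.
        for event in events:
            if (self.state, event) in self.transitions:
                self.state, reward = self.transitions[(self.state, event)]
                return reward
        return 0.0

    def done(self):
        return self.state in self.terminal

rm = RewardMachine(
    transitions={("u0", "got_key"): ("u1", 0.1), ("u1", "opened_door"): ("u2", 1.0)},
    initial="u0", terminal={"u2"},
)
print(rm.step({"got_key"}), rm.step({"opened_door"}), rm.done())  # 0.1 1.0 True
```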
[282] RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning
Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li
Main category: cs.AI
TL;DR: RoboGPT-R1 is a two-stage fine-tuning framework that combines supervised training with reinforcement learning to improve embodied agents’ reasoning for long-horizon manipulation tasks, achieving significant performance gains over larger models.
Details
Motivation: Current large language and vision models struggle with long-horizon manipulation tasks in complex real-world environments due to limited common sense and reasoning capabilities, and supervised fine-tuning alone suffers from poor generalization and insufficient physical understanding.
Method: A two-stage fine-tuning framework: supervised training acquires foundational knowledge through expert sequences, followed by RL with a rule-based reward function that considers long-horizon performance and action constraints to address visual-spatial understanding and reasoning shortcomings.
Result: The model trained on Qwen2.5-VL-3B significantly outperforms GPT-4o-mini by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
Conclusion: The proposed two-stage fine-tuning approach with RL effectively enhances embodied agents’ reasoning capabilities for long-horizon manipulation tasks, demonstrating that smaller models can outperform larger ones when properly trained with physical understanding considerations.
Abstract: Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-horizon manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue to face challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model’s shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraints in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
[283] Invoice Information Extraction: Methods and Performance Evaluation
Sai Yashwant, Anurag Dubey, Praneeth Paikray, Gantala Thulsiram
Main category: cs.AI
TL;DR: Methods for extracting structured information from invoices with evaluation metrics to assess accuracy against ground truth.
Details
Motivation: To provide a standardized way to compare different invoice extraction methods and identify field-specific performance strengths and weaknesses.
Method: Pre-process scanned/digital invoices, apply Docling and LlamaCloud Services to extract key fields (invoice number, date, total amount, vendor details), and establish evaluation framework with field-level precision, consistency checks, and exact match accuracy.
Result: Proposed evaluation metrics enable reliable assessment of extraction accuracy and comparison between different methods.
Conclusion: The evaluation framework provides standardized metrics for assessing invoice extraction methods and identifying performance variations across different fields.
Abstract: This paper presents methods for extracting structured information from invoice documents and proposes a set of evaluation metrics (EM) to assess the accuracy of the extracted data against annotated ground truth. The approach involves pre-processing scanned or digital invoices, applying Docling and LlamaCloud Services to identify and extract key fields such as invoice number, date, total amount, and vendor details. To ensure the reliability of the extraction process, we establish a robust evaluation framework comprising field-level precision, consistency check failures, and exact match accuracy. The proposed metrics provide a standardized way to compare different extraction methods and highlight strengths and weaknesses in field-specific performance.
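The field-level and exact-match metrics described above are straightforward to compute; the sketch below uses illustrative field names and data rather than the paper's dataset or exact metric definitions.

```python
FIELDS = ["invoice_number", "date", "total_amount", "vendor"]

def evaluate(predictions, ground_truth):
    # Per-field accuracy plus document-level exact match (all fields correct).
    per_field = {f: 0 for f in FIELDS}
    exact = 0
    for pred, gold in zip(predictions, ground_truth):
        correct = [pred.get(f) == gold.get(f) for f in FIELDS]
        for f, ok in zip(FIELDS, correct):
            per_field[f] += ok
        exact += all(correct)
    n = len(ground_truth)
    return {f: c / n for f, c in per_field.items()}, exact / n

preds = [{"invoice_number": "INV-1", "date": "2024-01-05", "total_amount": "99.00", "vendor": "Acme"}]
golds = [{"invoice_number": "INV-1", "date": "2024-01-05", "total_amount": "98.00", "vendor": "Acme"}]
print(evaluate(preds, golds))  # three fields correct, exact match = 0.0
```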
[284] The Right to Be Remembered: Preserving Maximally Truthful Digital Memory in the Age of AI
Alex Zhavoronkov, Dominika Wilczok, Roman Yampolskiy
Main category: cs.AI
TL;DR: The paper proposes a ‘Right To Be Remembered’ (RTBR) framework to address AI-driven information omission risks in LLMs, which can disproportionately suppress certain narratives while amplifying others, reshaping collective memory.
Details
Motivation: LLMs provide synthesized responses that feel authoritative but collapse multiple perspectives, concentrating information power in few vendors and creating risks of bias, omission, and gradual erasure of those with limited digital presence.
Method: The paper presents the concept of Right To Be Remembered (RTBR) as a framework to minimize AI-driven information omission risks and ensure fair treatment and truthfulness in generated content.
Result: The RTBR concept is proposed as a solution to address the threat of disproportionate suppression of certain narratives, individuals, or groups while preventing amplification of already prominent entities.
Conclusion: The Right To Be Remembered framework is necessary to counter the reshaping of collective memory by LLMs and ensure equitable information preservation and representation.
Abstract: Since the rapid expansion of large language models (LLMs), people have begun to rely on them for information retrieval. While traditional search engines display ranked lists of sources shaped by search engine optimization (SEO), advertising, and personalization, LLMs typically provide a synthesized response that feels singular and authoritative. While both approaches carry risks of bias and omission, LLMs may amplify the effect by collapsing multiple perspectives into one answer, reducing users' ability or inclination to compare alternatives. This concentrates power over information in a few LLM vendors whose systems effectively shape what is remembered and what is overlooked. As a result, certain narratives, individuals, or groups may be disproportionately suppressed, while others are disproportionately elevated. Over time, this creates a new threat: the gradual erasure of those with limited digital presence, and the amplification of those already prominent, reshaping collective memory. To address these concerns, this paper presents a concept of the Right To Be Remembered (RTBR), which encompasses minimizing the risk of AI-driven information omission, embracing the right of fair treatment, while ensuring that the generated content would be maximally truthful.
[285] MedRule-KG: A Knowledge-Graph–Steered Scaffold for Mathematical Reasoning with a Lightweight Verifier
Crystal Su
Main category: cs.AI
TL;DR: MedRule-KG is a typed knowledge graph with symbolic verifier that enforces mathematical/logical constraints in LLM reasoning, improving exact match from 76.7% to 100% on FDA benchmark.
Details
Motivation: LLMs often produce fluent reasoning steps while violating simple mathematical or logical constraints, creating unreliable outputs.Method: Introduces MedRule-KG (compact typed knowledge graph) coupled with symbolic verifier that encodes entities, relations, and domain rules, then checks predictions and applies minimal corrections for consistency.
Result: On 90-example FDA-derived benchmark: grounding in MedRule-KG improved exact match from 0.767 to 0.900; adding verifier yielded 1.000 EM while eliminating all rule violations.
Conclusion: MedRule-KG provides a general scaffold for safe mathematical reasoning and enables reliable constraint enforcement in LLM outputs.
Abstract: Large language models (LLMs) often produce fluent reasoning steps while violating simple mathematical or logical constraints. We introduce MedRule-KG, a compact typed knowledge graph coupled with a symbolic verifier, designed to enforce mathematically interpretable rules in reasoning tasks. MedRule-KG encodes entities, relations, and three domain-inspired rules, while the verifier checks predictions and applies minimal corrections to guarantee consistency. On a 90-example FDA-derived benchmark, grounding in MedRule-KG improves exact match (EM) from 0.767 to 0.900, and adding the verifier yields 1.000 EM while eliminating rule violations entirely. We demonstrate how MedRule-KG provides a general scaffold for safe mathematical reasoning, discuss ablations, and release code and data to encourage reproducibility.
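The verifier idea, checking a model prediction against a small set of typed rules and applying a minimal correction when a rule is violated, can be sketched as follows; the rule and record fields are hypothetical stand-ins, not the three domain rules released with the paper.

```python
# Hypothetical sketch of a rule-checking verifier with minimal correction.
# The rule and record fields are illustrative; the paper's actual rules differ.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]   # returns True if the prediction satisfies the rule
    fix: Callable[[dict], dict]     # returns a minimally corrected prediction

def verify(prediction: dict, rules: list[Rule]) -> dict:
    """Apply each rule in order; correct the prediction whenever a rule is violated."""
    for rule in rules:
        if not rule.check(prediction):
            prediction = rule.fix(prediction)
    return prediction

# Example rule: a reported dose must not exceed a stated maximum.
dose_rule = Rule(
    name="dose_below_max",
    check=lambda p: p["dose_mg"] <= p["max_dose_mg"],
    fix=lambda p: {**p, "dose_mg": p["max_dose_mg"]},
)

print(verify({"dose_mg": 900, "max_dose_mg": 600}, [dose_rule]))
```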
[286] A Brain Cell Type Resource Created by Large Language Models and a Multi-Agent AI System for Collaborative Community Annotation
Rongbin Li, Wenbo Chen, Zhao Li, Rodrigo Munoz-Castaneda, Jinbo Li, Neha S. Maurya, Arnav Solanki, Huan He, Hanwen Xing, Meaghan Ramlakhan, Zachary Wise, Zhuhao Wu, Hua Xu, Michael Hawrylycz, W. Jim Zheng
Main category: cs.AI
TL;DR: BRAINCELL-AID is a multi-agent AI system that combines free-text descriptions with ontology labels for accurate gene set annotation, achieving 77% correct annotations for mouse gene sets and successfully annotating 5,322 brain cell clusters.
Details
Motivation: Traditional gene set annotation methods like GSEA rely on well-curated annotations and perform poorly with poorly characterized genes, while LLMs struggle to represent complex biological knowledge in structured ontologies.Method: Developed a multi-agent AI system integrating free-text descriptions with ontology labels, using retrieval-augmented generation (RAG) to refine predictions with PubMed literature, reducing hallucinations and enhancing interpretability.
Result: Achieved 77% correct annotations for mouse gene sets among top predictions, annotated 5,322 brain cell clusters from mouse brain cell atlas, identified region-specific gene co-expression patterns and functional roles of gene ensembles, and identified Basal Ganglia-related cell types with neurologically meaningful descriptions.
Conclusion: BRAINCELL-AID creates a valuable resource for community-driven cell type annotation, enabling novel insights into brain cell function through accurate and robust gene set annotation.
Abstract: Single-cell RNA sequencing has transformed our ability to identify diverse cell types and their transcriptomic signatures. However, annotating these signatures-especially those involving poorly characterized genes-remains a major challenge. Traditional methods, such as Gene Set Enrichment Analysis (GSEA), depend on well-curated annotations and often perform poorly in these contexts. Large Language Models (LLMs) offer a promising alternative but struggle to represent complex biological knowledge within structured ontologies. To address this, we present BRAINCELL-AID (BRAINCELL-AID: https://biodataai.uth.edu/BRAINCELL-AID), a novel multi-agent AI system that integrates free-text descriptions with ontology labels to enable more accurate and robust gene set annotation. By incorporating retrieval-augmented generation (RAG), we developed a robust agentic workflow that refines predictions using relevant PubMed literature, reducing hallucinations and enhancing interpretability. Using this workflow, we achieved correct annotations for 77% of mouse gene sets among their top predictions. Applying this approach, we annotated 5,322 brain cell clusters from the comprehensive mouse brain cell atlas generated by the BRAIN Initiative Cell Census Network, enabling novel insights into brain cell function by identifying region-specific gene co-expression patterns and inferring functional roles of gene ensembles. BRAINCELL-AID also identifies Basal Ganglia-related cell types with neurologically meaningful descriptions. Hence, we create a valuable resource to support community-driven cell type annotation.
[287] FST.ai 2.0: An Explainable AI Ecosystem for Fair, Fast, and Inclusive Decision-Making in Olympic and Paralympic Taekwondo
Keivan Shariatmadar, Ahmad Osman, Ramin Ray, Kisam Kim
Main category: cs.AI
TL;DR: FST.ai 2.0 is an explainable AI ecosystem for Taekwondo that integrates pose-based action recognition, uncertainty modeling, and explainability overlays to support referees, coaches, and athletes in real-time competitions and training.
Details
Motivation: To address the challenge of fair, transparent, and explainable decision-making in Olympic and Paralympic combat sports, particularly Taekwondo.Method: Integrates pose-based action recognition using graph convolutional networks (GCNs), epistemic uncertainty modeling through credal sets, explainability overlays for visual decision support, and interactive dashboards for human-AI collaboration.
Result: Experimental validation shows 85% reduction in decision review time and 93% referee trust in AI-assisted decisions.
Conclusion: The framework establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment, representing a step toward equitable, accountable, and human-aligned AI in sports.
Abstract: Fair, transparent, and explainable decision-making remains a critical challenge in Olympic and Paralympic combat sports. This paper presents FST.ai 2.0, an explainable AI ecosystem designed to support referees, coaches, and athletes in real time during Taekwondo competitions and training. The system integrates pose-based action recognition using graph convolutional networks (GCNs), epistemic uncertainty modeling through credal sets, and explainability overlays for visual decision support. A set of interactive dashboards enables human–AI collaboration in referee evaluation, athlete performance analysis, and Para-Taekwondo classification. Beyond automated scoring, FST.ai 2.0 incorporates modules for referee training, fairness monitoring, and policy-level analytics within the World Taekwondo ecosystem. Experimental validation on competition data demonstrates an 85% reduction in decision review time and 93% referee trust in AI-assisted decisions. The framework thus establishes a transparent and extensible pipeline for trustworthy, data-driven officiating and athlete assessment. By bridging real-time perception, explainable inference, and governance-aware design, FST.ai 2.0 represents a step toward equitable, accountable, and human-aligned AI in sports.
cs.SD
[288] AMAuT: A Flexible and Efficient Multiview Audio Transformer Framework Trained from Scratch
Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, Tissa Chandesa
Main category: cs.SD
TL;DR: AMAuT is a training-from-scratch audio transformer framework that supports arbitrary sample rates and audio lengths, achieving up to 99.8% accuracy while using less than 3% of GPU hours compared to pre-trained models.
Details
Motivation: Existing foundational audio models like SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo are limited by fixed input rates and durations, hindering their reusability and flexibility.Method: AMAuT integrates four key components: augmentation-driven multiview learning, conv1+conv7+conv1 1D CNN bottleneck, dual CLS+TAL tokens for bidirectional context, and test-time adaptation/augmentation (TTA²).
Result: Experiments on five benchmarks show AMAuT achieves up to 99.8% accuracy while consuming less than 3% of GPU hours required by comparable pre-trained models.
Conclusion: AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
Abstract: Recent foundational models, SSAST, EAT, HuBERT, Qwen-Audio, and Audio Flamingo, achieve top-tier results across standard audio benchmarks but are limited by fixed input rates and durations, hindering their reusability. This paper introduces the Augmentation-driven Multiview Audio Transformer (AMAuT), a training-from-scratch framework that eliminates the dependency on pre-trained weights while supporting arbitrary sample rates and audio lengths. AMAuT integrates four key components: (1) augmentation-driven multiview learning for robustness, (2) a conv1 + conv7 + conv1 one-dimensional CNN bottleneck for stable temporal encoding, (3) dual CLS + TAL tokens for bidirectional context representation, and (4) test-time adaptation/augmentation (TTA^2) to improve inference reliability. Experiments on five public benchmarks, AudioMNIST, SpeechCommands V1 & V2, VocalSound, and CochlScene, show that AMAuT achieves accuracies up to 99.8% while consuming less than 3% of the GPU hours required by comparable pre-trained models. Thus, AMAuT presents a highly efficient and flexible alternative to large pre-trained models, making state-of-the-art audio classification accessible in computationally constrained settings.
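The conv1 + conv7 + conv1 one-dimensional bottleneck can be read as a pointwise-wide-pointwise stack over the time axis. Below is a minimal sketch of that structure; the channel sizes and activation are assumptions for illustration, not the authors' configuration.

```python
# Minimal sketch of a conv1 + conv7 + conv1 1D bottleneck (channel sizes are assumptions).
import torch
import torch.nn as nn

class Conv1d171Bottleneck(nn.Module):
    def __init__(self, in_ch: int, mid_ch: int, out_ch: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, mid_ch, kernel_size=1),
            nn.GELU(),
            nn.Conv1d(mid_ch, mid_ch, kernel_size=7, padding=3),  # wide temporal context
            nn.GELU(),
            nn.Conv1d(mid_ch, out_ch, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) -> (batch, out_ch, time)
        return self.net(x)

x = torch.randn(2, 64, 400)                        # e.g. 64 spectrogram channels, 400 frames
print(Conv1d171Bottleneck(64, 128, 256)(x).shape)  # torch.Size([2, 256, 400])
```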
[289] Time delay embeddings to characterize the timbre of musical instruments using Topological Data Analysis: a study on synthetic and real data
Gakusei Sato, Hiroya Nakao, Riccardo Muolo
Main category: cs.SD
TL;DR: This paper investigates how different time delay embeddings affect Topological Data Analysis (TDA) results for timbre analysis, identifying specific delays that enhance harmonic structure detection in both synthetic and real audio signals.
Details
Motivation: Traditional approaches to timbre analysis often overlook subtle sound characteristics, and TDA's application to timbre has been limited due to unclear sound representation methods. The study aims to find effective ways to represent sound for TDA to capture complex timbral patterns.Method: The researchers use time delay embeddings with different delays and apply Topological Data Analysis to both synthetic and real audio signals. They specifically investigate delays related to fractions of the fundamental period to detect harmonic structures.
Result: The findings show that specific time delays, particularly those related to fractions of the fundamental period, allow TDA to reveal key harmonic features and distinguish between integer and non-integer harmonics. The method is effective for both synthetic and real musical instrument sounds.
Conclusion: The study demonstrates that carefully chosen time delay embeddings enable TDA to effectively analyze timbre characteristics. This opens possibilities for future work using higher-dimensional embeddings and additional persistence statistics for more complex sound analysis.
Abstract: Timbre allows us to distinguish between sounds even when they share the same pitch and loudness, playing an important role in music, instrument recognition, and speech. Traditional approaches, such as frequency analysis or machine learning, often overlook subtle characteristics of sound. Topological Data Analysis (TDA) can capture complex patterns, but its application to timbre has been limited, partly because it is unclear how to represent sound effectively for TDA. In this study, we investigate how different time delay embeddings affect TDA results. Using both synthetic and real audio signals, we identify time delays that enhance the detection of harmonic structures. Our findings show that specific delays, related to fractions of the fundamental period, allow TDA to reveal key harmonic features and distinguish between integer and non-integer harmonics. The method is effective for synthetic and real musical instrument sounds and opens the way for future works, which could extend it to more complex sounds using higher-dimensional embeddings and additional persistence statistics.
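Time delay embedding itself is standard: each sample of the signal is mapped to a point built from lagged copies of itself, and the resulting point cloud is what TDA then analyzes. A minimal version, with the delay chosen as a fraction of the fundamental period in line with the paper's finding, looks like this (the specific fraction and toy tone are illustrative).

```python
# Standard time delay embedding of a 1D signal (the persistent-homology step is not shown).
import numpy as np

def delay_embed(signal: np.ndarray, dim: int, delay: int) -> np.ndarray:
    """Map a 1D signal to a point cloud in R^dim using lagged copies."""
    n = len(signal) - (dim - 1) * delay
    return np.stack([signal[i : i + n] for i in range(0, dim * delay, delay)], axis=1)

fs, f0 = 8000, 220.0                     # sample rate and fundamental frequency
t = np.arange(0, 0.1, 1 / fs)
x = np.sin(2 * np.pi * f0 * t) + 0.4 * np.sin(2 * np.pi * 3 * f0 * t)  # toy harmonic tone
period = int(round(fs / f0))
cloud = delay_embed(x, dim=2, delay=period // 4)   # delay = quarter of the fundamental period
print(cloud.shape)
```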
[290] Wireless Hearables With Programmable Speech AI Accelerators
Malek Itani, Tuochao Chen, Arun Raghavan, Gavriel Kohlberg, Shyamnath Gollakota
Main category: cs.SD
TL;DR: NeuralAids is a fully on-device speech AI system for wireless hearables that enables real-time speech enhancement and denoising on compact, battery-constrained devices through hardware-software co-design.
Details
Motivation: Conventional wisdom suggests that designing ultra-compact, battery-constrained wireless hearables with on-device speech AI is challenging due to high computational demands of streaming deep learning models and strict computational/I/O constraints.Method: Three key technical contributions: 1) wireless hearable platform with speech AI accelerator, 2) optimized dual-path neural network for low-latency speech enhancement, 3) hardware-software co-design using mixed-precision quantization and quantization-aware training.
Result: System processes 6 ms audio chunks in real-time with 5.54 ms inference time while consuming 71.6 mW. Outperforms prior on-device models in speech quality and noise suppression in real-world evaluations with 28 participants.
Conclusion: Paves the way for next-generation intelligent wireless hearables that can enhance hearing entirely on-device, bridging the gap between state-of-the-art deep learning and low-power AI hardware.
Abstract: The conventional wisdom has been that designing ultra-compact, battery-constrained wireless hearables with on-device speech AI models is challenging due to the high computational demands of streaming deep learning models. Speech AI models require continuous, real-time audio processing, imposing strict computational and I/O constraints. We present NeuralAids, a fully on-device speech AI system for wireless hearables, enabling real-time speech enhancement and denoising on compact, battery-constrained devices. Our system bridges the gap between state-of-the-art deep learning for speech enhancement and low-power AI hardware by making three key technical contributions: 1) a wireless hearable platform integrating a speech AI accelerator for efficient on-device streaming inference, 2) an optimized dual-path neural network designed for low-latency, high-quality speech enhancement, and 3) a hardware-software co-design that uses mixed-precision quantization and quantization-aware training to achieve real-time performance under strict power constraints. Our system processes 6 ms audio chunks in real-time, achieving an inference time of 5.54 ms while consuming 71.6 mW. In real-world evaluations, including a user study with 28 participants, our system outperforms prior on-device models in speech quality and noise suppression, paving the way for next-generation intelligent wireless hearables that can enhance hearing entirely on-device.
[291] Efficient Interleaved Speech Modeling through Knowledge Distillation
Mohammadmahdi Nouriborji, Morteza Rohanian
Main category: cs.SD
TL;DR: TinyWave is a family of 2B-parameter compact speech generation models trained via layer-aligned distillation, achieving 3x compression with minimal performance loss while supporting speech-only and mixed speech-text generation.
Details
Motivation: Current speech language models are too large and slow for many deployment environments, requiring more compact and efficient models for real-world applications.Method: Layer-aligned distillation that matches hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x, trained on 50,000 hours of public audio.
Result: TinyWave achieves within 1.4 normalized perplexity points of its teacher on Libri-Light, and 93-97% of teacher performance on StoryCloze and SALMon tasks, outperforming size-matched baselines.
Conclusion: The models are optimized for commodity hardware deployment, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments, with released code and models for reproducible research.
Abstract: Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher’s performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
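Layer-aligned distillation combines several matching terms. A minimal sketch of that loss structure, with hidden-state, attention-map, and softened-logit terms, a linear projection to bridge student and teacher widths, and hypothetical weights, is shown below; it illustrates the shape of the objective rather than TinyWave's exact recipe.

```python
# Sketch of a layer-aligned distillation loss: hidden states, attention maps, soft logits.
# Layer pairing, projection, temperature, and weights are illustrative assumptions.
import torch
import torch.nn.functional as F

def distill_loss(student_hidden, teacher_hidden, student_attn, teacher_attn,
                 student_logits, teacher_logits, proj, T=2.0,
                 w_hidden=1.0, w_attn=1.0, w_logit=1.0):
    hidden_term = sum(F.mse_loss(proj(s), t) for s, t in zip(student_hidden, teacher_hidden))
    attn_term = sum(F.mse_loss(s, t) for s, t in zip(student_attn, teacher_attn))
    logit_term = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T
    return w_hidden * hidden_term + w_attn * attn_term + w_logit * logit_term

# Toy shapes: 2 paired layers, batch 4, seq 16, student dim 256, teacher dim 512, vocab 1000.
proj = torch.nn.Linear(256, 512)
s_h = [torch.randn(4, 16, 256) for _ in range(2)]
t_h = [torch.randn(4, 16, 512) for _ in range(2)]
s_a = [torch.rand(4, 8, 16, 16) for _ in range(2)]
t_a = [torch.rand(4, 8, 16, 16) for _ in range(2)]
print(distill_loss(s_h, t_h, s_a, t_a, torch.randn(4, 1000), torch.randn(4, 1000), proj))
```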
[292] MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Xiquan Li, Junxi Liu, Yuzhe Liang, Zhikang Niu, Wenxi Chen, Xie Chen
Main category: cs.SD
TL;DR: MeanAudio is a fast text-to-audio generator that achieves realistic sound generation in just one function evaluation, providing 100x speedup over diffusion-based systems while maintaining high quality.
Details
Motivation: Current text-to-audio generation systems suffer from slow inference speeds that hinder audio creation efficiency and smoothness.Method: Uses MeanFlow objective with guided velocity target, enhanced Flux-style transformer with dual text encoders, and instantaneous-to-mean curriculum for efficient training on consumer GPUs.
Result: Achieves state-of-the-art performance in single-step audio generation with real-time factor of 0.013 on RTX 3090 (100x speedup over diffusion-based systems), and strong multi-step generation performance.
Conclusion: MeanAudio enables fast and faithful text-to-audio generation with significant speed improvements while maintaining high synthesis quality.
Abstract: Recent years have witnessed remarkable progress in Text-to-Audio Generation (TTA), providing sound creators with powerful tools to transform inspirations into vivid audio. Yet despite these advances, current TTA systems often suffer from slow inference speed, which greatly hinders the efficiency and smoothness of audio creation. In this paper, we present MeanAudio, a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). MeanAudio leverages: (i) the MeanFlow objective with guided velocity target that significantly accelerates inference speed, (ii) an enhanced Flux-style transformer with dual text encoders for better semantic alignment and synthesis quality, and (iii) an efficient instantaneous-to-mean curriculum that speeds up convergence and enables training on consumer-grade GPUs. Through a comprehensive evaluation study, we demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also shows strong performance in multi-step generation, enabling smooth transitions across successive synthesis steps.
cs.LG
[293] 3D Optimization for AI Inference Scaling: Balancing Accuracy, Cost, and Latency
Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni
Main category: cs.LG
TL;DR: A 3D optimization framework for AI inference scaling that jointly optimizes accuracy, cost, and latency, overcoming limitations of 1D and 2D approaches.
Details
Motivation: Traditional AI inference scaling uses 1D heuristics or 2D trade-offs that ignore cost and latency constraints, failing to capture the full decision space needed for practical deployment.Method: Monte Carlo simulations across three scenarios and nine simulated LLMs, evaluating four optimization methods for 3D multi-objective optimization to enable constraints-aware inference scaling.
Result: Knee-point optimization achieves the best balance across objectives, while accuracy-maximization remains favorable when precision is prioritized. The framework reveals feasible spaces that 1D/2D methods miss.
Conclusion: The 3D optimization framework provides a theoretical foundation for deployment-aware inference scaling across diverse operational contexts, enabling environment-adaptive selection of inference scaling parameters.
Abstract: AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., performance vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraints-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods to address the 3D multi-objective optimization (MOO) problem. Framing inference scaling in MOO shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling parameter k. Results show that knee-point optimization achieves the best balance, while accuracy-maximization remains favorable when precision is prioritized. The framework establishes a theoretical foundation for deployment-aware inference scaling across diverse operational contexts.
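Knee-point selection on a multi-objective front can be illustrated with a simple distance-to-ideal rule over normalized (accuracy, cost, latency) candidates; this is a generic sketch of the idea, not the paper's specific optimizer or its Monte Carlo setup.

```python
# Generic knee-point sketch: pick the candidate closest to the normalized ideal point.
import numpy as np

def knee_point(accuracy, cost, latency):
    """accuracy is maximized; cost and latency are minimized."""
    objs = np.stack([-np.asarray(accuracy), np.asarray(cost), np.asarray(latency)], axis=1)
    lo, hi = objs.min(axis=0), objs.max(axis=0)
    norm = (objs - lo) / np.where(hi > lo, hi - lo, 1.0)   # each objective scaled to [0, 1]
    return int(np.argmin(np.linalg.norm(norm, axis=1)))    # closest to the all-zeros ideal

# Candidates indexed by the inference-scaling parameter k (values are illustrative).
acc = [0.70, 0.78, 0.82, 0.83]
cost = [1.0, 2.0, 4.0, 8.0]
lat = [0.5, 1.0, 2.0, 4.0]
print("chosen k index:", knee_point(acc, cost, lat))
```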
[294] Large Connectome Model: An fMRI Foundation Model of Brain Connectomes Empowered by Brain-Environment Interaction in Multitask Learning Landscape
Ziquan Wei, Tingting Dan, Guorong Wu
Main category: cs.LG
TL;DR: A foundation model for functional neuroimages using multitask pretraining with brain-environment interactions and semi-supervised finetuning, achieving promising results in various clinical applications.
Details
Motivation: Current AI models for functional neuroimages are limited by small sample sizes, and existing foundation models are suboptimal for downstream tasks due to misalignment between self-supervision and brain-to-outcome relationships.Method: Multitask pretraining by tokenizing multiple brain-environment interactions (BEI) and semi-supervised finetuning using pseudo-labels from pretrained BEI, leveraging rich environmental variables and demographic data.
Result: Promising results in sex prediction, human behavior recognition, and early diagnosis of Autism, Parkinson’s disease, Alzheimer’s disease, and Schizophrenia.
Conclusion: The proposed foundation model shows great potential to facilitate neuroimaging applications in clinical routines by addressing sample size limitations and improving task alignment.
Abstract: A reliable foundation model of functional neuroimages is critical to promote clinical applications where the performance of current AI models is significantly impeded by a limited sample size. To that end, tremendous efforts have been made to pretraining large models on extensive unlabeled fMRI data using scalable self-supervised learning. Since self-supervision is not necessarily aligned with the brain-to-outcome relationship, most foundation models are suboptimal to the downstream task, such as predicting disease outcomes. By capitalizing on rich environmental variables and demographic data along with an unprecedented amount of functional neuroimages, we form the brain modeling as a multitask learning and present a scalable model architecture for (i) multitask pretraining by tokenizing multiple brain-environment interactions (BEI) and (ii) semi-supervised finetuning by assigning pseudo-labels of pretrained BEI. We have evaluated our foundation model on a variety of applications, including sex prediction, human behavior recognition, and disease early diagnosis of Autism, Parkinson’s disease, Alzheimer’s disease, and {Schizophrenia}, where promising results indicate the great potential to facilitate current neuroimaging applications in clinical routines.
[295] ADPO: Anchored Direct Preference Optimization
Wang Zixian
Main category: cs.LG
TL;DR: ADPO generalizes DPO with soft preferences, reference-policy anchoring, and groupwise extensions, improving performance under noise and providing practical variants for different scenarios.
Details
Motivation: Standard DPO assumes hard binary labels and pairwise comparisons, which may not handle uncertainty and noise well. ADPO aims to address these limitations through a more flexible framework.Method: ADPO introduces: (i) soft preference probabilities to encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors for training stability via groupwise shift invariance and implicit KL regularization; (iii) listwise preference modeling through Plackett-Luce distributions.
Result: In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, while KDE smoothing achieves 0.68 vs 0.32 under heavy-tailed contamination (112% relative gain). In sequential RL, anchoring improves noisy-preference performance by 15-29%.
Conclusion: Use pairwise anchored Soft-DPO for clean or moderate noise, and KDE-based listwise ADPO for extreme contamination. The framework successfully transfers from single-step to multi-step settings.
Abstract: Anchored Direct Preference Optimization (ADPO) is a unified framework that generalizes Direct Preference Optimization (DPO) with soft preferences, reference-policy anchoring, and groupwise extensions. While standard DPO assumes hard binary labels and pairwise comparisons, ADPO introduces: (i) soft preference probabilities that encode uncertainty and mitigate gradient drift; (ii) arbitrary reference-policy anchors that stabilize training via groupwise shift invariance and implicit KL regularization; and (iii) listwise preference modeling through Plackett-Luce distributions. We prove that DPO, Bradley-Terry objectives, and Top-1-vs-Rest formulations emerge as special cases. ADPO yields three practical variants: pairwise anchored Soft-DPO, listwise anchored Soft-DPO with raw rewards, and KDE-based listwise smoothing for heavy-tailed noise. In contextual bandits, anchoring improves WinMass by 38-63% over standard DPO, while KDE smoothing achieves 0.68 vs 0.32 under heavy-tailed contamination (112% relative gain). In sequential reinforcement learning (CartPole, LunarLander), anchoring improves noisy-preference performance by 15-29%, confirming transfer from single-step to multi-step settings. Experiments with 10-256 parameter models provide clear guidance: use pairwise anchored Soft-DPO for clean or moderate noise, and KDE-based listwise ADPO for extreme contamination.
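The two ingredients named above, soft preference targets and a reference-policy anchor, can be illustrated with a pairwise DPO-style loss that takes a probability rather than a hard label and measures log-ratios against a frozen anchor policy. This is a hedged sketch of the general shape, not the exact ADPO objective; with hard labels and the anchor set to the usual reference policy it reduces to standard DPO.

```python
# Hedged sketch of a soft, anchored pairwise DPO-style objective (not the exact ADPO loss).
import torch
import torch.nn.functional as F

def soft_anchored_dpo_loss(logp_w, logp_l, anchor_logp_w, anchor_logp_l,
                           pref_prob, beta=0.1):
    """
    logp_w / logp_l: policy log-probs of the 'preferred' and 'dispreferred' responses.
    anchor_logp_*:   the same quantities under a frozen anchor/reference policy.
    pref_prob:       soft probability in [0, 1] that the first response is preferred.
    """
    logits = beta * ((logp_w - anchor_logp_w) - (logp_l - anchor_logp_l))
    return F.binary_cross_entropy_with_logits(logits, pref_prob)

loss = soft_anchored_dpo_loss(
    logp_w=torch.tensor([-4.2, -3.1]), logp_l=torch.tensor([-5.0, -2.9]),
    anchor_logp_w=torch.tensor([-4.0, -3.0]), anchor_logp_l=torch.tensor([-4.8, -3.2]),
    pref_prob=torch.tensor([0.9, 0.6]),
)
print(loss)
```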
[296] Steering Autoregressive Music Generation with Recursive Feature Machines
Daniel Zhao, Daniel Beaglehole, Taylor Berg-Kirkpatrick, Julian McAuley, Zachary Novack
Main category: cs.LG
TL;DR: MusicRFM enables fine-grained control over frozen music models using Recursive Feature Machines to discover interpretable concept directions in activation space, allowing real-time steering of musical attributes without retraining.
Details
Motivation: Existing controllable music generation methods often require model retraining or introduce audible artifacts, limiting their practical application.Method: Train lightweight RFM probes to discover concept directions in MusicGen’s hidden states, then inject these directions during inference to guide generation in real-time with dynamic schedules and multi-property enforcement.
Result: Increased accuracy of generating target musical notes from 0.23 to 0.82 while maintaining text prompt adherence within 0.02 of unsteered baseline.
Conclusion: MusicRFM effectively balances control and generation quality, enabling interpretable musical control without compromising prompt fidelity, and provides a foundation for further RFM exploration in music.
Abstract: Controllable music generation remains a significant challenge, with existing methods often requiring model retraining or introducing audible artifacts. We introduce MusicRFM, a framework that adapts Recursive Feature Machines (RFMs) to enable fine-grained, interpretable control over frozen, pre-trained music models by directly steering their internal activations. RFMs analyze a model’s internal gradients to produce interpretable “concept directions”, or specific axes in the activation space that correspond to musical attributes like notes or chords. We first train lightweight RFM probes to discover these directions within MusicGen’s hidden states; then, during inference, we inject them back into the model to guide the generation process in real-time without per-step optimization. We present advanced mechanisms for this control, including dynamic, time-varying schedules and methods for the simultaneous enforcement of multiple musical properties. Our method successfully navigates the trade-off between control and generation quality: we can increase the accuracy of generating a target musical note from 0.23 to 0.82, while text prompt adherence remains within approximately 0.02 of the unsteered baseline, demonstrating effective control with minimal impact on prompt fidelity. We release code to encourage further exploration on RFMs in the music domain.
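Steering a frozen model by injecting a concept direction into its hidden states is typically done with a forward hook that shifts activations along the discovered axis. The sketch below shows that mechanism on a toy network; the model, layer choice, direction, and strength are stand-ins, and the RFM probe that would discover the direction is not shown.

```python
# Sketch of injecting a concept direction into a frozen model's hidden states
# via a forward hook. The model, layer, and direction here are toy stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
direction = torch.randn(64)
direction = direction / direction.norm()        # unit-norm concept direction
alpha = 2.0                                     # steering strength

def steer(module, inputs, output):
    return output + alpha * direction           # shift activations along the concept axis

handle = model[0].register_forward_hook(steer)
with torch.no_grad():
    out = model(torch.randn(1, 32))             # steered forward pass
handle.remove()
print(out.shape)
```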
[297] Benchmarking On-Device Machine Learning on Apple Silicon with MLX
Oluwaseun A. Ajayi, Ogundepo Odunayo
Main category: cs.LG
TL;DR: Performance evaluation of MLX framework for transformer inference on Apple silicon devices, comparing with PyTorch on NVIDIA GPU.
Details
Motivation: The need for frameworks that can deploy LLMs on smaller devices like laptops and mobile phones, leveraging on-device hardware capabilities.Method: Created MLX-transformers framework to implement transformers in MLX, convert PyTorch checkpoints to MLX format, and benchmark inference latency of BERT, RoBERTa, and XLM-RoBERTa models on Apple Silicon devices vs NVIDIA GPU.
Result: MLX demonstrates potential for efficient on-device ML applications within Apple’s ecosystem.
Conclusion: MLX enables seamless execution of transformer models from Hugging Face on Apple Silicon, making on-device ML more accessible and efficient.
Abstract: The recent widespread adoption of Large Language Models (LLMs) and machine learning in general has sparked research interest in exploring the possibilities of deploying these models on smaller devices such as laptops and mobile phones. This creates a need for frameworks and approaches that are capable of taking advantage of on-device hardware. The MLX framework was created to address this need. It is a framework optimized for machine learning (ML) computations on Apple silicon devices, facilitating easier research, experimentation, and prototyping. This paper presents a performance evaluation of MLX, focusing on inference latency of transformer models. We compare the performance of different transformer architecture implementations in MLX with their PyTorch counterparts. For this research we create a framework called MLX-transformers which includes different transformer implementations in MLX and downloads the model checkpoints in PyTorch and converts them to the MLX format. By leveraging the advanced architecture and capabilities of Apple Silicon, MLX-Transformers enables seamless execution of transformer models directly sourced from Hugging Face, eliminating the need for checkpoint conversion often required when porting models between frameworks. Our study benchmarks different transformer models on two Apple Silicon MacBook devices against an NVIDIA CUDA GPU. Specifically, we compare the inference latency performance of models with the same parameter sizes and checkpoints. We evaluate the performance of BERT, RoBERTa, and XLM-RoBERTa models, with the intention of extending future work to include models of different modalities, thus providing a more comprehensive assessment of MLX’s capabilities. The results highlight MLX’s potential in enabling efficient and more accessible on-device ML applications within Apple’s ecosystem.
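Latency benchmarks of this kind boil down to warm-up runs followed by timed forward passes. Below is a framework-agnostic sketch of that measurement loop; it deliberately avoids MLX- or PyTorch-specific calls, and the wrapped workload is a dummy stand-in for a model forward pass.

```python
# Framework-agnostic sketch of a latency measurement loop: warm-up, then timed runs.
import time
import statistics

def benchmark(run_once, warmup: int = 5, iters: int = 50) -> dict:
    for _ in range(warmup):
        run_once()                               # warm up caches / lazy compilation
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        run_once()
        times.append((time.perf_counter() - t0) * 1e3)
    return {"mean_ms": statistics.mean(times), "p50_ms": statistics.median(times)}

# Usage: wrap a forward pass, e.g. benchmark(lambda: model(batch)); here a dummy workload.
print(benchmark(lambda: sum(i * i for i in range(100_000))))
```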
[298] Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
Omar El mansouri, Mohamed El Amine Seddik, Salem Lahlou
Main category: cs.LG
TL;DR: Noise-corrected variants of GRPO and Dr.GRPO that model reward corruption as Bernoulli noise and apply a correction to debias learning signals, improving performance on math and code tasks under noisy reward conditions.
Details
Motivation: RLHF/RLVR methods are highly sensitive to noise from inconsistent or erroneous rewards, but the interaction between such noise and group-based policy optimization methods remains underexplored.Method: Extends Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) with an explicit Bernoulli model of reward corruption; after estimating reward flip probabilities, a noise correction debiases the learning signal, yielding provably unbiased gradient estimates.
Result: Consistent improvements across math and code tasks with gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions.
Conclusion: Bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.
Abstract: Reinforcement learning from human feedback (RLHF) or verifiable rewards (RLVR), the standard paradigm for aligning LLMs or building recent SOTA reasoning models, is highly sensitive to noise from inconsistent or erroneous rewards. Yet, the interaction between such noise and widely used group-based policy optimization methods remains underexplored. We introduce a noise-robust Group Relative Policy Optimization (GRPO) and Done Right GRPO (Dr.GRPO) framework that explicitly models reward corruption as Bernoulli noise. Our method applies noise correction after estimating reward flip probabilities to debias the learning signal, yielding provably unbiased gradient estimates. Theoretical analysis shows that group-based methods inherently mitigate individual-level noise, and our correction strategy amplifies this robustness. Empirically, we observe consistent improvements across math and code tasks when applying our noise correction to standard reward model usage, with particular gains of up to 6.7 percentage points in accuracy on math tasks and 1.5 on code tasks under realistic reward model conditions. This work bridges label-noise correction from supervised learning with modern RLHF, offering both theoretical insights and a practical algorithm for noisy real-world deployment.
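The debiasing step rests on standard label-noise correction: if a binary reward is flipped with a known symmetric probability, a simple rescaling makes the corrected reward unbiased for the true one. Whether the paper uses exactly this estimator is an assumption; the sketch below only illustrates the principle.

```python
# Sketch of debiasing Bernoulli-corrupted binary rewards (symmetric flip rate rho).
# The idea: E[corrected | true reward] equals the true reward when rho is known.
import numpy as np

def debias_rewards(observed: np.ndarray, rho: float) -> np.ndarray:
    """observed: binary rewards possibly flipped with probability rho (< 0.5)."""
    assert 0.0 <= rho < 0.5
    return (observed - rho) / (1.0 - 2.0 * rho)

rng = np.random.default_rng(0)
true_r = rng.integers(0, 2, size=100_000).astype(float)
rho = 0.2
flips = rng.random(true_r.shape) < rho
observed = np.where(flips, 1.0 - true_r, true_r)
corrected = debias_rewards(observed, rho)
print(observed.mean(), corrected.mean(), true_r.mean())  # corrected mean tracks the true mean
```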
[299] Application of Reduced-Order Models for Temporal Multiscale Representations in the Prediction of Dynamical Systems
Elias Al Ghazal, Jad Mounayer, Beatriz Moya, Sebastian Rodriguez, Chady Ghnatios, Francisco Chinesta
Main category: cs.LG
TL;DR: Three multiscale learning approaches using Partition of Unity with neural networks, SVD for mode separation, and Sparse High-Order SVD for limited measurements to capture both macro- and micro-scale dynamics in complex systems.
Details
Motivation: Traditional machine learning methods fail to capture high-frequency behaviors and multiscale dynamics in complex systems with nonlinearities and sensitivity to initial conditions.Method: 1) Partition of Unity method with neural networks to decompose dynamics into local components; 2) Singular Value Decomposition to extract dominant modes separating macro/micro dynamics; 3) Sparse High-Order SVD for reconstruction from limited measurements.
Result: The framework accurately captures both coarse and fine dynamics, making it effective for real-world applications with complex multiscale phenomena and adaptable to higher-dimensional systems with incomplete observations.
Conclusion: The proposed approaches provide comprehensive approximation and interpretation across all time scales in multiscale phenomena, overcoming limitations of traditional methods.
Abstract: Modeling and predicting the dynamics of complex multiscale systems remains a significant challenge due to their inherent nonlinearities and sensitivity to initial conditions, as well as limitations of traditional machine learning methods that fail to capture high frequency behaviours. To overcome these difficulties, we propose three approaches for multiscale learning. The first leverages the Partition of Unity (PU) method, integrated with neural networks, to decompose the dynamics into local components and directly predict both macro- and micro-scale behaviors. The second applies the Singular Value Decomposition (SVD) to extract dominant modes that explicitly separate macro- and micro-scale dynamics. Since full access to the data matrix is rarely available in practice, we further employ a Sparse High-Order SVD to reconstruct multiscale dynamics from limited measurements. Together, these approaches ensure that both coarse and fine dynamics are accurately captured, making the framework effective for real-world applications involving complex, multi-scale phenomena and adaptable to higher-dimensional systems with incomplete observations, by providing an approximation and interpretation in all time scales present in the phenomena under study.
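The SVD-based separation of macro- and micro-scale dynamics can be shown on a toy snapshot matrix: the dominant singular modes capture the slow, large-scale behavior, and the residual carries the fast fluctuations. The toy field and the choice of two dominant modes below are illustrative, not the paper's test cases.

```python
# Sketch of separating slow (macro) and fast (micro) dynamics with a truncated SVD
# of a snapshot matrix; the toy data and mode count are illustrative.
import numpy as np

t = np.linspace(0, 10, 500)
x = np.linspace(0, 1, 64)
# Toy multiscale field: slow travelling wave plus a small fast oscillation.
data = (np.sin(2 * np.pi * (x[:, None] - 0.05 * t))
        + 0.05 * np.sin(2 * np.pi * 40 * t) * np.cos(8 * np.pi * x[:, None]))

U, s, Vt = np.linalg.svd(data, full_matrices=False)
r = 2                                            # number of dominant (macro) modes
macro = U[:, :r] @ np.diag(s[:r]) @ Vt[:r]
micro = data - macro
print("relative energy in macro modes:", (s[:r] ** 2).sum() / (s ** 2).sum())
```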
[300] BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping
Zhiheng Xi, Xin Guo, Yang Nan, Enyu Zhou, Junrui Shen, Wenxiang Chen, Jiaqi Liu, Jixuan Huang, Zhihao Zhang, Honglin Guo, Xun Deng, Zhikai Lei, Miao Zheng, Guoteng Wang, Shuo Zhang, Peng Sun, Rui Zheng, Hang Yan, Tao Gui, Qi Zhang, Xuanjing Huang
Main category: cs.LG
TL;DR: BAPO addresses RL optimization challenges in off-policy LLM training by dynamically adjusting clipping bounds to balance positive/negative contributions and preserve entropy, achieving state-of-the-art performance.
Details
Motivation: RL in off-policy settings improves sample efficiency but faces challenges: policy entropy declines sharply, optimization becomes unstable and may collapse due to negative-advantage sample dominance and fixed clipping mechanisms blocking entropy-increasing updates.Method: Proposed BAPO (Balanced Policy Optimization with Adaptive Clipping) - dynamically adjusts clipping bounds to re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization in off-policy scenarios.
Result: BAPO achieves fast, stable, and data-efficient training across diverse off-policy scenarios. 7B model surpasses open-source counterparts, 32B model achieves SOTA results among same-scale models and outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
Conclusion: BAPO effectively addresses RL optimization challenges in off-policy LLM training through adaptive clipping, enabling stable and efficient training while achieving superior performance compared to existing methods.
Abstract: Reinforcement learning (RL) has recently become the core paradigm for aligning and strengthening large language models (LLMs). Yet, applying RL in off-policy settings–where stale data from past policies are used for training–improves sample efficiency, but remains challenging: policy entropy declines sharply, optimization often becomes unstable and may even collapse. Through theoretical and empirical analysis, we identify two key insights: (i) an imbalance in optimization, where negative-advantage samples dominate the policy gradient, suppressing useful behaviors and risking gradient explosions; and (ii) the derived Entropy-Clip Rule, which reveals that the fixed clipping mechanism in PPO-like objectives systematically blocks entropy-increasing updates, thereby driving the policy toward over-exploitation at the expense of exploration. Building on these insights, we propose BAlanced Policy Optimization with Adaptive Clipping (BAPO), a simple yet effective method that dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. Across diverse off-policy scenarios–including sample replay and partial rollout–BAPO achieves fast, stable, and data-efficient training. On AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B, while our 32B BAPO model not only achieves state-of-the-art results among models of the same scale but also outperforms leading proprietary systems like o3-mini and Gemini-2.5-Flash-Thinking.
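One way to picture adaptive clipping is a PPO-style surrogate whose clip bounds are adjusted from the batch's balance of positive and negative advantages, so that the rarer side is not systematically blocked. The widening heuristic below is an assumption made for illustration; BAPO's actual update rule is not reproduced here.

```python
# Hedged sketch of a PPO-style surrogate with an adaptively widened clip bound,
# illustrating the idea of re-balancing positive vs. negative advantage contributions.
import torch

def adaptive_clip_surrogate(ratio, adv, base_eps=0.2):
    neg_frac = (adv < 0).float().mean()
    eps_high = base_eps * (1.0 + neg_frac)        # widen the upper bound when negatives dominate
    eps_low = base_eps
    clipped = ratio.clamp(1 - eps_low, 1 + eps_high)
    return torch.min(ratio * adv, clipped * adv).mean()   # surrogate to maximize

ratio = torch.exp(torch.randn(1024) * 0.1)        # new/old policy probability ratios
adv = torch.randn(1024) - 0.5                     # advantages skewed toward negative values
print(adaptive_clip_surrogate(ratio, adv))
```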
[301] Position: Many generalization measures for deep learning are fragile
Shuofeng Zhang, Ard Louis
Main category: cs.LG
TL;DR: Many post-mortem generalization measures for deep neural networks are fragile - small training modifications that barely affect the actual DNN can substantially change the measure’s value, trend, or scaling behavior, making them unreliable for qualitative generalization analysis.
Details
Motivation: To investigate the reliability of generalization measures in deep learning, particularly whether they can reproduce qualitative generalization trends despite challenges in obtaining tight bounds.Method: Analyzed various post-mortem generalization measures (path norm, spectral norm, Frobenius norm, flatness proxies, PAC-Bayes surrogates) by testing their sensitivity to minor training modifications like hyperparameter tweaks and SGD variants.
Result: Found that many measures are fragile - small changes can reverse learning curve slopes. PAC-Bayes origin measure is more robust to hyperparameter changes but fails to capture data complexity differences. Function-based marginal-likelihood PAC-Bayes captures data complexity but isn’t post-mortem.
Conclusion: Generalization measure developers should explicitly audit for fragility, as many current measures cannot reliably reproduce qualitative trends due to their sensitivity to minor training variations.
Abstract: A wide variety of generalization measures have been applied to deep neural networks (DNNs). Although obtaining tight bounds remains challenging, such measures are often assumed to reproduce qualitative generalization trends. In this position paper, we argue that many post-mortem generalization measures – those computed on trained networks – are fragile: small training modifications that barely affect the underlying DNN can substantially change a measure’s value, trend, or scaling behavior. For example, minor hyperparameter changes, such as learning rate adjustments or switching between SGD variants, can reverse the slope of a learning curve in widely used generalization measures like the path norm. We also identify subtler forms of fragility. For instance, the PAC-Bayes origin measure is regarded as one of the most reliable, and is indeed less sensitive to hyperparameter tweaks than many other measures. However, it completely fails to capture differences in data complexity across learning curves. This data fragility contrasts with the function-based marginal-likelihood PAC-Bayes bound, which does capture differences in data complexity, including scaling behavior, in learning curves, but which is not a post-mortem measure. Beyond demonstrating that many bounds – such as path, spectral and Frobenius norms, flatness proxies, and deterministic PAC-Bayes surrogates – are fragile, this position paper also argues that developers of new measures should explicitly audit them for fragility.
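As a position paper, this entry argues for auditing practice rather than describing an algorithm, so no code sketch is added here.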
[302] NeuroAda: Activating Each Neuron’s Potential for Parameter-Efficient Fine-Tuning
Zhi Zhang, Yixian Shen, Congfeng Cao, Ekaterina Shutova
Main category: cs.LG
TL;DR: NeuroAda is a novel parameter-efficient fine-tuning method that combines selective parameter identification with bypass connections to achieve fine-grained adaptation while maintaining high memory efficiency.
Details
Motivation: To reconcile the trade-off between addition-based methods (memory efficient but limited capacity) and selective in-situ adaptation (precise but memory intensive) in parameter-efficient fine-tuning.Method: First identifies important parameters through selective adaptation, then introduces bypass connections for these selected parameters. During finetuning, only the bypass connections are updated while original model parameters remain frozen.
Result: Achieves state-of-the-art performance on 23+ tasks with ≤0.02% trainable parameters, reducing CUDA memory usage by up to 60%.
Conclusion: NeuroAda successfully enables fine-grained model finetuning while maintaining high memory efficiency, offering a superior alternative to existing PEFT methods.
Abstract: Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model finetuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During finetuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with as little as $\leq 0.02\%$ trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: https://github.com/FightingFighting/NeuroAda.git.
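The "bypass connections on selected parameters" idea can be sketched as a frozen weight plus a sparse trainable delta restricted to the chosen connections. The importance criterion used below (weight magnitude) and the layer wrapper are illustrative assumptions, not NeuroAda's selection rule.

```python
# Hedged sketch of bypass connections on a selected subset of weights:
# the original weight is frozen and a masked trainable delta is added on top.
import torch
import torch.nn as nn

class BypassLinear(nn.Module):
    def __init__(self, base: nn.Linear, keep_ratio: float = 0.01):
        super().__init__()
        self.weight = nn.Parameter(base.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(base.bias.detach().clone(), requires_grad=False)
        k = max(1, int(keep_ratio * self.weight.numel()))
        idx = self.weight.abs().flatten().topk(k).indices         # "important" connections (assumed criterion)
        mask = torch.zeros_like(self.weight).flatten()
        mask[idx] = 1.0
        self.register_buffer("mask", mask.view_as(self.weight))
        self.delta = nn.Parameter(torch.zeros_like(self.weight))  # only trainable tensor

    def forward(self, x):
        return x @ (self.weight + self.delta * self.mask).t() + self.bias

layer = BypassLinear(nn.Linear(128, 128), keep_ratio=0.01)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print("trainable params:", trainable)  # the full delta tensor; the mask zeroes gradients off the selection
```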
[303] Towards Universal Solvers: Using PGD Attack in Active Learning to Increase Generalizability of Neural Operators as Knowledge Distillation from Numerical PDE Solvers
Yifei Sun
Main category: cs.LG
TL;DR: Adversarial teacher-student distillation framework improves neural operators’ out-of-distribution generalization while maintaining fast inference and low parameter cost.
Details
Motivation: Traditional PDE solvers are computationally expensive, while neural operators like FNOs and DeepONets have poor out-of-distribution generalization despite fast inference.Method: Uses adversarial teacher-student distillation with differentiable numerical solver supervision and PGD-style active sampling to find worst-case inputs under smoothness and energy constraints.
Result: Experiments on Burgers and Navier-Stokes systems show substantial improvement in OOD robustness while preserving low parameter cost and fast inference.
Conclusion: Adversarial distillation effectively enhances neural operators’ generalization capabilities without sacrificing their computational efficiency advantages.
Abstract: Nonlinear PDE solvers require fine space-time discretizations and local linearizations, leading to high memory cost and slow runtimes. Neural operators such as FNOs and DeepONets offer fast single-shot inference by learning function-to-function mappings and truncating high-frequency components, but they suffer from poor out-of-distribution (OOD) generalization, often failing on inputs outside the training distribution. We propose an adversarial teacher-student distillation framework in which a differentiable numerical solver supervises a compact neural operator while a PGD-style active sampling loop searches for worst-case inputs under smoothness and energy constraints to expand the training set. Using differentiable spectral solvers enables gradient-based adversarial search and stabilizes sample mining. Experiments on Burgers and Navier-Stokes systems demonstrate that adversarial distillation substantially improves OOD robustness while preserving the low parameter cost and fast inference of neural operators.
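The PGD-style active sampling loop searches for inputs where the student operator disagrees most with the differentiable teacher solver, then projects back into the constraint set. The teacher map, student network, and the simple energy-ball projection below are toy stand-ins (the paper also imposes smoothness constraints, which are omitted here).

```python
# Sketch of PGD-style worst-case input search against a differentiable "teacher" map.
# Teacher/student and the energy-ball constraint are toy stand-ins.
import torch

def teacher(u0):                       # stand-in for a differentiable numerical solve
    return torch.roll(u0, shifts=3, dims=-1) * 0.9

student = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.Tanh(), torch.nn.Linear(64, 64))

def pgd_worst_case(u0, steps=20, lr=0.1, energy=1.0):
    u = u0.clone().requires_grad_(True)
    for _ in range(steps):
        loss = torch.nn.functional.mse_loss(student(u), teacher(u))
        grad, = torch.autograd.grad(loss, u)
        with torch.no_grad():
            u = u + lr * grad.sign()                      # ascend the student-teacher discrepancy
            u = u * (energy / u.norm(dim=-1, keepdim=True).clamp(min=energy))  # project to energy ball
        u.requires_grad_(True)
    return u.detach()

hard_inputs = pgd_worst_case(torch.randn(8, 64))
print(hard_inputs.shape)   # candidates to add back into the distillation training set
```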
[304] An Encode-then-Decompose Approach to Unsupervised Time Series Anomaly Detection on Contaminated Training Data–Extended Version
Buang Zhang, Tung Kieu, Xiangfei Qiu, Chenjuan Guo, Jilin Hu, Aoying Zhou, Christian S. Jensen, Bin Yang
Main category: cs.LG
TL;DR: The paper proposes a novel encode-then-decompose paradigm for unsupervised time series anomaly detection that decomposes encoded representations into stable and auxiliary components to improve robustness against contaminated training data, and introduces a mutual information-based metric to replace reconstruction errors.
Details
Motivation: Autoencoder-based unsupervised anomaly detection methods are sensitive to anomalies in training data, which reduces their accuracy. Current approaches using reconstruction errors are vulnerable when training time series contain anomalies.Method: Proposes an encode-then-decompose paradigm that decomposes encoded representations into stable and auxiliary representations. Also introduces a mutual information-based metric instead of traditional reconstruction errors for anomaly identification.
Result: The method achieves competitive or state-of-the-art performance on eight commonly used multivariate and univariate time series benchmarks, and demonstrates robustness to time series with different contamination ratios.
Conclusion: The proposed encode-then-decompose paradigm with mutual information-based anomaly scoring effectively addresses the sensitivity issue of autoencoders to contaminated training data, providing robust unsupervised anomaly detection for time series.
Abstract: Time series anomaly detection is important in modern large-scale systems and is applied in a variety of domains to analyze and monitor the operation of diverse systems. Unsupervised approaches have received widespread interest, as they do not require anomaly labels during training, thus avoiding potentially high costs and having wider applications. Among these, autoencoders have received extensive attention. They use reconstruction errors from compressed representations to define anomaly scores. However, representations learned by autoencoders are sensitive to anomalies in training time series, causing reduced accuracy. We propose a novel encode-then-decompose paradigm, where we decompose the encoded representation into stable and auxiliary representations, thereby enhancing the robustness when training with contaminated time series. In addition, we propose a novel mutual information based metric to replace the reconstruction errors for identifying anomalies. Our proposal demonstrates competitive or state-of-the-art performance on eight commonly used multi- and univariate time series benchmarks and exhibits robustness to time series with different contamination ratios.
[305] Prior-informed optimization of treatment recommendation via bandit algorithms trained on large language model-processed historical records
Saman Nessari, Ali Bozorgi-Amiri
Main category: cs.LG
TL;DR: A comprehensive AI system combining LLMs, CTGANs, T-learners, and contextual bandits for personalized medical treatment recommendations, achieving superior performance over standard methods in colon cancer datasets.
Details
Motivation: Current medical practices use standardized frameworks that ignore individual patient variations, leading to suboptimal health outcomes. There's a need for personalized, data-informed clinical recommendations.Method: Integrates LLMs to process unstructured medical narratives (93.2% accuracy), CTGANs to generate synthetic patient data (55% accuracy), T-learners to predict treatment responses (84.3% accuracy), and contextual bandits (KernelUCB) for online therapeutic selection.
Result: Testing on stage III colon cancer datasets showed KernelUCB achieved 0.60-0.61 average reward scores across 5,000 rounds, outperforming other methods. The system overcomes cold-start limitations and improves computational effectiveness.
Conclusion: This comprehensive system represents significant progress toward individualized medicine adapted to specific patient characteristics, providing customized clinical recommendations.
Abstract: Current medical practice depends on standardized treatment frameworks and empirical methodologies that neglect individual patient variations, leading to suboptimal health outcomes. We develop a comprehensive system integrating Large Language Models (LLMs), Conditional Tabular Generative Adversarial Networks (CTGAN), T-learner counterfactual models, and contextual bandit approaches to provide customized, data-informed clinical recommendations. The approach utilizes LLMs to process unstructured medical narratives into structured datasets (93.2% accuracy), uses CTGANs to produce realistic synthetic patient data (55% accuracy via two-sample verification), deploys T-learners to forecast patient-specific treatment responses (84.3% accuracy), and integrates prior-informed contextual bandits to enhance online therapeutic selection by effectively balancing exploration of new possibilities with exploitation of existing knowledge. Testing on stage III colon cancer datasets revealed that our KernelUCB approach obtained 0.60-0.61 average reward scores across 5,000 rounds, exceeding other reference methods. This comprehensive system overcomes cold-start limitations in online learning environments, improves computational effectiveness, and constitutes notable progress toward individualized medicine adapted to specific patient characteristics.
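The online selection step uses a kernelized UCB score that trades off a posterior-mean reward estimate against its uncertainty. Below is a hedged sketch in the GP-UCB form with an RBF kernel; it shows the explore/exploit score only, and the contexts, kernel bandwidth, and priors are illustrative rather than the paper's KernelUCB configuration.

```python
# Hedged sketch of a kernel-UCB style arm selection (GP-UCB form with an RBF kernel).
import numpy as np

def rbf(a, b, gamma=1.0):
    d = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def ucb_scores(X_hist, y_hist, X_cand, lam=1.0, beta=1.0):
    K = rbf(X_hist, X_hist) + lam * np.eye(len(X_hist))
    K_inv = np.linalg.inv(K)
    k_cand = rbf(X_cand, X_hist)                           # (n_cand, n_hist)
    mean = k_cand @ K_inv @ y_hist                         # posterior-mean reward estimate
    var = 1.0 - np.einsum("ij,jk,ik->i", k_cand, K_inv, k_cand)
    return mean + beta * np.sqrt(np.clip(var, 0.0, None))  # optimism bonus for uncertainty

rng = np.random.default_rng(0)
X_hist = rng.normal(size=(20, 5))                          # past context/treatment features
y_hist = rng.normal(size=20)                               # observed rewards
X_cand = rng.normal(size=(3, 5))                           # candidate treatments for a new patient
print("chosen arm:", int(np.argmax(ucb_scores(X_hist, y_hist, X_cand))))
```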
[306] Category learning in deep neural networks: Information content and geometry of internal representations
Laurent Bonnasse-Gahot, Jean-Pierre Nadal
Main category: cs.LG
TL;DR: Category learning in neural networks enhances discrimination near category boundaries through optimal representation learning that aligns neural Fisher information with category-specific Fisher information.
Details
Motivation: To extend theoretical framework from neuroscience to artificial networks, explaining how categorical perception emerges from efficient learning principles and mutual information maximization.Method: Theoretical analysis showing that minimizing Bayes cost (cross-entropy) maximizes mutual information between categories and neural activities, leading to optimal projection space and neural representation with appropriate metric based on Fisher information matrices.
Result: Category learning induces neural space expansion near decision boundaries, with Fisher information maxima located near (not exactly at) class boundaries. Numerical experiments on toy models and MNIST dataset confirm alignment of neural and categorical Fisher information matrices.
Conclusion: Categorical perception is an emergent property of optimal learning that maximizes mutual information, with Fisher information playing a key role in determining neural representation geometry near category boundaries.
Abstract: In animals, category learning enhances discrimination between stimuli close to the category boundary. This phenomenon, called categorical perception, was also empirically observed in artificial neural networks trained on classification tasks. In previous modeling works based on neuroscience data, we show that this expansion/compression is a necessary outcome of efficient learning. Here we extend our theoretical framework to artificial networks. We show that minimizing the Bayes cost (mean of the cross-entropy loss) implies maximizing the mutual information between the set of categories and the neural activities prior to the decision layer. Considering structured data with an underlying feature space of small dimension, we show that maximizing the mutual information implies (i) finding an appropriate projection space, and, (ii) building a neural representation with the appropriate metric. The latter is based on a Fisher information matrix measuring the sensitivity of the neural activity to changes in the projection space. Optimal learning makes this neural Fisher information follow a category-specific Fisher information, measuring the sensitivity of the category membership. Category learning thus induces an expansion of neural space near decision boundaries. We characterize the properties of the categorical Fisher information, showing that its eigenvectors give the most discriminant directions at each point of the projection space. We find that, unexpectedly, its maxima are in general not exactly at, but near, the class boundaries. Considering toy models and the MNIST dataset, we numerically illustrate how after learning the two Fisher information matrices match, and essentially align with the category boundaries. Finally, we relate our approach to the Information Bottleneck one, and we exhibit a bias-variance decomposition of the Bayes cost, of interest on its own.
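For reference, the two matrices being aligned are standard Fisher information quantities: one measures how sensitive the neural activity distribution is to the projection-space point, the other how sensitive category membership is. The notation below is a sketch of those standard definitions, not the paper's exact formulation.

```latex
% Standard definitions (notation illustrative): x is a point in the projection space,
% r the neural activity before the decision layer, and c the category label.
F^{\mathrm{neural}}_{ij}(x) = \mathbb{E}_{r \sim p(r\mid x)}\!\left[
    \frac{\partial \log p(r\mid x)}{\partial x_i}\,
    \frac{\partial \log p(r\mid x)}{\partial x_j}\right],
\qquad
F^{\mathrm{cat}}_{ij}(x) = \sum_{c} p(c\mid x)\,
    \frac{\partial \log p(c\mid x)}{\partial x_i}\,
    \frac{\partial \log p(c\mid x)}{\partial x_j}.
```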
[307] Empowering Decision Trees via Shape Function Branching
Nakul Upadhya, Eldan Cohen
Main category: cs.LG
TL;DR: The paper proposes Shape Generalized Trees (SGTs) that use learnable axis-aligned shape functions instead of simple linear splits, enabling rich non-linear partitioning while maintaining interpretability.
Details
Motivation: Traditional decision trees rely on simple axis-aligned linear splits, which often require deep, complex structures to capture non-linear feature effects, undermining human comprehension and interpretability.Method: Proposed SGTs with learnable axis-aligned shape functions for each internal node, ShapeCART induction algorithm, and extensions to bivariate shape functions (S²GT) and multi-way trees (SGT_K) with corresponding learning algorithms.
Result: Experiments show SGTs achieve superior performance with reduced model size compared to traditional axis-aligned linear trees.
Conclusion: SGTs provide an interpretable alternative to traditional decision trees by enabling rich non-linear partitioning through visualizable shape functions while maintaining or improving performance with smaller model sizes.
Abstract: Decision trees are prized for their interpretability and strong performance on tabular data. Yet, their reliance on simple axis-aligned linear splits often forces deep, complex structures to capture non-linear feature effects, undermining human comprehension of the constructed tree. To address this limitation, we propose a novel generalization of a decision tree, the Shape Generalized Tree (SGT), in which each internal node applies a learnable axis-aligned shape function to a single feature, enabling rich, non-linear partitioning in one split. As users can easily visualize each node’s shape function, SGTs are inherently interpretable and provide intuitive, visual explanations of the model’s decision mechanisms. To learn SGTs from data, we propose ShapeCART, an efficient induction algorithm for SGTs. We further extend the SGT framework to bivariate shape functions (S$^2$GT) and multi-way trees (SGT$_K$), and present Shape$^2$CART and ShapeCART$_K$, extensions to ShapeCART for learning S$^2$GTs and SGT$_K$s, respectively. Experiments on various datasets show that SGTs achieve superior performance with reduced model size compared to traditional axis-aligned linear trees.
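To make the split mechanism concrete, here is a minimal sketch of what an SGT-style internal node could look like with a piecewise-constant shape function on a single feature; class and parameter names are hypothetical, and the shape functions learned by ShapeCART are not necessarily piecewise-constant.

```python
import numpy as np

class ShapeNode:
    """Illustrative SGT-style internal node: routes samples with a learnable,
    axis-aligned shape function applied to a single feature."""

    def __init__(self, feature_idx, bin_edges, bin_values, threshold=0.0):
        self.feature_idx = feature_idx             # feature this node examines
        self.bin_edges = np.asarray(bin_edges)     # bin boundaries over that feature
        self.bin_values = np.asarray(bin_values)   # learnable shape-function value per bin
        self.threshold = threshold

    def shape(self, x):
        # Evaluate the piecewise-constant shape function f(x_j) for each sample.
        bins = np.digitize(x[:, self.feature_idx], self.bin_edges)
        bins = np.clip(bins, 0, len(self.bin_values) - 1)
        return self.bin_values[bins]

    def route(self, x):
        # One non-linear split: samples go left where f(x_j) <= threshold.
        return self.shape(x) <= self.threshold

node = ShapeNode(feature_idx=0,
                 bin_edges=[-1.0, 0.0, 1.0],
                 bin_values=[0.2, -0.5, 0.7, 0.1])
go_left = node.route(np.random.randn(5, 3))        # boolean mask per sample
```

Because the node's decision depends on a visualizable 1-D function of one feature, a single split can carve out non-contiguous regions that an axis-aligned threshold would need several levels of depth to express.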
[308] POLAR: Policy-based Layerwise Reinforcement Learning Method for Stealthy Backdoor Attacks in Federated Learning
Kuai Yu, Xiaoyu Wu, Peishen Yan, Qingqian Yang, Linshan Jiang, Hao Wang, Yang Hua, Tao Song, Haibing Guan
Main category: cs.LG
TL;DR: POLAR uses reinforcement learning for layer selection in federated learning backdoor attacks, achieving better stealth and effectiveness than rule-based methods.
Details
Motivation: Existing backdoor-critical layer approaches in federated learning use rule-based selection without considering layer interrelations, making them ineffective and detectable by advanced defenses.Method: POLAR adopts lightweight RL with Bernoulli sampling to dynamically learn attack strategies, using policy gradient updates based on backdoor success rate improvements with regularization to limit modified layers.
Result: POLAR outperforms latest attack methods by up to 40% against six state-of-the-art defenses in extensive experiments.
Conclusion: POLAR successfully demonstrates that RL-based layer selection significantly improves backdoor attack effectiveness and stealthiness in federated learning.
Abstract: Federated Learning (FL) enables decentralized model training across multiple clients without exposing local data, but its distributed nature makes it vulnerable to backdoor attacks. While early FL backdoor attacks modified entire models, recent studies have explored the concept of backdoor-critical (BC) layers, which poison chosen influential layers to maintain stealthiness while achieving high effectiveness. However, existing BC-layer approaches rely on rule-based selection without considering the interrelations between layers, making them ineffective and prone to detection by advanced defenses. In this paper, we propose POLAR (POlicy-based LAyerwise Reinforcement learning), the first pipeline to adopt RL to solve the BC layer selection problem in layer-wise backdoor attacks. Different from other commonly used RL paradigms, POLAR is lightweight with Bernoulli sampling. POLAR dynamically learns an attack strategy, optimizing layer selection using policy gradient updates based on backdoor success rate (BSR) improvements. To ensure stealthiness, we introduce a regularization constraint that limits the number of modified layers by penalizing large attack footprints. Extensive experiments demonstrate that POLAR outperforms the latest attack methods by up to 40% against six state-of-the-art (SOTA) defenses.

[309] Weight Decay may matter more than muP for Learning Rate Transfer in Practice
Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, Xi Chen
Main category: cs.LG
TL;DR: muP’s learning rate scaling for neural network width transfer primarily acts as implicit warmup, with weight decay being the key factor for stable update dynamics across widths.
Details
Motivation: To enable efficient training at large scales by transferring optimal learning rates from small to large networks, avoiding expensive hyperparameter tuning.Method: Large-scale empirical investigation of Maximal Update Parameterization (muP) assumptions and comparison with weight decay effects on update dynamics across different model widths.
Result: muP’s scaling assumptions only hold briefly at training start; weight decay is the primary factor stabilizing update dynamics across widths, making muP act mainly as implicit warmup.
Conclusion: Weight decay, not muP scaling, enables learning rate transfer across widths, challenging prevailing beliefs and explaining why muP requires independent weight decay for successful transfer.
Abstract: Transferring the optimal learning rate from small to large neural networks can enable efficient training at scales where hyperparameter tuning is otherwise prohibitively expensive. To this end, the Maximal Update Parameterization (muP) proposes a learning rate scaling designed to keep the update dynamics of internal representations stable across different model widths. However, the scaling rules of muP rely on strong assumptions, particularly about the geometric alignment of a layer’s inputs with both its weights and gradient updates. In this large-scale empirical investigation, we show that these assumptions hold only briefly at the start of training in the practical setups where learning rate transfer is most valuable, such as LLM training. For the remainder of training it is weight decay rather than muP that correctly stabilizes the update dynamics of internal representations across widths, facilitating learning rate transfer. This suggests muP’s scaling primarily acts as a form of implicit learning rate warmup, allowing us to largely replace it with modified warmup schedules. Together these findings fundamentally challenge prevailing beliefs about learning rate transfer and can explain empirical practice such as why muP requires the independent weight decay variant for successful transfer.
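As a reference point, the standard muP prescription for Adam-style training scales the hidden-layer learning rate inversely with width, while the independent (fully decoupled) weight decay variant applies the decay term without the learning rate factor; this is a generic statement of both rules, not the authors' exact experimental setup.

$$
\eta_{\text{hidden}}(d) = \eta_{\text{base}}\cdot\frac{d_{\text{base}}}{d},
\qquad
\theta_{t+1} = \theta_t - \eta_t\,u_t - \lambda\,\theta_t,
$$

where $d$ is the model width, $u_t$ the optimizer's update direction, and $\lambda$ a width-independent weight decay coefficient. The paper's claim is that the second rule, not the first, is what keeps update dynamics stable across widths for most of training.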
[310] What Makes a Good Curriculum? Disentangling the Effects of Data Ordering on LLM Mathematical Reasoning
Yaning Jia, Chunhui Zhang, Xingjian Diao, Xiangchi Yuan, Zhongyu Ouyang, Soroush Vosoughi
Main category: cs.LG
TL;DR: Curriculum learning effectiveness depends on model capability and task complexity, with no universal strategy. Forward vs reverse curriculum depends on these factors, and different difficulty metrics provide distinct benefits based on task demands.
Details
Motivation: To systematically evaluate when curriculum learning helps, which direction (forward/reverse) is better, and whether the answer depends on what is measured, given disparate approaches in prior work.Method: Unified offline evaluation framework with five difficulty dimensions: Problem Difficulty, Model Surprisal, Confidence Margin, Predictive Uncertainty, and Decision Variability. Controlled post-training experiments on mathematical reasoning benchmarks with Llama3.1-8B, Mistral-7B, and Gemma3-4B.
Result: No curriculum strategy dominates universally; forward vs reverse CL effectiveness depends on model capability and task complexity. Different difficulty levels within a single metric produce distinct gains based on task demands. Task-aligned curricula shape final representations while inner-state curricula modulate internal states.
Conclusion: Challenges the notion of universal curriculum strategy and offers actionable guidance across model/task regimes. Prioritizing decision-uncertain samples can further enhance learning outcomes.
Abstract: Curriculum learning (CL) - ordering training data from easy to hard - has become a popular strategy for improving reasoning in large language models (LLMs). Yet prior work employs disparate difficulty metrics and training setups, leaving open fundamental questions: When does curriculum help? Which direction - forward or reverse - is better? And does the answer depend on what we measure? We address these questions through a unified offline evaluation framework that decomposes curriculum difficulty into five complementary dimensions: Problem Difficulty, Model Surprisal, Confidence Margin, Predictive Uncertainty, and Decision Variability. Through controlled post-training experiments on mathematical reasoning benchmarks with Llama3.1-8B, Mistral-7B, and Gemma3-4B, we find that (i) no curriculum strategy dominates universally - the relative effectiveness of forward versus reverse CL depends jointly on model capability and task complexity; (ii) even within a single metric, samples at different difficulty levels produce distinct gains depending on task demands; and (iii) task-aligned curricula focus on shaping the model’s final representations and generalization, whereas inner-state curricula modulate internal states such as confidence and uncertainty. Our findings challenge the notion of a universal curriculum strategy and offer actionable guidance across model and task regimes, with some metrics indicating that prioritizing decision-uncertain samples can further enhance learning outcomes.
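A minimal sketch of how a forward or reverse curriculum can be built from a single difficulty signal (here per-sample surprisal, i.e. negative log-likelihood); the scores and samples are placeholders, not the paper's benchmarks or metrics.

```python
import numpy as np

def build_curriculum(samples, surprisal, reverse=False):
    """Order samples by a difficulty score; low surprisal first (easy -> hard)."""
    order = np.argsort(surprisal)
    if reverse:
        order = order[::-1]                      # reverse curriculum: hard -> easy
    return [samples[i] for i in order]

samples = ["q1", "q2", "q3", "q4"]
surprisal = np.array([1.2, 0.3, 2.7, 0.9])       # per-sample negative log-likelihood
forward = build_curriculum(samples, surprisal)                 # ['q2', 'q4', 'q1', 'q3']
reverse = build_curriculum(samples, surprisal, reverse=True)   # ['q3', 'q1', 'q4', 'q2']
```

The paper's point is that which ordering helps, and which difficulty signal to sort by, depends on the model and the task rather than on a universal rule.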
[311] MetaCluster: Enabling Deep Compression of Kolmogorov-Arnold Network
Matthew Raffel, Adwaith Renjith, Lizhong Chen
Main category: cs.LG
TL;DR: MetaCluster is a compression framework for Kolmogorov-Arnold Networks (KANs) that reduces parameter storage by up to 80× without accuracy loss through clustering and meta-learning.
Details
Motivation: KANs replace scalar weights with vector coefficients, increasing expressivity but also dramatically increasing parameters and memory usage, creating a need for efficient compression methods.Method: A lightweight meta-learner maps embeddings to coefficient vectors, shaping them onto a low-dimensional manifold. K-means clustering replaces per-edge vectors with shared centroids, followed by fine-tuning the centroid codebook.
Result: On MNIST, CIFAR-10, and CIFAR-100 datasets, MetaCluster achieves up to 80× reduction in parameter storage across standard KANs and ConvKANs with multiple basis functions, with no loss in accuracy.
Conclusion: MetaCluster enables highly compressed KANs by exploiting the vector nature of their parameters through clustering and meta-learning, making them practical for deployment while maintaining performance.
Abstract: Kolmogorov-Arnold Networks (KANs) replace scalar weights with per-edge vectors of basis coefficients, thereby boosting expressivity and accuracy but at the same time resulting in a multiplicative increase in parameters and memory. We propose MetaCluster, a framework that makes KANs highly compressible without sacrificing accuracy. Specifically, a lightweight meta-learner, trained jointly with the KAN, is used to map low-dimensional embedding to coefficient vectors, shaping them to lie on a low-dimensional manifold that is amenable to clustering. We then run K-means in coefficient space and replace per-edge vectors with shared centroids. Afterwards, the meta-learner can be discarded, and a brief fine-tuning of the centroid codebook recovers any residual accuracy loss. The resulting model stores only a small codebook and per-edge indices, exploiting the vector nature of KAN parameters to amortize storage across multiple coefficients. On MNIST, CIFAR-10, and CIFAR-100, across standard KANs and ConvKANs using multiple basis functions, MetaCluster achieves a reduction of up to 80$\times$ in parameter storage, with no loss in accuracy. Code will be released upon publication.
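The clustering-and-codebook step described above can be sketched as follows; the shapes are illustrative, and the jointly trained meta-learner and the centroid fine-tuning stage are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans

# Per-edge KAN coefficient vectors (illustrative sizes).
num_edges, basis_dim, num_centroids = 10_000, 16, 64
coeffs = np.random.randn(num_edges, basis_dim).astype(np.float32)

km = KMeans(n_clusters=num_centroids, n_init=10, random_state=0).fit(coeffs)
codebook = km.cluster_centers_                # (num_centroids, basis_dim) shared centroids
indices = km.labels_.astype(np.uint16)        # one small index per edge

# Storage drops from num_edges * basis_dim floats to
# num_centroids * basis_dim floats plus num_edges small integers.
reconstructed = codebook[indices]             # coefficients used at inference time
```

The meta-learner's role in the paper is to shape the coefficient vectors onto a low-dimensional manifold first, so that this K-means step loses little accuracy.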
[312] Learning Peer Influence Probabilities with Linear Contextual Bandits
Ahmed Sayeed Faruk, Mohammad Shahverdikondori, Elena Zheleva
Main category: cs.LG
TL;DR: The paper studies learning peer influence probabilities in networked environments using contextual linear bandits, addressing the trade-off between regret minimization and estimation error.
Details
Motivation: Accurate estimation of heterogeneous peer influence probabilities is crucial for understanding information diffusion and improving viral marketing, but existing methods either waste resources with random exploration or optimize only for rewards.Method: Proposes an uncertainty-guided exploration algorithm within a contextual linear bandit framework that can tune parameters to achieve any desired trade-off between regret minimization and estimation error.
Result: Experiments on semi-synthetic network datasets demonstrate advantages over static methods and contextual bandits that ignore the trade-off between regret and estimation accuracy.
Conclusion: The proposed method successfully addresses the fundamental trade-off between regret minimization and estimation error in learning peer influence probabilities, providing a tunable approach that outperforms existing methods.
Abstract: In networked environments, users frequently share recommendations about content, products, services, and courses of action with others. The extent to which such recommendations are successful and adopted is highly contextual, dependent on the characteristics of the sender, recipient, their relationship, the recommended item, and the medium, which makes peer influence probabilities highly heterogeneous. Accurate estimation of these probabilities is key to understanding information diffusion processes and to improving the effectiveness of viral marketing strategies. However, learning these probabilities from data is challenging; static data may capture correlations between peer recommendations and peer actions but fails to reveal influence relationships. Online learning algorithms can learn these probabilities from interventions but either waste resources by learning from random exploration or optimize for rewards, thus favoring exploration of the space with higher influence probabilities. In this work, we study learning peer influence probabilities under a contextual linear bandit framework. We show that a fundamental trade-off can arise between regret minimization and estimation error, characterize all achievable rate pairs, and propose an uncertainty-guided exploration algorithm that, by tuning a parameter, attains any pair within this trade-off. Our experiments on semi-synthetic network datasets show the advantages of our method over static methods and contextual bandits that ignore this trade-off.
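A hedged LinUCB-style sketch of the underlying machinery: a ridge estimate of the influence parameters plus an exploration bonus whose weight plays the role of the tunable trade-off parameter. The paper's actual algorithm and its tuning rule may differ; this only illustrates the contextual-bandit setup.

```python
import numpy as np

d, alpha, lam = 8, 1.0, 1.0      # context dimension, exploration weight, ridge term
A = lam * np.eye(d)              # regularized design matrix
b = np.zeros(d)

def choose(contexts):
    """Pick the recommendation context with the highest optimistic influence estimate."""
    theta = np.linalg.solve(A, b)
    scores = [x @ theta + alpha * np.sqrt(x @ np.linalg.solve(A, x)) for x in contexts]
    return int(np.argmax(scores))

def update(x, adopted):
    """Update the estimate after observing whether the peer adopted (0/1)."""
    global A, b
    A += np.outer(x, x)
    b += adopted * x

contexts = [np.random.randn(d) for _ in range(5)]
i = choose(contexts)
update(contexts[i], adopted=1)
```

Raising the exploration weight improves the accuracy of the estimated influence probabilities at the cost of regret; the paper characterizes exactly which (regret, estimation-error) pairs are achievable.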
[313] InvarGC: Invariant Granger Causality for Heterogeneous Interventional Time Series under Latent Confounding
Ziyi Zhang, Shaogang Ren, Xiaoning Qian, Nick Duffield
Main category: cs.LG
TL;DR: Proposes Invariant Granger Causality (InvarGC) to address limitations of traditional Granger causality by handling latent confounders and unknown interventions through cross-environment heterogeneity analysis.
Details
Motivation: Traditional Granger causality tests fail with non-linear relationships and rely on unrealistic assumptions of causal sufficiency and known interventions, while real-world time series often have latent confounders and unknown intervention targets.Method: Leverages cross-environment heterogeneity to mitigate latent confounding effects and distinguish intervened from non-intervened environments at edge-level granularity, recovering invariant causal relations.
Result: Extensive experiments on synthetic and real-world datasets show competitive performance compared to state-of-the-art methods, with established identifiability under the proposed conditions.
Conclusion: InvarGC effectively addresses key limitations of existing Granger causality methods by handling latent confounders and unknown interventions through invariant causal discovery across heterogeneous environments.
Abstract: Granger causality is widely used for causal structure discovery in complex systems from multivariate time series data. Traditional Granger causality tests based on linear models often fail to detect even mild non-linear causal relationships. Therefore, numerous recent studies have investigated non-linear Granger causality methods, achieving improved performance. However, these methods often rely on two key assumptions: causal sufficiency and known interventional targets. Causal sufficiency assumes the absence of latent confounders, yet their presence can introduce spurious correlations. Moreover, real-world time series data usually come from heterogeneous environments, without prior knowledge of interventions. Therefore, in practice, it is difficult to distinguish intervened environments from non-intervened ones, and even harder to identify which variables or timesteps are affected. To address these challenges, we propose Invariant Granger Causality (InvarGC), which leverages cross-environment heterogeneity to mitigate the effects of latent confounding and to distinguish intervened from non-intervened environments with edge-level granularity, thereby recovering invariant causal relations. In addition, we establish the identifiability under these conditions. Extensive experiments on both synthetic and real-world datasets demonstrate the competitive performance of our approach compared to state-of-the-art methods.
[314] Subliminal Corruption: Mechanisms, Thresholds, and Interpretability
Reya Vir, Sarvesh Bhatnagar
Main category: cs.LG
TL;DR: Subliminal corruption is a vulnerability where undesirable traits spread through synthetic data, bypassing safety checks and causing alignment failures in AI systems.
Details
Motivation: As machine learning models increasingly use synthetic data for fine-tuning, there's a critical risk of subtle misalignments spreading through interconnected AI systems, requiring quantitative understanding of this phenomenon.Method: Systematic study using teacher-student setup with GPT-2 to analyze scaling laws, thresholds, and mechanisms of subliminal corruption through controlled experiments and interpretability analysis.
Result: Three key findings: (1) subliminal corruption causes behavioral crossover degrading overall alignment, (2) alignment fails in sharp phase transition at critical threshold of poisoned data, (3) corruption mechanism mimics natural fine-tuning making detection difficult.
Conclusion: Demonstrates critical vulnerability in AI systems using synthetic data and highlights need for new safety protocols to address latent threats from subliminal corruption.
Abstract: As machine learning models are increasingly fine-tuned on synthetic data, there is a critical risk of subtle misalignments spreading through interconnected AI systems. This paper investigates subliminal corruption, which we define as the transmission of undesirable traits through semantically neutral data, bypassing standard safety checks. While this phenomenon has been identified, a quantitative understanding of its dynamics is missing. To address this gap, we present a systematic study of the scaling laws, thresholds, and mechanisms of subliminal corruption using a teacher-student setup with GPT-2. Our experiments reveal three key findings: (1) subliminal corruption causes behavioral crossover, degrading the model’s overall alignment, not just the targeted trait; (2) alignment fails in a sharp phase transition at a critical threshold of poisoned data, rather than degrading gradually; and (3) interpretability analysis shows the corruption mechanism mimics the model’s natural fine-tuning process, making it difficult to detect. These results demonstrate a critical vulnerability in AI systems that rely on synthetic data and highlight the need for new safety protocols that can account for latent threats.
[315] Feature Space Adaptation for Robust Model Fine-Tuning
Peng Wang, Minghao Gu, Qiang Huang
Main category: cs.LG
TL;DR: The paper proposes feature space adaptation methods (LoRFA and VeFA) to mitigate catastrophic forgetting in model fine-tuning by preserving pre-trained knowledge through lightweight feature-level transformations instead of modifying model weights.
Details
Motivation: To address catastrophic forgetting in fine-tuning when downstream data is limited or differs from pre-training distribution, and to prevent overwriting pre-trained knowledge while enhancing robustness against distribution shifts.Method: Two feature space adaptation methods: LoRFA (Low-Rank Feature Adaptation) and VeFA (Vector-Based Feature Adaptation), which compensate for downstream lurking variables via lightweight feature-level transformations based on effect equivalence modeling.
Result: Feature space adaptation achieves comparable fine-tuning performance to LoRA on image classification, NLU, and NLG tasks, while consistently demonstrating stronger robustness under distribution shift.
Conclusion: Fine-tuning in feature space rather than weight space better preserves pre-trained representations and improves model generalization, making it a more robust approach for downstream adaptation.
Abstract: Catastrophic forgetting is a common issue in model fine-tuning, especially when the downstream domain contains limited labeled data or differs greatly from the pre-training distribution. Existing parameter-efficient fine-tuning methods operate in the weight space by modifying or augmenting the pre-trained model’s parameters, which can yield models overly specialized to the available downstream data. To mitigate the risk of overwriting pre-trained knowledge and enhance robustness, we propose to fine-tune the pre-trained model in the feature space. Two new fine-tuning methods are proposed: LoRFA (Low-Rank Feature Adaptation) and VeFA (Vector-Based Feature Adaptation). Feature space adaptation is inspired by the idea of effect equivalence modeling (EEM) of downstream lurking variables causing distribution shifts, which posits that unobserved factors can be represented as the total equivalent amount on observed features. By compensating for the effects of downstream lurking variables via a lightweight feature-level transformation, the pre-trained representations can be preserved, which improves model generalization under distribution shift. We evaluate LoRFA and VeFA versus LoRA on image classification, NLU, and NLG, covering both standard fine-tuning metrics and robustness. Feature space adaptation achieves comparable fine-tuning results and consistently stronger robustness.
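To illustrate the feature-space (rather than weight-space) idea, here is a minimal low-rank adapter applied to a frozen backbone's output features; names and dimensions are hypothetical, and this is not the authors' LoRFA implementation.

```python
import torch
import torch.nn as nn

class LowRankFeatureAdapter(nn.Module):
    """Learns a low-rank residual transform of the frozen backbone's features."""

    def __init__(self, feat_dim, rank=4):
        super().__init__()
        self.down = nn.Linear(feat_dim, rank, bias=False)
        self.up = nn.Linear(rank, feat_dim, bias=False)
        nn.init.zeros_(self.up.weight)          # adapter starts as the identity mapping

    def forward(self, features):
        return features + self.up(self.down(features))

backbone = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 64))
for p in backbone.parameters():
    p.requires_grad_(False)                     # pre-trained weights are never modified

adapter, head = LowRankFeatureAdapter(feat_dim=64), nn.Linear(64, 10)
logits = head(adapter(backbone(torch.randn(8, 32))))
```

Because only the feature-level transform and the task head are trained, the pre-trained representation itself is left intact, which is the property the paper credits for the improved robustness under distribution shift.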
[316] Instance-Dependent Regret Bounds for Nonstochastic Linear Partial Monitoring
Federico Di Gennaro, Khaled Eldowa, Nicolò Cesa-Bianchi
Main category: cs.LG
TL;DR: This paper presents a nonstochastic linear partial monitoring framework with finite actions, using an efficient exploration-by-optimization method to achieve improved regret bounds that transparently depend on game structure.
Details
Motivation: To address the limitations of classic partial monitoring by modeling infinite outcome spaces with linear structure, generalizing linear bandits with decoupled loss and feedback, and providing more transparent regret guarantees.Method: A simple instance of exploration-by-optimization method that is amenable to efficient implementation for nonstochastic linear partial monitoring with finite actions.
Result: Derived regret bounds that depend transparently on game structure, featuring instance-specific quantities reflecting observation-loss alignment, achieving √T rate in locally observable games and T^{2/3} in globally observable games.
Conclusion: The proposed method provides tight dependence on game structure across various partial information settings, with bounds resembling stochastic guarantees and improved transparency compared to previous approaches.
Abstract: In contrast to the classic formulation of partial monitoring, linear partial monitoring can model infinite outcome spaces, while imposing a linear structure on both the losses and the observations. This setting can be viewed as a generalization of linear bandits where loss and feedback are decoupled in a flexible manner. In this work, we address a nonstochastic (adversarial), finite-actions version of the problem through a simple instance of the exploration-by-optimization method that is amenable to efficient implementation. We derive regret bounds that depend on the game structure in a more transparent manner than previous theoretical guarantees for this paradigm. Our bounds feature instance-specific quantities that reflect the degree of alignment between observations and losses, and resemble known guarantees in the stochastic setting. Notably, they achieve the standard $\sqrt{T}$ rate in easy (locally observable) games and $T^{2/3}$ in hard (globally observable) games, where $T$ is the time horizon. We instantiate these bounds in a selection of old and new partial information settings subsumed by this model, and illustrate that the achieved dependence on the game structure can be tight in interesting cases.
[317] Preliminary Use of Vision Language Model Driven Extraction of Mouse Behavior Towards Understanding Fear Expression
Paimon Goulart, Jordan Steinhauser, Kylene Shuler, Edward Korzus, Jia Chen, Evangelos E. Papalexakis
Main category: cs.LG
TL;DR: A vision-language model (VLM) pipeline built on Qwen2.5-VL classifies mouse behaviors from videos, producing behavioral vectors over time without any model fine-tuning, relying instead on prompts, in-context learning, and frame preprocessing.
Details
Motivation: To improve scientific exploration by integrating diverse data and enabling researchers to study mouse behavior across multiple time points and environments through automated behavioral classification.Method: Uses Qwen2.5-VL model with prompts, in-context learning with labeled examples, and frame-level preprocessing to classify mouse behaviors from videos without fine-tuning the model.
Result: Achieves strong F1 scores across all behaviors, including rare classes like freezing and fleeing, and produces valuable behavioral vector datasets with high accuracy and minimal user input.
Conclusion: The model supports interdisciplinary researchers by enabling integration of diverse behavioral features into comprehensive datasets for addressing complex research questions about mouse behavior.
Abstract: Integration of diverse data will be a pivotal step towards improving scientific explorations in many disciplines. This work establishes a vision-language model (VLM) that encodes videos with text input in order to classify various behaviors of a mouse existing in and engaging with its environment. Importantly, this model produces a behavioral vector over time for each subject and for each session the subject undergoes. The output is a valuable dataset that few programs are able to produce with such high accuracy and minimal user input. Specifically, we use the open-source Qwen2.5-VL model and enhance its performance through prompts, in-context learning (ICL) with labeled examples, and frame-level preprocessing. We found that each of these methods contributes to improved classification, and that combining them results in strong F1 scores across all behaviors, including rare classes like freezing and fleeing, without any model fine-tuning. Overall, this model will support interdisciplinary researchers studying mouse behavior by enabling them to integrate diverse behavioral features, measured across multiple time points and environments, into a comprehensive dataset that can address complex research questions.
[318] Natural Gradient VI: Guarantees for Non-Conjugate Models
Fangyuan Sun, Ilyas Fatkhullin, Niao He
Main category: cs.LG
TL;DR: This paper provides theoretical analysis of Stochastic Natural Gradient Variational Inference (NGVI) for non-conjugate likelihoods, establishing convergence guarantees and uncovering hidden convexity properties.
Details
Motivation: Despite NGVI's empirical success in variational inference, its theoretical foundations remain limited, especially for non-conjugate likelihoods where the variational loss becomes non-convex and harder to analyze.Method: The authors focus on mean-field parameterization and: 1) derive sufficient conditions for relative smoothness, 2) propose a modified NGVI algorithm with non-Euclidean projections, and 3) analyze hidden convexity properties under additional structural assumptions.
Result: The paper proves global non-asymptotic convergence to a stationary point for the modified NGVI algorithm, and establishes fast global convergence to a global optimum under additional structural assumptions about the likelihood.
Conclusion: These results provide new insights into the geometry and convergence behavior of NGVI in challenging non-conjugate inference settings, advancing the theoretical understanding of this widely used variational inference method.
Abstract: Stochastic Natural Gradient Variational Inference (NGVI) is a widely used method for approximating posterior distribution in probabilistic models. Despite its empirical success and foundational role in variational inference, its theoretical underpinnings remain limited, particularly in the case of non-conjugate likelihoods. While NGVI has been shown to be a special instance of Stochastic Mirror Descent, and recent work has provided convergence guarantees using relative smoothness and strong convexity for conjugate models, these results do not extend to the non-conjugate setting, where the variational loss becomes non-convex and harder to analyze. In this work, we focus on mean-field parameterization and advance the theoretical understanding of NGVI in three key directions. First, we derive sufficient conditions under which the variational loss satisfies relative smoothness with respect to a suitable mirror map. Second, leveraging this structure, we propose a modified NGVI algorithm incorporating non-Euclidean projections and prove its global non-asymptotic convergence to a stationary point. Finally, under additional structural assumptions about the likelihood, we uncover hidden convexity properties of the variational loss and establish fast global convergence of NGVI to a global optimum. These results provide new insights into the geometry and convergence behavior of NGVI in challenging inference settings.
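For context, a generic natural-gradient VI step preconditions the stochastic gradient of the variational objective with the inverse Fisher information of the variational family; the paper's contribution concerns when and how (modified) variants of this update converge, not the update itself.

$$
\lambda_{t+1} = \lambda_t - \eta_t\,F(\lambda_t)^{-1}\,\widehat{\nabla}_{\lambda}\,\mathcal{L}(\lambda_t),
$$

where $\mathcal{L}$ is the (negative) ELBO, $q_{\lambda}$ the mean-field variational distribution, and $F(\lambda)$ its Fisher information matrix; the modified algorithm analyzed in the paper additionally incorporates non-Euclidean projections.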
[319] Imbalanced Gradients in RL Post-Training of Multi-Task LLMs
Runzhe Wu, Ankur Samanta, Ayush Jain, Scott Fujimoto, Jeongyeol Kwon, Ben Kretzu, Youliang Yu, Kaveh Hassani, Boris Vidolov, Yonathan Efroni
Main category: cs.LG
TL;DR: Multi-task RL post-training of LLMs suffers from gradient imbalance where certain tasks produce much larger gradients, biasing optimization toward those tasks despite not necessarily yielding better learning gains.
Details
Motivation: Standard multi-task post-training assumes all tasks contribute gradients of similar magnitudes, but this assumption fails in RL post-training, leading to biased optimization that favors large-gradient tasks regardless of actual learning benefits.Method: The paper analyzes gradient magnitudes across different tasks in RL post-training, examining the relationship between gradient sizes and actual learning gains (performance improvements).
Result: Large-gradient tasks don’t necessarily achieve better learning gains than small-gradient ones - they can have similar or even lower performance improvements. Gradient imbalances cannot be explained by typical training statistics like rewards or advantages.
Conclusion: Naive dataset mixing in multi-task RL post-training is problematic due to inherent gradient imbalances between tasks. Future work should develop principled gradient-level corrections for LLMs to address this issue.
Abstract: Multi-task post-training of large language models (LLMs) is typically performed by mixing datasets from different tasks and optimizing them jointly. This approach implicitly assumes that all tasks contribute gradients of similar magnitudes; when this assumption fails, optimization becomes biased toward large-gradient tasks. In this paper, however, we show that this assumption fails in RL post-training: certain tasks produce significantly larger gradients, thus biasing updates toward those tasks. Such gradient imbalance would be justified only if larger gradients implied larger learning gains on the tasks (i.e., larger performance improvements) – but we find this is not true. Large-gradient tasks can achieve similar or even much lower learning gains than small-gradient ones. Further analyses reveal that these gradient imbalances cannot be explained by typical training statistics such as training rewards or advantages, suggesting that they arise from the inherent differences between tasks. This cautions against naive dataset mixing and calls for future work on principled gradient-level corrections for LLMs.
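A small sketch of the kind of diagnostic implied above: measuring the gradient norm each task contributes under a shared model. The model and losses here are toy stand-ins, purely for illustration.

```python
import torch
import torch.nn.functional as F

def task_grad_norm(model, loss):
    """Norm of the gradient a single task's loss induces on the shared parameters."""
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, allow_unused=True)
    return torch.sqrt(sum(g.pow(2).sum() for g in grads if g is not None))

model = torch.nn.Linear(16, 4)                    # stand-in for the shared LLM
norms = {}
for task in ["math", "code"]:
    x = torch.randn(32, 16)
    loss = F.cross_entropy(model(x), torch.randint(0, 4, (32,)))
    norms[task] = task_grad_norm(model, loss).item()

print(norms)   # a large ratio between tasks is the imbalance discussed above
```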
[320] A Communication-Efficient Decentralized Actor-Critic Algorithm
Xiaoxing Ren, Nicola Bastianello, Thomas Parisini, Andreas A. Malikopoulos
Main category: cs.LG
TL;DR: Decentralized actor-critic framework for multi-agent reinforcement learning with limited communication, using local policy updates and neural network value approximation to reduce communication burden while maintaining coordination.
Details
Motivation: Address the problem of communication limitations in multi-agent reinforcement learning systems where frequent communication among agents is costly or impractical.Method: Developed a decentralized actor-critic learning framework where each agent performs multiple local updates of its policy and neural network-approximated value function before exchanging information with neighbors.
Result: Established finite-time convergence with sample complexity O(ε⁻³) and communication complexity O(ε⁻¹τ⁻¹), where τ is the number of local training steps. Numerical experiments in cooperative control validate the theoretical findings.
Conclusion: The proposed framework effectively reduces communication burden while maintaining coordination, with theoretical guarantees on convergence and complexity bounds that depend on neural network approximation quality.
Abstract: In this paper, we study the problem of reinforcement learning in multi-agent systems where communication among agents is limited. We develop a decentralized actor-critic learning framework in which each agent performs several local updates of its policy and value function, where the latter is approximated by a multi-layer neural network, before exchanging information with its neighbors. This local training strategy substantially reduces the communication burden while maintaining coordination across the network. We establish finite-time convergence analysis for the algorithm under Markov-sampling. Specifically, to attain the $\varepsilon$-accurate stationary point, the sample complexity is of order $\mathcal{O}(\varepsilon^{-3})$ and the communication complexity is of order $\mathcal{O}(\varepsilon^{-1}\tau^{-1})$, where $\tau$ denotes the number of local training steps. We also show how the final error bound depends on the neural network’s approximation quality. Numerical experiments in a cooperative control setting illustrate and validate the theoretical findings.
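A toy sketch of the communication pattern: each agent takes τ local updates before averaging parameters with its graph neighbors once per round. The actual actor-critic losses and Markov sampling are abstracted into a placeholder gradient.

```python
import numpy as np

def local_gradient(theta, rng):
    # Placeholder for the agent's actor-critic gradient estimate.
    return rng.standard_normal(theta.shape) * 0.01

def train_round(params, neighbors, tau, lr, rng):
    # 1) Local training: tau updates per agent, no communication.
    for theta in params:
        for _ in range(tau):
            theta -= lr * local_gradient(theta, rng)
    # 2) Communication: one averaging step with graph neighbors per round.
    return [np.mean([params[j] for j in [i] + neighbors[i]], axis=0)
            for i in range(len(params))]

rng = np.random.default_rng(0)
params = [rng.standard_normal(5) for _ in range(3)]
neighbors = {0: [1], 1: [0, 2], 2: [1]}
params = train_round(params, neighbors, tau=10, lr=0.1, rng=rng)
```

Increasing τ trades communication for local computation, which is why the stated communication complexity scales as O(ε⁻¹τ⁻¹).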
[321] An Active Diffusion Neural Network for Graphs
Mengying Jiang
Main category: cs.LG
TL;DR: ADGNN introduces active diffusion to overcome over-smoothing in GNNs by integrating external information sources and enabling true infinite diffusion through closed-form solutions.
Details
Motivation: Current diffusion-based GNNs suffer from over-smoothing and limited global graph information capture, similar to passive heat diffusion where node representations converge to identical vectors over time.Method: Proposes Active Diffusion-based Graph Neural Network (ADGNN) that integrates multiple external information sources to dynamically influence diffusion, and uses closed-form solutions for true infinite diffusion.
Result: ADGNN significantly improves both accuracy and efficiency compared to state-of-the-art GNN models across various graph tasks.
Conclusion: ADGNN effectively captures global graph information while maintaining node distinctiveness, overcoming the over-smoothing problem in traditional diffusion-based GNNs.
Abstract: The analogy to heat diffusion has enhanced our understanding of information flow in graphs and inspired the development of Graph Neural Networks (GNNs). However, most diffusion-based GNNs emulate passive heat diffusion, which still suffers from over-smoothing and limits their ability to capture global graph information. Inspired by the heat death of the universe, which posits that energy distribution becomes uniform over time in a closed system, we recognize that, without external input, node representations in a graph converge to identical feature vectors as diffusion progresses. To address this issue, we propose the Active Diffusion-based Graph Neural Network (ADGNN). ADGNN achieves active diffusion by integrating multiple external information sources that dynamically influence the diffusion process, effectively overcoming the over-smoothing problem. Furthermore, our approach realizes true infinite diffusion by directly calculating the closed-form solution of the active diffusion iterative formula. This allows nodes to preserve their unique characteristics while efficiently gaining comprehensive insights into the graph’s global structure. We evaluate ADGNN against several state-of-the-art GNN models across various graph tasks. The results demonstrate that ADGNN significantly improves both accuracy and efficiency, highlighting its effectiveness in capturing global graph information and maintaining node distinctiveness.
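The "closed-form infinite diffusion" idea can be illustrated with the generic linear diffusion-with-sources recursion used in propagation-style GNNs; the paper's actual formulation may differ, so this is only a reference point.

$$
X^{(t+1)} = \alpha\,\hat{A}\,X^{(t)} + (1-\alpha)\,B
\;\;\Longrightarrow\;\;
X^{\star} = (1-\alpha)\,\bigl(I - \alpha\hat{A}\bigr)^{-1} B,
$$

where $\hat{A}$ is a normalized adjacency matrix and $B$ collects the external information sources injected into the diffusion; the fixed point $X^{\star}$ is what running the iteration to $t\to\infty$ would produce, computed directly instead of by stacking layers.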
[322] Enhancing Graph Neural Networks: A Mutual Learning Approach
Paul Agbaje, Akajyoti Mitra, Afia Anjum, Pranali Khose, Ebelechukwu Nwafor, Habeeb Olufowobi
Main category: cs.LG
TL;DR: This paper proposes a collaborative learning framework for GNNs where student models mutually teach each other instead of using traditional knowledge distillation with a teacher model.
Details
Motivation: To enable efficient model deployment on resource-constrained devices by developing collaborative learning among GNNs without needing pre-trained teacher models.Method: A collaborative learning framework with ensembles of student GNNs that mutually teach each other, using adaptive logit weighting and entropy enhancement techniques for efficient knowledge exchange.
Result: The approach demonstrates effectiveness across three datasets for both node and graph classification tasks, showing improved performance particularly for multiple tasks.
Conclusion: Collaborative learning among GNNs can achieve better inference performance than traditional knowledge distillation, especially for handling multiple tasks, without requiring pre-trained teacher models.
Abstract: Knowledge distillation (KD) techniques have emerged as a powerful tool for transferring expertise from complex teacher models to lightweight student models, particularly beneficial for deploying high-performance models in resource-constrained devices. This approach has been successfully applied to graph neural networks (GNNs), harnessing their expressive capabilities to generate node embeddings that capture structural and feature-related information. In this study, we depart from the conventional KD approach by exploring the potential of collaborative learning among GNNs. In the absence of a pre-trained teacher model, we show that relatively simple and shallow GNN architectures can synergetically learn efficient models capable of performing better during inference, particularly in tackling multiple tasks. We propose a collaborative learning framework where ensembles of student GNNs mutually teach each other throughout the training process. We introduce an adaptive logit weighting unit to facilitate efficient knowledge exchange among models and an entropy enhancement technique to improve mutual learning. These components dynamically empower the models to adapt their learning strategies during training, optimizing their performance for downstream tasks. Extensive experiments conducted on three datasets each for node and graph classification demonstrate the effectiveness of our approach.
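A compact sketch of the mutual-teaching objective: each student minimizes its task loss plus a KL term toward its peer's detached predictions. The adaptive logit weighting and entropy enhancement from the paper are collapsed into a fixed weight here.

```python
import torch
import torch.nn.functional as F

def mutual_loss(logits_self, logits_peer, labels, beta=0.5):
    """Task loss plus a KL term pulling this student toward its peer's predictions."""
    ce = F.cross_entropy(logits_self, labels)
    kl = F.kl_div(F.log_softmax(logits_self, dim=-1),
                  F.softmax(logits_peer.detach(), dim=-1),
                  reduction="batchmean")
    return ce + beta * kl

logits_a, logits_b = torch.randn(64, 7), torch.randn(64, 7)   # two student GNN outputs
labels = torch.randint(0, 7, (64,))
loss_a = mutual_loss(logits_a, logits_b, labels)   # student A learns from B
loss_b = mutual_loss(logits_b, logits_a, labels)   # and B learns from A
```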
[323] Controllable Machine Unlearning via Gradient Pivoting
Youngsik Hwang, Dong-Young Lim
Main category: cs.LG
TL;DR: The paper reframes machine unlearning as a multi-objective optimization problem and introduces CUP algorithm with a pivoting gradient mechanism to navigate the Pareto frontier between unlearning efficacy and model fidelity.
Details
Motivation: Current approximate unlearning methods face critical trade-offs between unlearning efficacy and model fidelity, leading to over-forgetting, lack of fine-grained control, and absence of holistic evaluation metrics.Method: Proposes Controllable Unlearning by Pivoting Gradient (CUP) algorithm that treats MU as MOO problem, featuring a unique pivoting mechanism to navigate the entire Pareto frontier using a single ‘unlearning intensity’ hyperparameter.
Result: CUP produces superior Pareto-optimal solutions, consistently outperforming existing methods across various vision tasks, as measured by the hypervolume indicator capturing both solution quality and diversity.
Conclusion: Reframing MU as MOO with CUP’s pivoting mechanism provides fine-grained control over the unlearning process and enables holistic evaluation of the trade-off between unlearning efficacy and model fidelity.
Abstract: Machine unlearning (MU) aims to remove the influence of specific data from a trained model. However, approximate unlearning methods, often formulated as a single-objective optimization (SOO) problem, face a critical trade-off between unlearning efficacy and model fidelity. This leads to three primary challenges: the risk of over-forgetting, a lack of fine-grained control over the unlearning process, and the absence of metrics to holistically evaluate the trade-off. To address these issues, we reframe MU as a multi-objective optimization (MOO) problem. We then introduce a novel algorithm, Controllable Unlearning by Pivoting Gradient (CUP), which features a unique pivoting mechanism. Unlike traditional MOO methods that converge to a single solution, CUP’s mechanism is designed to controllably navigate the entire Pareto frontier. This navigation is governed by a single intuitive hyperparameter, the ‘unlearning intensity’, which allows for precise selection of a desired trade-off. To evaluate this capability, we adopt the hypervolume indicator, a metric that captures both the quality and diversity of the entire set of solutions an algorithm can generate. Our experimental results demonstrate that CUP produces a superior set of Pareto-optimal solutions, consistently outperforming existing methods across various vision tasks.
[324] Brain-Inspired Perspective on Configurations: Unsupervised Similarity and Early Cognition
Juntang Wang, Yihan Wang, Hao Wu, Dongmian Zou, Shixin Xu
Main category: cs.LG
TL;DR: Configurations is a brain-inspired clustering framework that achieves hierarchical organization, novelty detection, and adaptive learning using attraction-repulsion dynamics and a single resolution parameter.
Details
Motivation: To develop AI systems that can learn categories, detect novelty, and adapt to new contexts without supervision, similar to how infants learn, which is challenging for current machine learning approaches.Method: Uses a finite-resolution clustering framework with attraction-repulsion dynamics and a single resolution parameter. Introduces mheatmap for proportional heatmaps and reassignment algorithm to evaluate multi-resolution and dynamic behavior.
Result: Competitive performance on standard clustering metrics, 87% AUC in novelty detection, and 35% better stability during dynamic category evolution across datasets.
Conclusion: Configurations represents a principled computational model of early cognitive categorization and advances brain-inspired AI by mimicking infant learning capabilities.
Abstract: Infants discover categories, detect novelty, and adapt to new contexts without supervision – a challenge for current machine learning. We present a brain-inspired perspective on configurations, a finite-resolution clustering framework that uses a single resolution parameter and attraction-repulsion dynamics to yield hierarchical organization, novelty sensitivity, and flexible adaptation. To evaluate these properties, we introduce mheatmap, which provides proportional heatmaps and a reassignment algorithm to fairly assess multi-resolution and dynamic behavior. Across datasets, configurations are competitive on standard clustering metrics, achieve 87% AUC in novelty detection, and show 35% better stability during dynamic category evolution. These results position configurations as a principled computational model of early cognitive categorization and a step toward brain-inspired AI.
[325] Understanding the Implicit Biases of Design Choices for Time Series Foundation Models
Annan Yu, Danielle C. Maddix, Boran Han, Xiyuan Zhang, Abdul Fatir Ansari, Oleksandr Shchur, Christos Faloutsos, Andrew Gordon Wilson, Michael W. Mahoney, Yuyang Wang
Main category: cs.LG
TL;DR: This paper analyzes how subtle design choices in time series foundation models (TSFMs) create implicit biases that affect model behavior, rather than proposing a new model.
Details
Motivation: To understand how various training process "knobs" affect model quality and identify implicit biases in TSFMs, instead of just developing another model that performs better on benchmarks.Method: Using a mix of theory and controlled empirical evaluation to examine design choices like patch size, embedding choice, and training objectives, and analyzing how they lead to biases in temporal behavior, geometric structure, and regression tendencies.
Result: Identified that design choices create implicit biases that can be either intuitive or counterintuitive depending on model and data properties, and showed how multiple biases interact in complex ways through a case study on outlier handling.
Conclusion: Provides insights into the implications for learning the “bitter lesson” and building better TSFMs by understanding these fundamental biases and their interactions.
Abstract: Time series foundation models (TSFMs) are a class of potentially powerful, general-purpose tools for time series forecasting and related temporal tasks, but their behavior is strongly shaped by subtle inductive biases in their design. Rather than developing a new model and claiming that it is better than existing TSFMs, e.g., by winning on existing well-established benchmarks, our objective is to understand how the various "knobs" of the training process affect model quality. Using a mix of theory and controlled empirical evaluation, we identify several design choices (patch size, embedding choice, training objective, etc.) and show how they lead to implicit biases in fundamental model properties (temporal behavior, geometric structure, how aggressively or not the model regresses to the mean, etc.); and we show how these biases can be intuitive or very counterintuitive, depending on properties of the model and data. We also illustrate in a case study on outlier handling how multiple biases can interact in complex ways; and we discuss implications of our results for learning the bitter lesson and building TSFMs.
[326] SPOT: Scalable Policy Optimization with Trees for Markov Decision Processes
Xuyuan Xiong, Pedro Chumpitaz-Flores, Kaixun Hua, Cheng Hua
Main category: cs.LG
TL;DR: SPOT is a novel method that formulates decision tree policy optimization in MDPs as a mixed-integer linear program (MILP) using reduced-space branch-and-bound for efficient parallel search, achieving significant speedup and scalability while maintaining interpretable policies.
Details
Motivation: Interpretable reinforcement learning policies are essential for high-stakes decision-making, but optimizing decision tree policies in MDPs remains challenging due to computational complexity and scalability issues.Method: SPOT formulates the optimization problem as a mixed-integer linear program (MILP) and employs a reduced-space branch-and-bound approach that decouples MDP dynamics from tree-structure constraints, enabling efficient parallel search.
Result: Experimental results show SPOT achieves substantial speedup, scales to larger MDPs with more states, and produces interpretable, compact decision tree policies without compromising performance - delivering high-quality policies an order of magnitude faster than existing approaches.
Conclusion: SPOT simultaneously achieves interpretability and scalability in reinforcement learning policies, providing an efficient solution for computing optimal decision tree policies in MDPs.
Abstract: Interpretable reinforcement learning policies are essential for high-stakes decision-making, yet optimizing decision tree policies in Markov Decision Processes (MDPs) remains challenging. We propose SPOT, a novel method for computing decision tree policies, which formulates the optimization problem as a mixed-integer linear program (MILP). To enhance efficiency, we employ a reduced-space branch-and-bound approach that decouples the MDP dynamics from tree-structure constraints, enabling efficient parallel search. This significantly improves runtime and scalability compared to previous methods. Our approach ensures that each iteration yields the optimal decision tree. Experimental results on standard benchmarks demonstrate that SPOT achieves substantial speedup and scales to larger MDPs with a significantly higher number of states. The resulting decision tree policies are interpretable and compact, maintaining transparency without compromising performance. These results demonstrate that our approach simultaneously achieves interpretability and scalability, delivering high-quality policies an order of magnitude faster than existing approaches.
[327] Interpret Policies in Deep Reinforcement Learning using SILVER with RL-Guided Labeling: A Model-level Approach to High-dimensional and Multi-action Environments
Yiyu Qian, Su Nguyen, Chao Chen, Qinyue Zhou, Liyuan Zhao
Main category: cs.LG
TL;DR: SILVER with RL-guided labeling extends the original SILVER framework to handle multi-action and high-dimensional environments by incorporating RL policy outputs into boundary identification, improving interpretability while maintaining performance.
Details
Motivation: Deep RL achieves strong performance but lacks interpretability, limiting trust in policy behavior. Existing SILVER framework is restricted to low-dimensional, binary-action domains.Method: Extracts compact feature representations from images, performs SHAP-based feature attribution, uses RL-guided labeling for boundary datasets, and trains surrogate models (decision trees, regression functions) to interpret policy decisions.
Result: Maintains competitive task performance while substantially improving transparency and human understanding of agent behavior in Atari environments with three deep RL algorithms.
Conclusion: Transforms SILVER into a scalable, behavior-aware framework for interpreting deep RL agents in high-dimensional, multi-action settings, advancing explainable RL.
Abstract: Deep reinforcement learning (RL) achieves remarkable performance but lacks interpretability, limiting trust in policy behavior. The existing SILVER framework (Li, Siddique, and Cao 2025) explains RL policies via Shapley-based regression but remains restricted to low-dimensional, binary-action domains. We propose SILVER with RL-guided labeling, an enhanced variant that extends SILVER to multi-action and high-dimensional environments by incorporating the RL policy’s own action outputs into boundary point identification. Our method first extracts compact feature representations from image observations, performs SHAP-based feature attribution, and then employs RL-guided labeling to generate behaviorally consistent boundary datasets. Surrogate models, such as decision trees and regression-based functions, are subsequently trained to interpret the RL policy’s decision structure. We evaluate the proposed framework on two Atari environments using three deep RL algorithms and conduct a human-subject study to assess the clarity and trustworthiness of the derived interpretable policy. Results show that our approach maintains competitive task performance while substantially improving transparency and human understanding of agent behavior. This work advances explainable RL by transforming SILVER into a scalable and behavior-aware framework for interpreting deep RL agents in high-dimensional, multi-action settings.
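An illustrative pipeline in the spirit of the description above: attribute a policy's action choices with SHAP and fit an interpretable surrogate tree using the policy's own predicted actions as labels (the "RL-guided labeling" idea). The policy here is a stand-in classifier rather than a deep RL agent, and the feature extraction step is omitted.

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
features = rng.standard_normal((500, 6))               # compact per-frame features
policy = RandomForestClassifier(random_state=0).fit(features, rng.integers(0, 4, 500))

policy_actions = policy.predict(features)              # labels come from the policy itself
shap_values = shap.Explainer(policy)(features)         # per-feature attributions
surrogate = DecisionTreeClassifier(max_depth=4).fit(features, policy_actions)

print("attribution tensor shape:", shap_values.values.shape)
print("surrogate fidelity:", surrogate.score(features, policy_actions))
```

Labeling the boundary data with the policy's own actions keeps the surrogate behaviorally consistent with the agent, which is what distinguishes this variant from the original SILVER setup.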
[328] Mixing Configurations for Downstream Prediction
Juntang Wang, Hao Wu, Runkun Guo, Yihan Wang, Dongmian Zou, Shixin Xu
Main category: cs.LG
TL;DR: GraMixC is a plug-and-play module that extracts hierarchical clustering configurations from Vision Transformers, aligns them using RMS technique, and fuses them via attention to improve downstream task performance.
Details
Motivation: To emulate human grouping ability and leverage emergent hierarchical clustering structures found in register tokens of Vision Transformers, while eliminating redundancy and ad hoc selection requirements.Method: Extract configurations, align them using Reverse Merge/Split (RMS) technique, and fuse them via attention heads before forwarding to downstream predictors.
Result: On DSN1 16S rRNA cultivation-media prediction, improved R2 score from 0.6 to 0.9; consistently outperformed single-resolution and static-feature baselines on standard tabular benchmarks.
Conclusion: GraMixC effectively leverages hierarchical clustering configurations to significantly improve performance across multiple tasks, setting new state-of-the-art results.
Abstract: Humans possess an innate ability to group objects by similarity, a cognitive mechanism that clustering algorithms aim to emulate. Recent advances in community detection have enabled the discovery of configurations – valid hierarchical clusterings across multiple resolution scales – without requiring labeled data. In this paper, we formally characterize these configurations and identify similar emergent structures in register tokens within Vision Transformers. Unlike register tokens, configurations exhibit lower redundancy and eliminate the need for ad hoc selection. They can be learned through unsupervised or self-supervised methods, yet their selection or composition remains specific to the downstream task and input. Building on these insights, we introduce GraMixC, a plug-and-play module that extracts configurations, aligns them using our Reverse Merge/Split (RMS) technique, and fuses them via attention heads before forwarding them to any downstream predictor. On the DSN1 16S rRNA cultivation-media prediction task, GraMixC improves the R2 score from 0.6 to 0.9 across multiple methods, setting a new state of the art. We further validate GraMixC on standard tabular benchmarks, where it consistently outperforms single-resolution and static-feature baselines.
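A toy sketch of fusing per-resolution configuration features with an attention head before a downstream predictor; the dimensions, pooling, and learned query are illustrative choices, not the GraMixC architecture itself.

```python
import torch
import torch.nn as nn

class ConfigFusion(nn.Module):
    """Fuses features from several clustering resolutions with a learned query."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, config_feats):                 # (batch, n_resolutions, dim)
        q = self.query.expand(config_feats.size(0), -1, -1)
        fused, _ = self.attn(q, config_feats, config_feats)
        return fused.squeeze(1)                      # (batch, dim) fed to the predictor

fusion = ConfigFusion(dim=32)
out = fusion(torch.randn(16, 5, 32))                 # five resolution scales per sample
```

Letting attention weight the resolutions per input is what allows the downstream task, rather than an ad hoc rule, to decide which scales of the configuration matter.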
[329] FnRGNN: Distribution-aware Fairness in Graph Neural Network
Soyoung Park, Sungsu Lim
Main category: cs.LG
TL;DR: FnRGNN is a fairness-aware framework for GNN-based node regression that applies interventions at structure, representation, and prediction levels to reduce group disparities while maintaining performance.
Details
Motivation: Fairness in GNN regression tasks is underexplored, with existing approaches mainly focusing on classification and representation-level debiasing, which cannot fully address the continuous nature of node-level regression.Method: Multi-level framework with structure-level edge reweighting, representation-level alignment via MMD, and prediction-level normalization through Sinkhorn-based distribution matching.
Result: Experiments on four real-world datasets show FnRGNN reduces group disparities without sacrificing performance.
Conclusion: The proposed multi-level strategy ensures robust fairness under complex graph topologies for GNN-based node regression tasks.
Abstract: Graph Neural Networks (GNNs) excel at learning from structured data, yet fairness in regression tasks remains underexplored. Existing approaches mainly target classification and representation-level debiasing, which cannot fully address the continuous nature of node-level regression. We propose FnRGNN, a fairness-aware in-processing framework for GNN-based node regression that applies interventions at three levels: (i) structure-level edge reweighting, (ii) representation-level alignment via MMD, and (iii) prediction-level normalization through Sinkhorn-based distribution matching. This multi-level strategy ensures robust fairness under complex graph topologies. Experiments on four real-world datasets demonstrate that FnRGNN reduces group disparities without sacrificing performance. Code is available at https://github.com/sybeam27/FnRGNN.
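The representation-level MMD alignment can be sketched as follows; the RBF kernel, its bandwidth, and the way node embeddings are split by a sensitive attribute are illustrative choices rather than details taken from the paper.

```python
import torch

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Squared MMD with an RBF kernel between two groups of node embeddings."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# toy node embeddings split by a binary sensitive attribute
h_group0 = torch.randn(128, 32)
h_group1 = torch.randn(96, 32) + 0.5
fair_loss = rbf_mmd2(h_group0, h_group1)   # added to the regression loss with some weight
print(fair_loss.item())
```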
[330] Data Efficient Any Transformer-to-Mamba Distillation via Attention Bridge
Penghao Wang, Yuhao Zhou, Mengxuan Wu, Panpan Zhang, Zhangyang Wang, Kai Wang
Main category: cs.LG
TL;DR: CAB is a novel distillation framework that efficiently transfers attention knowledge from Transformer teachers to state-space student models using token-level supervision and flexible layer-wise alignment.
Details
Motivation: State-space models offer superior scalability but have costly training and less mature ecosystem compared to Transformers. Structural heterogeneity makes it challenging to distill knowledge from pretrained attention models.Method: Cross-architecture distillation via Attention Bridge (CAB) with token-level supervision via lightweight bridge and flexible layer-wise alignment strategies to accommodate architectural discrepancies.
Result: Extensive experiments across vision and language domains show consistent improvement in state-space model performance, even with limited training data, outperforming standard and cross-architecture distillation methods.
Conclusion: Attention-based knowledge can be efficiently transferred to recurrent models, enabling rapid utilization of Transformer expertise for building a stronger SSM community.
Abstract: State-space models (SSMs) have emerged as efficient alternatives to Transformers for sequence modeling, offering superior scalability through recurrent structures. However, their training remains costly and the ecosystem around them is far less mature than that of Transformers. Moreover, the structural heterogeneity between SSMs and Transformers makes it challenging to efficiently distill knowledge from pretrained attention models. In this work, we propose Cross-architecture distillation via Attention Bridge (CAB), a novel data-efficient distillation framework that efficiently transfers attention knowledge from Transformer teachers to state-space student models. Unlike conventional knowledge distillation that transfers knowledge only at the output level, CAB enables token-level supervision via a lightweight bridge and flexible layer-wise alignment, improving both efficiency and transferability. We further introduce flexible layer-wise alignment strategies to accommodate architectural discrepancies between teacher and student. Extensive experiments across vision and language domains demonstrate that our method consistently improves the performance of state-space models, even under limited training data, outperforming both standard and cross-architecture distillation methods. Our findings suggest that attention-based knowledge can be efficiently transferred to recurrent models, enabling rapid utilization of Transformer expertise for building a stronger SSM community.
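A minimal sketch of what a lightweight bridge with token-level supervision could look like: project student hidden states into the teacher's hidden size and match them token by token. The single linear projection and the MSE objective are assumptions; the paper's bridge and its layer-wise alignment strategies are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionBridge(nn.Module):
    """Hypothetical lightweight bridge: map SSM student states into the
    teacher's hidden size and supervise them token by token."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        self.proj = nn.Linear(d_student, d_teacher)

    def forward(self, student_h: torch.Tensor, teacher_h: torch.Tensor) -> torch.Tensor:
        # student_h: (B, T, d_student); teacher_h: (B, T, d_teacher), frozen teacher
        return F.mse_loss(self.proj(student_h), teacher_h.detach())

bridge = AttentionBridge(d_student=256, d_teacher=768)
distill_loss = bridge(torch.randn(2, 16, 256), torch.randn(2, 16, 768))
print(distill_loss.item())
```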
[331] Knowledge Distillation of Uncertainty using Deep Latent Factor Model
Sehyun Park, Jongjin Lee, Yunseop Shin, Ilsang Ohn, Yongdai Kim
Main category: cs.LG
TL;DR: Gaussian distillation compresses teacher ensembles into student distributions using deep latent factor models, outperforming existing methods in uncertainty preservation while reducing computational costs.
Details
Motivation: Deep ensembles provide excellent uncertainty quantification but are computationally expensive for real-world applications like on-device AI. Knowledge distillation struggles to preserve uncertainty when compressing ensembles.Method: Proposes Gaussian distillation using deep latent factor models to estimate teacher ensemble distributions. Uses EM algorithm for stable estimation of mean and covariance functions.
Result: Outperforms existing baselines on multiple benchmark datasets. Works well for language model fine-tuning and distribution shift problems.
Conclusion: Gaussian distillation effectively compresses ensembles while preserving uncertainty, making it suitable for practical deployments where computational resources are limited.
Abstract: Deep ensembles deliver state-of-the-art, reliable uncertainty quantification, but their heavy computational and memory requirements hinder their practical deployments to real applications such as on-device AI. Knowledge distillation compresses an ensemble into small student models, but existing techniques struggle to preserve uncertainty partly because reducing the size of DNNs typically results in variation reduction. To resolve this limitation, we introduce a new method of distribution distillation (i.e. compressing a teacher ensemble into a student distribution instead of a student ensemble) called Gaussian distillation, which estimates the distribution of a teacher ensemble through a special Gaussian process called the deep latent factor model (DLF) by treating each member of the teacher ensemble as a realization of a certain stochastic process. The mean and covariance functions in the DLF model are estimated stably by using the expectation-maximization (EM) algorithm. By using multiple benchmark datasets, we demonstrate that the proposed Gaussian distillation outperforms existing baselines. In addition, we illustrate that Gaussian distillation works well for fine-tuning of language models and distribution shift problems.
[332] QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation
Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen
Main category: cs.LG
TL;DR: QiMeng-SALV introduces signal-aware learning for Verilog code generation, using functionally correct output signal segments to optimize RL training through signal-level DPO, achieving state-of-the-art performance with a 7B model matching DeepSeek v3 671B.
Details
Motivation: The lack of meaningful functional rewards hinders RL-based preference optimization for generating functionally correct Verilog code in automated circuit design.Method: Extracts verified signal-aware implementations from partially incorrect modules by comparing signal functionality with reference modules, uses AST to identify correct signal-level code segments, and applies signal-aware DPO for optimization.
Result: Achieves state-of-the-art performance on VerilogEval and RTLLM benchmarks, with a 7B parameter model matching DeepSeek v3 671B’s performance and significantly outperforming CodeV.
Conclusion: The method enables a paradigm shift from module-level to fine-grained signal-level optimization in Verilog code generation, effectively addressing insufficient functional rewards.
Abstract: The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation, which is important for automated circuit design. The lack of meaningful functional rewards hinders preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV), which leverages code segments of functionally correct output signals to optimize RL training. Because Verilog code specifies the structural interconnection of hardware gates and wires, so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations from partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in a generated module by comparing them with those of the reference module in the training data. An abstract syntax tree (AST) is then employed to identify signal-aware code segments which can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO, which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/zy1xxx/SALV.
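The signal-aware DPO step can be illustrated with the standard DPO objective applied at segment level, where "good" and "bad" segments correspond to output signals that do or do not match the reference module. The β value and the toy log-probabilities below are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def signal_dpo_loss(logp_good, logp_bad, ref_logp_good, ref_logp_bad, beta: float = 0.1):
    """Standard DPO loss, here over signal-level code segments: `good` segments
    drive output signals that match the reference module, `bad` segments do not.
    Inputs are summed token log-probabilities for each segment."""
    margin = (logp_good - ref_logp_good) - (logp_bad - ref_logp_bad)
    return -F.logsigmoid(beta * margin).mean()

# toy log-probabilities for a batch of four segment pairs
lp_g = torch.tensor([-12.0, -9.5, -14.2, -8.0])
lp_b = torch.tensor([-11.0, -10.0, -13.0, -9.0])
rp_g = torch.tensor([-13.0, -9.8, -14.0, -8.5])
rp_b = torch.tensor([-10.5, -9.9, -13.5, -8.8])
print(signal_dpo_loss(lp_g, lp_b, rp_g, rp_b).item())
```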
[333] Loopholing Discrete Diffusion: Deterministic Bypass of the Sampling Wall
Mingyu Jo, Jaesik Yoon, Justin Deschenaux, Caglar Gulcehre, Sungjin Ahn
Main category: cs.LG
TL;DR: Loopholing Discrete Diffusion Models (LDDMs) introduce a deterministic latent pathway to preserve distributional information during sampling, overcoming the sampling wall problem in discrete diffusion models and achieving substantial improvements in text generation quality.
Details
Motivation: Discrete diffusion models suffer from a sampling wall where categorical sampling collapses rich distributional information into one-hot vectors, preventing information propagation across steps and forcing subsequent steps to operate with limited information.Method: Introduce Loopholing mechanism that preserves distributional information via a deterministic latent pathway, trained efficiently with self-conditioning strategy to create Loopholing Discrete Diffusion Models (LDDMs).
Result: LDDMs reduce generative perplexity by up to 61% over prior baselines, close (and sometimes surpass) the gap with autoregressive models, produce more coherent text, and improve performance on reasoning tasks like arithmetic benchmarks (Countdown and Game of 24).
Conclusion: Loopholing mitigates idle steps and oscillations, providing a scalable path toward high-quality non-autoregressive text generation by preserving distributional information throughout the sampling process.
Abstract: Discrete diffusion models offer a promising alternative to autoregressive generation through parallel decoding, but they suffer from a sampling wall: once categorical sampling occurs, rich distributional information collapses into one-hot vectors and cannot be propagated across steps, forcing subsequent steps to operate with limited information. To mitigate this problem, we introduce Loopholing, a novel and simple mechanism that preserves this information via a deterministic latent pathway, leading to Loopholing Discrete Diffusion Models (LDDMs). Trained efficiently with a self-conditioning strategy, LDDMs achieve substantial gains-reducing generative perplexity by up to 61% over prior baselines, closing (and in some cases surpassing) the gap with autoregressive models, and producing more coherent text. Applied to reasoning tasks, LDDMs also improve performance on arithmetic benchmarks such as Countdown and Game of 24. These results also indicate that loopholing mitigates idle steps and oscillations, providing a scalable path toward high-quality non-autoregressive text generation.
[334] FrogDeepSDM: Improving Frog Counting and Occurrence Prediction Using Multimodal Data and Pseudo-Absence Imputation
Chirag Padubidri, Pranesh Velmurugan, Andreas Lanitis, Andreas Kamilaris
Main category: cs.LG
TL;DR: Enhanced Species Distribution Modelling for frogs using deep learning and data imputation, achieving significant improvements in counting accuracy and habitat classification through multimodal ensemble models.
Details
Motivation: Traditional species monitoring methods have limited coverage and completeness, requiring improved predictive models for better conservation strategies.Method: Applied deep learning with data balancing, feature selection, and multimodal ensemble modeling combining land cover, NDVI, and environmental data.
Result: Reduced MAE from 189 to 29 in frog counting; achieved 84.9% accuracy with 0.90 AUC in habitat classification; multimodal ensemble outperformed individual models.
Conclusion: Multimodal learning and data preprocessing techniques can significantly improve ecological modeling accuracy for biodiversity monitoring with sparse data.
Abstract: Monitoring species distribution is vital for conservation efforts, enabling the assessment of environmental impacts and the development of effective preservation strategies. Traditional data collection methods, including citizen science, offer valuable insights but remain limited in coverage and completeness. Species Distribution Modelling (SDM) helps address these gaps by using occurrence data and environmental variables to predict species presence across large regions. In this study, we enhance SDM accuracy for frogs (Anura) by applying deep learning and data imputation techniques using data from the “EY - 2022 Biodiversity Challenge.” Our experiments show that data balancing significantly improved model performance, reducing the Mean Absolute Error (MAE) from 189 to 29 in frog counting tasks. Feature selection identified key environmental factors influencing occurrence, optimizing inputs while maintaining predictive accuracy. The multimodal ensemble model, integrating land cover, NDVI, and other environmental inputs, outperformed individual models and showed robust generalization across unseen regions. The fusion of image and tabular data improved both frog counting and habitat classification, achieving 84.9% accuracy with an AUC of 0.90. This study highlights the potential of multimodal learning and data preprocessing techniques such as balancing and imputation to improve predictive ecological modeling when data are sparse or incomplete, contributing to more precise and scalable biodiversity monitoring.
[335] Calibration and Discrimination Optimization Using Clusters of Learned Representation
Tomer Lavi, Bracha Shapira, Nadav Rappoport
Main category: cs.LG
TL;DR: A novel calibration pipeline using ensemble of calibration functions on clustered representations improves calibration scores up to 100% while maintaining discrimination.
Details
Motivation: Machine learning models need highly reliable predictions for critical decisions like clinical predictions, where calibration is crucial but often overlooked.Method: Leverages ensemble of calibration functions trained on clusters of learned representations of input samples, with a unique matching metric for model selection.
Result: Improves calibration score of various methods from 82.28% up to 100% while optimizing both discrimination and calibration.
Conclusion: The generic scheme adapts to any underlying representation, clustering, calibration methods and metric, offering flexibility and superior performance across commonly used calibration methods.
Abstract: Machine learning models are essential for decision-making and risk assessment, requiring highly reliable predictions in terms of both discrimination and calibration. While calibration often receives less attention, it is crucial for critical decisions, such as those in clinical predictions. We introduce a novel calibration pipeline that leverages an ensemble of calibration functions trained on clusters of learned representations of the input samples to enhance overall calibration. This approach not only improves the calibration score of various methods from 82.28% up to 100% but also introduces a unique matching metric that ensures model selection optimizes both discrimination and calibration. Our generic scheme adapts to any underlying representation, clustering, calibration methods and metric, offering flexibility and superior performance across commonly used calibration methods.
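A minimal sketch of the core idea, under stand-in choices (KMeans for clustering, isotonic regression for calibration), since the paper states the scheme is agnostic to the underlying representation, clustering, and calibration methods:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 16))                       # learned representations
p = rng.uniform(size=1000)                            # uncalibrated probabilities
y = (rng.uniform(size=1000) < p ** 2).astype(int)     # labels: true risk is p**2

clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(z)
calibrators = {
    c: IsotonicRegression(out_of_bounds="clip").fit(p[clusters == c], y[clusters == c])
    for c in np.unique(clusters)
}

# calibrate each prediction with the function belonging to its cluster
p_cal = np.array([calibrators[c].predict([pi])[0] for pi, c in zip(p, clusters)])
print(p_cal[:5])
```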
[336] Every Attention Matters: An Efficient Hybrid Architecture for Long-Context Reasoning
Ling Team, Bin Han, Caizhi Tang, Chen Liang, Donghao Zhang, Fan Yuan, Feng Zhu, Jie Gao, Jingyu Hu, Longfei Li, Meng Li, Mingyang Zhang, Peijie Jiang, Peng Jiao, Qian Zhao, Qingyuan Yang, Wenbo Shen, Xinxing Yang, Yalin Zhang, Yankun Ren, Yao Zhao, Yibo Cao, Yixuan Sun, Yue Zhang, Yuchen Fang, Zibin Lin, Zixuan Cheng, Jun Zhou
Main category: cs.LG
TL;DR: The Ring-linear model series introduces hybrid attention architectures that combine linear and softmax attention, achieving significant inference cost reductions while maintaining state-of-the-art performance on complex reasoning tasks.
Details
Motivation: To address the high I/O and computational overhead in long-context inference scenarios by developing more efficient attention mechanisms that reduce costs while maintaining performance.Method: Developed Ring-mini-linear-2.0 (16B params) and Ring-flash-linear-2.0 (104B params) models with hybrid linear+softmax attention architecture, systematically explored attention mechanism ratios, and used custom FP8 operator library (linghe) for efficiency.
Result: Achieved 10x inference cost reduction vs 32B dense model and >50% cost reduction vs original Ring series. Improved training efficiency by 50% and maintained SOTA performance across multiple complex reasoning benchmarks.
Conclusion: The hybrid linear attention architecture successfully balances efficiency and performance, enabling long-term stable optimization during RL phases while significantly reducing computational costs for long-context inference.
Abstract: In this technical report, we present the Ring-linear model series, specifically including Ring-mini-linear-2.0 and Ring-flash-linear-2.0. Ring-mini-linear-2.0 comprises 16B parameters and 957M activations, while Ring-flash-linear-2.0 contains 104B parameters and 6.1B activations. Both models adopt a hybrid architecture that effectively integrates linear attention and softmax attention, significantly reducing I/O and computational overhead in long-context inference scenarios. Compared to a 32 billion parameter dense model, this series reduces inference cost to 1/10, and compared to the original Ring series, the cost is also reduced by over 50%. Furthermore, through systematic exploration of the ratio between different attention mechanisms in the hybrid architecture, we have identified the currently optimal model structure. Additionally, by leveraging our self-developed high-performance FP8 operator library-linghe, overall training efficiency has been improved by 50%. Benefiting from the high alignment between the training and inference engine operators, the models can undergo long-term, stable, and highly efficient optimization during the reinforcement learning phase, consistently maintaining SOTA performance across multiple challenging complex reasoning benchmarks.
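A toy illustration of the hybrid idea, interleaving linear-attention layers with softmax-attention layers at a fixed ratio. The (elu + 1) kernel feature map, single-head attention, and 3:1 ratio are assumptions for illustration, not the Ring-linear architecture.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """O(T*d^2) attention with the (elu + 1) kernel feature map."""
    q, k = F.elu(q) + 1, F.elu(k) + 1
    kv = torch.einsum("btd,bte->bde", k, v)                  # sum_t phi(k_t) v_t^T
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(1)) + eps)
    return torch.einsum("btd,bde,bt->bte", q, kv, z)

def softmax_attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(-1) @ v

x = torch.randn(2, 128, 64)
# hybrid stack: three linear-attention layers for every softmax-attention layer
for i in range(8):
    attn = softmax_attention if (i + 1) % 4 == 0 else linear_attention
    x = x + attn(x, x, x)       # residual only; projections/MLP/norm omitted
print(x.shape)
```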
[337] Foundation Model Forecasts: Form and Function
Alvaro Perez-Diaz, James C. Loach, Danielle E. Toutoungi, Lee Middleton
Main category: cs.LG
TL;DR: Time-series foundation models often produce limited forecast types (point or parametric) while many operational tasks require trajectory ensembles that preserve temporal dependence. The paper establishes when forecast types can be converted and proves that marginals cannot determine path-dependent event probabilities.
Details
Motivation: Current time-series foundation models focus on forecast accuracy but neglect the practical utility of different forecast types. Many operational tasks require trajectory ensembles with temporal dependence, which most models don't provide.Method: Surveyed recent TSFMs, established conversion rules between forecast types, proved mathematical limitations of marginal distributions, and mapped operational tasks to minimal sufficient forecast types.
Result: Found that two-thirds of TSFMs produce only point or parametric forecasts. Proved that marginals cannot determine path-dependent event probabilities. Developed a task-aligned evaluation framework.
Conclusion: Forecast type, not just accuracy, determines practical utility. Trajectory ensembles are essential for many operational tasks, and conversion from simpler forecast types requires imposing temporal dependence.
Abstract: Time-series foundation models (TSFMs) achieve strong forecast accuracy, yet accuracy alone does not determine practical value. The form of a forecast – point, quantile, parametric, or trajectory ensemble – fundamentally constrains which operational tasks it can support. We survey recent TSFMs and find that two-thirds produce only point or parametric forecasts, while many operational tasks require trajectory ensembles that preserve temporal dependence. We establish when forecast types can be converted and when they cannot: trajectory ensembles convert to simpler forms via marginalization without additional assumptions, but the reverse requires imposing temporal dependence through copulas or conformal methods. We prove that marginals cannot determine path-dependent event probabilities – infinitely many joint distributions share identical marginals but yield different answers to operational questions. We map six fundamental forecasting tasks to minimal sufficient forecast types and provide a task-aligned evaluation framework. Our analysis clarifies when forecast type, not accuracy, differentiates practical utility.
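The claim that marginals cannot determine path-dependent event probabilities is easy to demonstrate numerically: with a toy AR(1) trajectory ensemble, the probability that the series ever exceeds a threshold differs from what per-step marginals combined under an independence assumption would suggest. The model, horizon, and threshold below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, horizon, threshold = 10_000, 12, 1.5

# trajectory ensemble from a toy AR(1) forecast
paths = np.zeros((n_paths, horizon))
for t in range(1, horizon):
    paths[:, t] = 0.8 * paths[:, t - 1] + rng.normal(scale=0.5, size=n_paths)

# path-dependent event: does the trajectory ever exceed the threshold?
p_event = (paths.max(axis=1) > threshold).mean()

# marginals only give per-step exceedance probabilities; combining them
# under an independence assumption does not recover the event probability
p_step = (paths > threshold).mean(axis=0)
p_independent = 1 - np.prod(1 - p_step)

print(f"ensemble estimate:      {p_event:.3f}")
print(f"independence heuristic: {p_independent:.3f}")   # differs due to temporal dependence
```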
[338] A New Type of Adversarial Examples
Xingyang Nie, Guojie Xiao, Su Pan, Biao Wang, Huilin Ge, Tao Fang
Main category: cs.LG
TL;DR: The paper introduces ‘negative adversarial examples’ - inputs that are significantly different from original examples but produce the same model output, using novel algorithms like NI-FGSM and NMI-FGSM.
Details
Motivation: Most ML models are vulnerable to traditional adversarial examples that cause misclassification. This work explores the opposite: creating inputs that are visually different but maintain the same classification, revealing broader distribution of adversarial examples in sample space.Method: Proposed negative adversarial example algorithms including negative iterative fast gradient sign method (NI-FGSM), negative iterative fast gradient method (NI-FGM), and their momentum variants (NMI-FGSM, NMI-FGM).
Result: Successfully generated adversarial examples that are substantially different from original inputs but yield identical model outputs, demonstrating that adversarial examples are extensively distributed throughout the sample space, not just near original examples.
Conclusion: Adversarial examples exist broadly across the sample space, not just in local neighborhoods of training data, and the proposed negative adversarial examples could be used to attack ML systems in certain scenarios.
Abstract: Most machine learning models are vulnerable to adversarial examples, which poses security concerns on these models. Adversarial examples are crafted by applying subtle but intentionally worst-case modifications to examples from the dataset, leading the model to output a different answer from the original example. In this paper, adversarial examples are formed in an exactly opposite manner, which are significantly different from the original examples but result in the same answer. We propose a novel set of algorithms to produce such adversarial examples, including the negative iterative fast gradient sign method (NI-FGSM) and the negative iterative fast gradient method (NI-FGM), along with their momentum variants: the negative momentum iterative fast gradient sign method (NMI-FGSM) and the negative momentum iterative fast gradient method (NMI-FGM). Adversarial examples constructed by these methods could be used to perform an attack on machine learning systems in certain occasions. Moreover, our results show that the adversarial examples are not merely distributed in the neighbourhood of the examples from the dataset; instead, they are distributed extensively in the sample space.
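A sketch of the flavour of these methods, reconstructed from the abstract rather than taken from the paper: iteratively push the input away from the original (maximising L2 distance) and keep only the steps that leave the predicted class unchanged. The random start, step size, and keep-if-unchanged rule are assumptions.

```python
import torch

def negative_adversarial(model, x, steps: int = 50, alpha: float = 0.02):
    """Grow a perturbation that moves far from x while keeping the prediction."""
    model.eval()
    with torch.no_grad():
        label = model(x).argmax(dim=1)
    x_adv = (x + 1e-3 * torch.randn_like(x)).clamp(0, 1)     # random start
    for _ in range(steps):
        x_try = x_adv.detach().requires_grad_(True)
        ((x_try - x) ** 2).sum().backward()                   # move away from the original
        candidate = (x_try + alpha * x_try.grad.sign()).clamp(0, 1).detach()
        with torch.no_grad():
            same = model(candidate).argmax(dim=1) == label    # constraint: same output
        x_adv = torch.where(same.view(-1, 1, 1, 1), candidate, x_adv)
    return x_adv

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 8 * 8, 10))
x = torch.rand(4, 3, 8, 8)
x_far = negative_adversarial(model, x)
print(((x_far - x) ** 2).flatten(1).sum(1))   # distances grown, labels preserved
```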
[339] A Markov Decision Process for Variable Selection in Branch & Bound
Paul Strang, Zacharie Alès, Côme Bissuel, Olivier Juan, Safia Kedad-Sidhoum, Emmanuel Rachelson
Main category: cs.LG
TL;DR: The paper introduces BBMDP, a principled MDP formulation for variable selection in Branch and Bound (B&B) algorithms, enabling the use of RL algorithms to learn optimal branching heuristics for MILP problems.
Details
Motivation: Variable selection heuristics significantly impact B&B solver performance in MILP. Recent RL approaches have been adapted to B&B, but lack principled MDP formulations and broad RL algorithm applicability.Method: Proposed BBMDP, a vanilla MDP formulation for variable selection in B&B that allows leveraging a wide range of RL algorithms to learn optimal branching policies.
Result: Computational experiments show the proposed branching agent outperforms prior state-of-the-art RL agents on four standard MILP benchmarks.
Conclusion: BBMDP provides a principled framework for applying RL to B&B variable selection, achieving superior performance compared to existing RL approaches.
Abstract: Mixed-Integer Linear Programming (MILP) is a powerful framework used to address a wide range of NP-hard combinatorial optimization problems, often solved by Branch and Bound (B&B). A key factor influencing the performance of B&B solvers is the variable selection heuristic governing branching decisions. Recent contributions have sought to adapt reinforcement learning (RL) algorithms to the B&B setting to learn optimal branching policies, through Markov Decision Processes (MDP) inspired formulations, and ad hoc convergence theorems and algorithms. In this work, we introduce BBMDP, a principled vanilla MDP formulation for variable selection in B&B, allowing to leverage a broad range of RL algorithms for the purpose of learning optimal B&B heuristics. Computational experiments validate our model empirically, as our branching agent outperforms prior state-of-the-art RL agents on four standard MILP benchmarks.
[340] Scalable LinUCB: Low-Rank Design Matrix Updates for Recommenders with Large Action Spaces
Evgenia Shustova, Marina Sheshukova, Sergey Samsonov, Evgeny Frolov
Main category: cs.LG
TL;DR: Scalable LinUCB algorithm reduces computational and memory costs of linear contextual bandits by using dynamical low-rank parametrization of inverse Cholesky factors, achieving O(dr) complexity per step.
Details
Motivation: Traditional LinUCB has high training, inference, and memory costs that grow with feature dimensionality and action space size, mainly due to updating, inverting, and storing large design matrices.Method: Uses dynamical low-rank parametrization of inverse Cholesky-style factors with numerically stable rank-1 and batched updates, employing projector-splitting integrator for dynamical low-rank approximation.
Result: Achieves average per-step update cost O(dr) and memory O(dr) for approximation rank r, with inference complexity O(dr) per action evaluation.
Conclusion: Experiments on recommender system datasets demonstrate the effectiveness of the Scalable LinUCB algorithm in providing fast and memory-efficient operations.
Abstract: Linear contextual bandits, especially LinUCB, are widely used in recommender systems. However, its training, inference, and memory costs grow with feature dimensionality and the size of the action space. The key bottleneck becomes the need to update, invert and store a design matrix that absorbs contextual information from interaction history. In this paper, we introduce Scalable LinUCB, the algorithm that enables fast and memory efficient operations with the inverse regularized design matrix. We achieve this through a dynamical low-rank parametrization of its inverse Cholesky-style factors. We derive numerically stable rank-1 and batched updates that maintain the inverse without directly forming the entire matrix. To control memory growth, we employ a projector-splitting integrator for dynamical low-rank approximation, yielding average per-step update cost $O(dr)$ and memory $O(dr)$ for approximation rank $r$. Inference complexity of the suggested algorithm is $O(dr)$ per action evaluation. Experiments on recommender system datasets demonstrate the effectiveness of our algorithm.
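For context, the bookkeeping that Scalable LinUCB makes cheap is sketched below in its classical form: standard LinUCB with a Sherman-Morrison rank-1 update of the inverse design matrix, which costs O(d^2) per step and O(d^2) memory. The paper replaces this with dynamical low-rank Cholesky-style factors to reach O(dr); that machinery is not shown here.

```python
import numpy as np

class LinUCB:
    """Classical LinUCB with a Sherman-Morrison rank-1 update of A^{-1}."""
    def __init__(self, d: int, alpha: float = 1.0, reg: float = 1.0):
        self.A_inv = np.eye(d) / reg        # (X^T X + reg*I)^{-1}
        self.b = np.zeros(d)
        self.alpha = alpha

    def ucb(self, x: np.ndarray) -> float:
        theta = self.A_inv @ self.b
        return float(x @ theta + self.alpha * np.sqrt(x @ self.A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)   # Sherman-Morrison, O(d^2)
        self.b += reward * x

bandit = LinUCB(d=8)
x = np.random.default_rng(0).normal(size=8)
print(bandit.ucb(x))
bandit.update(x, reward=1.0)
print(bandit.ucb(x))
```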
[341] FairNet: Dynamic Fairness Correction without Performance Loss via Contrastive Conditional LoRA
Songqi Zhou, Zeyuan Liu, Benben Jiang
Main category: cs.LG
TL;DR: FairNet is a dynamic fairness correction framework that uses bias detection and conditional LoRA to selectively apply fairness corrections only to biased instances, preserving performance on unbiased cases while handling various levels of sensitive attribute labeling.
Details
Motivation: Existing debiasing methods often compromise performance, use static correction strategies, struggle with data sparsity in minority groups, and have suboptimal use of sensitive attributes - either over-relying on complete labeling or ignoring them entirely.Method: Integrates bias detector with conditional LoRA for selective activation of fairness correction only for biased instances. Uses contrastive loss to minimize intra-class representation disparities across sensitive groups and address underfitting in minority groups. Handles complete, partial, or absent sensitive attribute labels.
Result: Theoretical analysis shows FairNet can enhance worst-group performance without diminishing overall performance, potentially yielding slight improvements. Comprehensive empirical evaluations across vision and language benchmarks validate effectiveness.
Conclusion: FairNet provides an effective dynamic fairness correction framework that addresses limitations of existing methods while maintaining performance and flexibility in handling various sensitive attribute labeling scenarios.
Abstract: Ensuring fairness in machine learning models is a critical challenge. Existing debiasing methods often compromise performance, rely on static correction strategies, and struggle with data sparsity, particularly within minority groups. Furthermore, their utilization of sensitive attributes is often suboptimal, either depending excessively on complete attribute labeling or disregarding these attributes entirely. To overcome these limitations, we propose FairNet, a novel framework for dynamic, instance-level fairness correction. FairNet integrates a bias detector with conditional low-rank adaptation (LoRA), which enables selective activation of the fairness correction mechanism exclusively for instances identified as biased, and thereby preserve performance on unbiased instances. A key contribution is a new contrastive loss function for training the LoRA module, specifically designed to minimize intra-class representation disparities across different sensitive groups and effectively address underfitting in minority groups. The FairNet framework can flexibly handle scenarios with complete, partial, or entirely absent sensitive attribute labels. Theoretical analysis confirms that, under moderate TPR/FPR for the bias detector, FairNet can enhance the performance of the worst group without diminishing overall model performance, and potentially yield slight performance improvements. Comprehensive empirical evaluations across diverse vision and language benchmarks validate the effectiveness of FairNet.
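The selective-activation idea, applying a LoRA correction only to instances the bias detector flags, can be sketched as a per-sample gate on a low-rank branch. The rank, the frozen base layer, and the detector output format are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionalLoRALinear(nn.Module):
    """Frozen base linear layer plus a LoRA branch gated per sample."""
    def __init__(self, d_in: int, d_out: int, rank: int = 4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)          # base weight stays fixed
        self.lora_a = nn.Linear(d_in, rank, bias=False)
        self.lora_b = nn.Linear(rank, d_out, bias=False)

    def forward(self, x: torch.Tensor, biased: torch.Tensor) -> torch.Tensor:
        # biased: (batch,) in {0., 1.}, e.g. the bias detector's decision
        correction = self.lora_b(self.lora_a(x))
        return self.base(x) + biased.unsqueeze(-1) * correction

layer = ConditionalLoRALinear(32, 16)
x = torch.randn(4, 32)
flags = torch.tensor([1.0, 0.0, 1.0, 0.0])   # only samples 0 and 2 get corrected
print(layer(x, flags).shape)
```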
[342] ConvXformer: Differentially Private Hybrid ConvNeXt-Transformer for Inertial Navigation
Omer Tariq, Muhammad Bilal, Muneeb Ul Hassan, Dongsoo Han, Jon Crowcroft
Main category: cs.LG
TL;DR: ConvXformer is a hybrid architecture combining ConvNeXt blocks and Transformer encoders for robust inertial navigation with differential privacy protection, achieving 40%+ positioning accuracy improvement while maintaining privacy guarantees.
Details
Motivation: Deep learning-based inertial tracking systems are vulnerable to privacy breaches that expose sensitive training data, and existing differential privacy solutions often compromise model performance with excessive noise in high-frequency inertial measurements.Method: Proposes ConvXformer hybrid architecture fusing ConvNeXt blocks with Transformer encoders, and an efficient differential privacy mechanism with adaptive gradient clipping and gradient-aligned noise injection (GANI), using truncated singular value decomposition for gradient processing.
Result: Achieves more than 40% improvement in positioning accuracy on benchmark datasets (OxIOD, RIDI, RoNIN) while ensuring (ε,δ)-differential privacy guarantees, and demonstrates robustness under severe environmental distortions in real-world testing.
Conclusion: The framework is well-suited for secure and intelligent navigation in cyber-physical systems, providing robust performance under challenging conditions while maintaining strong privacy protection.
Abstract: Data-driven inertial sequence learning has revolutionized navigation in GPS-denied environments, offering superior odometric resolution compared to traditional Bayesian methods. However, deep learning-based inertial tracking systems remain vulnerable to privacy breaches that can expose sensitive training data. Existing differential privacy solutions often compromise model performance by introducing excessive noise, particularly in high-frequency inertial measurements. In this article, we propose ConvXformer, a hybrid architecture that fuses ConvNeXt blocks with Transformer encoders in a hierarchical structure for robust inertial navigation. We propose an efficient differential privacy mechanism incorporating adaptive gradient clipping and gradient-aligned noise injection (GANI) to protect sensitive information while ensuring model performance. Our framework leverages truncated singular value decomposition for gradient processing, enabling precise control over the privacy-utility trade-off. Comprehensive performance evaluations on benchmark datasets (OxIOD, RIDI, RoNIN) demonstrate that ConvXformer surpasses state-of-the-art methods, achieving more than 40% improvement in positioning accuracy while ensuring $(\epsilon,\delta)$-differential privacy guarantees. To validate real-world performance, we introduce the Mech-IO dataset, collected from the mechanical engineering building at KAIST, where intense magnetic fields from industrial equipment induce significant sensor perturbations. This demonstrated robustness under severe environmental distortions makes our framework well-suited for secure and intelligent navigation in cyber-physical systems.
[343] Neural Variational Dropout Processes
Insu Jeon, Youngjin Park, Gunhee Kim
Main category: cs.LG
TL;DR: NVDPs is a Bayesian meta-learning method that uses task-specific dropout with a low-rank Bernoulli model for efficient few-shot learning adaptation.
Details
Motivation: To develop a robust meta-learning approach that can effectively infer conditional posterior distributions for quick adaptation to new tasks with limited data.Method: Uses Neural Variational Dropout Processes with task-specific dropout, low-rank Bernoulli experts meta-model, and amortized variational inference with novel prior conditioned on task data.
Result: NVDPs showed excellent performance in few-shot learning tasks including 1D stochastic regression, image inpainting, and classification, outperforming other meta-learning approaches.
Conclusion: NVDPs provide robust approximation of task-specific dropout rates that handle functional ambiguities and uncertainties, enabling efficient multi-task few-shot learning.
Abstract: Learning to infer the conditional posterior model is a key step for robust meta-learning. This paper presents a new Bayesian meta-learning approach called Neural Variational Dropout Processes (NVDPs). NVDPs model the conditional posterior distribution based on a task-specific dropout; a low-rank product of Bernoulli experts meta-model is utilized for a memory-efficient mapping of dropout rates from a few observed contexts. It allows for a quick reconfiguration of a globally learned and shared neural network for new tasks in multi-task few-shot learning. In addition, NVDPs utilize a novel prior conditioned on the whole task data to optimize the conditional \textit{dropout} posterior in the amortized variational inference. Surprisingly, this enables the robust approximation of task-specific dropout rates that can deal with a wide range of functional ambiguities and uncertainties. We compared the proposed method with other meta-learning approaches in the few-shot learning tasks such as 1D stochastic regression, image inpainting, and classification. The results show the excellent performance of NVDPs.
[344] Optimization Benchmark for Diffusion Models on Dynamical Systems
Fabian Schaipp
Main category: cs.LG
TL;DR: Benchmarking optimization algorithms for diffusion model training shows Muon and SOAP outperform AdamW with 18% lower final loss, while examining learning-rate schedules and Adam vs SGD performance gaps.
Details
Motivation: The training of diffusion models is often overlooked in optimization technique evaluations, creating a need to benchmark recent algorithms specifically for diffusion model training.Method: Benchmark recent optimization algorithms by training a diffusion model for denoising flow trajectories, comparing Muon, SOAP, AdamW, and examining learning-rate schedules and Adam vs SGD performance.
Result: Muon and SOAP are highly efficient alternatives to AdamW, achieving 18% lower final loss. The study also reveals insights about learning-rate schedule impacts and the performance gap between Adam and SGD.
Conclusion: Recent optimization algorithms like Muon and SOAP offer significant improvements over AdamW for diffusion model training, and understanding training dynamics through learning-rate schedules and optimizer comparisons is crucial for effective diffusion model development.
Abstract: The training of diffusion models is often absent in the evaluation of new optimization techniques. In this work, we benchmark recent optimization algorithms for training a diffusion model for denoising flow trajectories. We observe that Muon and SOAP are highly efficient alternatives to AdamW (18% lower final loss). We also revisit several recent phenomena related to the training of models for text or image applications in the context of diffusion model training. This includes the impact of the learning-rate schedule on the training dynamics, and the performance gap between Adam and SGD.
[345] LMFD: Latent Monotonic Feature Discovery
Guus Toussaint, Arno Knobbe
Main category: cs.LG
TL;DR: The paper proposes a method to extract monotonic proxies for system ‘age’ from multivariate time series data by combining sensors with low individual monotonicity into latent features with high monotonicity.
Details
Motivation: Many systems age or degrade over time, but available sensors may not directly provide this 'age' information. The goal is to discover functions of available sensors that can serve as proxies for this latent aging process.Method: Uses a carefully defined grammar to generate candidate equations, then optimizes them for monotonicity using absolute Spearman’s Rank Correlation between time and candidate formulas. The approach assesses and fits candidate features based on monotonicity.
Result: The method successfully combines sensors with low individual monotonicity into features with high monotonicity. In one real-world case, sensors with Spearman’s ρ values of 0.13 and 0.09 were combined into a proxy with ρ of 0.95.
Conclusion: The proposed method can find interpretable equations that serve as effective proxies for system ‘age’, demonstrating the ability to extract meaningful aging information from multivariate sensor data.
Abstract: Many systems in our world age, degrade or otherwise move slowly but steadily in a certain direction. When monitoring such systems by means of sensors, one often assumes that some form of 'age' is latently present in the data, but perhaps the available sensors do not readily provide this useful information. The task that we study in this paper is to extract potential proxies for this 'age' from the available multi-variate time series without having clear data on what 'age' actually is. We argue that when we find a sensor, or more likely some discovered function of the available sensors, that is sufficiently monotonic, that function can act as the proxy we are searching for. Using a carefully defined grammar and optimising the resulting equations in terms of monotonicity, defined as the absolute Spearman's Rank Correlation between time and the candidate formula, the proposed approach generates a set of candidate features which are then fitted and assessed on monotonicity. The proposed system is evaluated against an artificially generated dataset and two real-world datasets. In all experiments, we show that the system is able to combine sensors with low individual monotonicity into latent features with high monotonicity. For the real-world dataset of InfraWatch, a structural health monitoring project, we show that two features with individual absolute Spearman's $\rho$ values of $0.13$ and $0.09$ can be combined into a proxy with an absolute Spearman's $\rho$ of $0.95$. This demonstrates that our proposed method can find interpretable equations which can serve as a proxy for the 'age' of the system.
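The monotonicity criterion itself is straightforward to reproduce: score a candidate feature by the absolute Spearman rank correlation with time. The toy sensors below share a disturbance that cancels when they are added, mimicking how low-monotonicity sensors can combine into a highly monotonic proxy; the formulas are arbitrary examples, not ones produced by the paper's grammar.

```python
import numpy as np
from scipy.stats import spearmanr

def monotonicity(feature: np.ndarray) -> float:
    rho, _ = spearmanr(np.arange(feature.size), feature)
    return abs(rho)

rng = np.random.default_rng(1)
t = np.arange(1000)
noise = rng.normal(scale=15.0, size=t.size)   # shared disturbance
s1 = 0.005 * t + noise                        # weakly monotonic on its own
s2 = 0.005 * t - noise                        # also weak, opposite disturbance

print(monotonicity(s1), monotonicity(s2))     # low individual monotonicity
print(monotonicity(s1 + s2))                  # disturbance cancels: close to 1.0
```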
[346] Learning Noise-Resilient and Transferable Graph-Text Alignment via Dynamic Quality Assessment
Yuhang Liu, Minglai Shao, Zengyi Wo, Yunlong Chu, Bing Hao, Shengzhong Liu, Ruijie Wang, Jianxin Li
Main category: cs.LG
TL;DR: ADAligner is a dynamic graph-text alignment framework that adaptively switches between many-to-many and one-to-one alignment strategies based on real-time supervision quality assessment, achieving superior performance and robustness in graph foundation model pre-training.
Details
Motivation: Existing graph-text aligners assume strict one-to-one correspondences and use static alignment objectives, which fail to handle real-world many-to-many relations and are brittle under noisy supervision, creating a dilemma between expressiveness and noise robustness.Method: ADAligner dynamically estimates batch-level alignment reliability and adapts optimization accordingly - using soft, subgraph-level many-to-many alignment when supervision is clean, and emphasizing reliable one-to-one alignment by filtering low-confidence pairs under noise.
Result: ADAligner consistently outperforms prior methods on zero-/few-shot node classification, link prediction and cross-modal retrieval across nine datasets, maintains strong robustness under noise, and accelerates pre-training by 2-3x compared to baselines.
Conclusion: ADAligner establishes a scalable and reliable foundation for graph-text representation learning by providing a theoretically stable dynamic alignment mechanism that balances expressiveness and robustness in real-world web environments.
Abstract: Pre-training Graph Foundation Models (GFMs) on text-attributed graphs (TAGs) is central to web-scale applications such as search, recommendation, and knowledge discovery. However, existing CLIP-style graph-text aligners face two key limitations: they assume strict one-to-one correspondences between nodes and texts, overlooking the inherent many-to-many relations in real-world graphs; and they rely on static alignment objectives that cannot adapt to varying data quality, making them brittle under noisy supervision. Together, these limitations expose a core dilemma: embracing expressive many-to-many alignment amplifies noise, while reverting to strict one-to-one strategies sacrifices semantic diversity and fails to handle inherently mismatched pairs. To address these challenges, we propose ADAligner, a dynamic, quality-aware graph-text alignment framework that dynamically adjusts between expressive many-to-many and conservative one-to-one objectives according to supervision quality. ADAligner estimates batch-level alignment reliability in real time and adapts its optimization accordingly, promoting soft, subgraph-level many-to-many alignment when supervision is clean, while emphasizing reliable one-to-one alignment by dynamically filtering low-confidence pairs under noise. Theoretically, we prove that this dynamic mechanism forms a stable negative feedback process, ensuring convergence and robustness. Comprehensive experiments on nine diverse TAG datasets demonstrate that ADAligner consistently outperforms prior graph-text aligners on zero-/few-shot node classification, link prediction and cross-modal retrieval tasks. It maintains strong robustness under noisy supervision and accelerates pre-training by approximately 2 to 3 times compared to multimodal baselines, establishing a scalable and reliable foundation for graph-text representation learning in real-world web environments.
[347] A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring
Julian Schulz
Main category: cs.LG
TL;DR: A roadmap for AI safety cases using chain-of-thought monitoring to detect dangerous capabilities, addressing threats like neuralese and encoded reasoning, with evaluation of faithfulness techniques and prediction markets for feasibility assessment.
Details
Motivation: As AI systems approach dangerous capability levels where inability safety cases become insufficient, alternative approaches are needed to ensure safety.Method: Proposes a two-part safety case: (1) models lack dangerous capabilities without CoT, (2) dangerous capabilities enabled by CoT are detectable by monitoring. Examines threats to monitorability (neuralese, encoded reasoning) and evaluates techniques for maintaining CoT faithfulness.
Result: Systematically categorizes threats into three forms (linguistic drift, steganography, alien reasoning) and analyzes their drivers. Explores extracting monitorable CoT from non-monitorable reasoning.
Conclusion: Establishes prediction markets to aggregate forecasts on technical milestones influencing the feasibility of CoT monitoring safety cases.
Abstract: As AI systems approach dangerous capability levels where inability safety cases become insufficient, we need alternative approaches to ensure safety. This paper presents a roadmap for constructing safety cases based on chain-of-thought (CoT) monitoring in reasoning models and outlines our research agenda. We argue that CoT monitoring might support both control and trustworthiness safety cases. We propose a two-part safety case: (1) establishing that models lack dangerous capabilities when operating without their CoT, and (2) ensuring that any dangerous capabilities enabled by a CoT are detectable by CoT monitoring. We systematically examine two threats to monitorability: neuralese and encoded reasoning, which we categorize into three forms (linguistic drift, steganography, and alien reasoning) and analyze their potential drivers. We evaluate existing and novel techniques for maintaining CoT faithfulness. For cases where models produce non-monitorable reasoning, we explore the possibility of extracting a monitorable CoT from a non-monitorable CoT. To assess the viability of CoT monitoring safety cases, we establish prediction markets to aggregate forecasts on key technical milestones influencing their feasibility.
[348] CPSVD: Enhancing Large Language Model Compression via Column-Preserving Singular Value Decomposition
Lin Xv, Jingsheng Gao, Xian Gao, Ting Li, Yuzhuo Fu
Main category: cs.LG
TL;DR: CPSVD is a novel SVD-based LLM compression method that preserves high-error columns directly and applies SVD only to low-error columns, with adaptive non-uniform compression rates across modules.
Details
Motivation: Existing SVD-based LLM compression methods treat parameter matrices uniformly, ignoring that SVD approximation errors vary significantly across different matrix parts, leading to suboptimal compression.Method: CPSVD segments parameter matrices, preserves columns with high decomposition errors directly, applies SVD only to columns with low errors, and adaptively allocates non-uniform compression rates to different modules while maintaining target layer-wise compression ratios.
Result: Extensive experiments show CPSVD consistently outperforms state-of-the-art SVD-based LLM compression methods, achieving lower perplexity and higher accuracy on zero-shot tasks.
Conclusion: CPSVD provides an effective approach for LLM compression by addressing the heterogeneity in SVD approximation errors and adaptively optimizing compression strategies across different matrix components.
Abstract: The rapid advancement of Large Language Models (LLMs) faces a critical bottleneck in their immense size, necessitating efficient compression techniques. While Singular Value Decomposition (SVD) is a promising approach, existing SVD-based methods treat the entire parameter matrix uniformly, overlooking that SVD approximation errors vary significantly across different matrix parts, which often leads to suboptimal compression. To address this, we propose Column-Preserving Singular Value Decomposition (CPSVD), a novel method that refines SVD-based LLM compression by intelligently segmenting the parameter matrix. Unlike traditional SVD, CPSVD identifies and directly preserves matrix columns with high decomposition errors, applying SVD only to columns with low decomposition errors, while precisely determining the optimal balance point between these two strategies to minimize error. Furthermore, leveraging the inherent heterogeneity in decomposition errors across different matrices within an LLM, CPSVD adaptively allocates non-uniform compression rates to modules within that layer, while adhering to a target layer-wise compression ratio, thereby further enhancing compression performance. Extensive experiments demonstrate that CPSVD consistently outperforms state-of-the-art SVD-based LLM compression methods, achieving lower perplexity and higher accuracy on zero-shot tasks.
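The column-preserving idea can be sketched in a few lines of NumPy: measure how badly a low-rank SVD approximates each column, keep the worst columns verbatim, and factor only the rest. The rank and the number of preserved columns are fixed here for illustration, whereas the paper determines the balance point and per-module rates adaptively.

```python
import numpy as np

def cpsvd_like(W: np.ndarray, rank: int, n_keep: int) -> np.ndarray:
    """Keep the n_keep columns that a rank-`rank` SVD approximates worst,
    and low-rank factor only the remaining columns."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    W_lr = U[:, :rank] * S[:rank] @ Vt[:rank]            # plain low-rank approximation
    col_err = np.linalg.norm(W - W_lr, axis=0)           # per-column error
    keep = np.argsort(col_err)[-n_keep:]                 # high-error columns: preserve
    rest = np.setdiff1d(np.arange(W.shape[1]), keep)
    U2, S2, Vt2 = np.linalg.svd(W[:, rest], full_matrices=False)
    W_hat = np.empty_like(W)
    W_hat[:, keep] = W[:, keep]
    W_hat[:, rest] = U2[:, :rank] * S2[:rank] @ Vt2[:rank]
    return W_hat

W = np.random.default_rng(0).normal(size=(256, 512))
rel_err = np.linalg.norm(W - cpsvd_like(W, rank=64, n_keep=32)) / np.linalg.norm(W)
print(rel_err)
```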
[349] Graph Unlearning Meets Influence-aware Negative Preference Optimization
Qiang Chen, Zhongze Wu, Ang He, Xi Lin, Shuo Jiang, Shan You, Chang Xu, Yi Chen, Xiu Su
Main category: cs.LG
TL;DR: INPO is an influence-aware negative preference optimization framework for graph unlearning that slows divergence speed and improves model utility robustness by focusing on high-influence edges and using topological entropy loss.
Details
Motivation: Current graph unlearning models using gradient ascent on forget sets cause drastic degradation in model utility due to rapid divergence speed during unlearning.Method: Proposes influence-aware message function to amplify influence of unlearned edges and mitigate topological coupling between forget and retain sets. Uses removal-based method for quick edge influence estimation and topological entropy loss to avoid excessive local structure information loss.
Result: Extensive experiments on five real-world datasets show INPO achieves state-of-the-art performance on all forget quality metrics while maintaining model utility.
Conclusion: INPO framework effectively addresses the divergence speed problem in graph unlearning and maintains model utility while achieving high forget quality.
Abstract: Recent advancements in graph unlearning models have enhanced model utility by preserving the node representation essentially invariant, while using gradient ascent on the forget set to achieve unlearning. However, this approach causes a drastic degradation in model utility during the unlearning process due to the rapid divergence speed of gradient ascent. In this paper, we introduce INPO, an Influence-aware Negative Preference Optimization framework that focuses on slowing the divergence speed and improving the robustness of the model utility to the unlearning process. Specifically, we first analyze that NPO has slower divergence speed and theoretically propose that unlearning high-influence edges can reduce the impact of unlearning. We design an influence-aware message function to amplify the influence of unlearned edges and mitigate the tight topological coupling between the forget set and the retain set. The influence of each edge is quickly estimated by a removal-based method. Additionally, we propose a topological entropy loss from the perspective of topology to avoid excessive information loss in the local structure during unlearning. Extensive experiments conducted on five real-world datasets demonstrate that the INPO-based model achieves state-of-the-art performance on all forget quality metrics while maintaining the model’s utility. Code is available at https://github.com/sh-qiangchen/INPO.
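The NPO objective that INPO builds on can be sketched as below, written with softplus for numerical stability; this is the generic negative-preference loss on forget-set targets as it is commonly formulated, and it omits INPO's influence-aware message function and topological entropy loss.

```python
import torch
import torch.nn.functional as F

def npo_loss(logp_theta: torch.Tensor, logp_ref: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    """Negative Preference Optimization loss on forget-set targets.
    logp_* are log-probabilities of the forgotten targets under the current
    and reference models; log(1 + (p_theta/p_ref)**beta) == softplus(beta * diff)."""
    return (2.0 / beta) * F.softplus(beta * (logp_theta - logp_ref)).mean()

# toy batch of three forget-set items (e.g. edges to be unlearned)
print(npo_loss(torch.tensor([-5.0, -2.0, -8.0]), torch.tensor([-4.0, -3.0, -7.5])).item())
```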
[350] ARA: Adaptive Rank Allocation for Efficient Large Language Model SVD Compression
Lin Xv, Jingsheng Gao, Xian Gao, Ting Liu, Yuzhuo Fu
Main category: cs.LG
TL;DR: Proposes Adaptive Rank Allocation (ARA) method for optimal rank allocation in SVD-based LLM compression, addressing limitations of existing heuristic and mask-based approaches.
Details
Motivation: Existing SVD compression methods for LLMs suffer from limitations: heuristic algorithms explore restricted solution spaces, mask-based training inefficiently captures relationships between singular value spectra and parameters, and current methods overlook the non-smooth gain function property at compression ratio 1, leading to suboptimal local minima.Method: ARA introduces: (1) dedicated mask design for efficient mapping between retained ranks and trainable parameters, and (2) additional loss function to guide parameter selection toward globally optimal solutions.
Result: On LLaMA2-7B with 80% compression ratio, ARA reduces perplexity on WikiText2 from 8.38 to 6.42 and improves average zero-shot task accuracy by 9.72 percentage points compared to uniform compression.
Conclusion: ARA achieves state-of-the-art performance in rank allocation for SVD-based LLM compression, demonstrating effectiveness through significant improvements in perplexity and zero-shot accuracy.
Abstract: In the field of large language model (LLM) compression, singular value decomposition (SVD) is a widely studied and adopted low-rank decomposition technique. Since SVD operates exclusively on linear modules, and these modules in LLMs are separated by nonlinear components, SVD can only be applied independently to each linear module. Under a global compression ratio constraint, determining the appropriate rank for different linear modules becomes a critical problem. Existing approaches, such as heuristic algorithms and mask-based training, have made progress in addressing this challenge. However, these methods still suffer from several limitations: heuristic algorithms explore the solution space within restricted regions, while mask-based training struggles to efficiently capture the relationship between singular value spectra and trainable parameters. More importantly, current methods overlook the key property that the gain function is non-smooth at a compression ratio of 1, which often leads the training process to suboptimal local minima. To address these issues, we propose an Adaptive Rank Allocation (ARA) method. Specifically, (1) ARA introduces a dedicated mask design that enables efficient mapping and updating between retained ranks and trainable parameters; and (2) it employs an additional loss function to guide parameter selection toward globally optimal solutions. Experimental results demonstrate that ARA achieves state-of-the-art performance. On the LLaMA2-7B model with a 80% compression ratio, ARA reduces perplexity on WikiText2 from 8.38 to 6.42 and improves average zero-shot task accuracy by 9.72 percentage points compared with uniform compression. These results highlight the effectiveness of our method for rank allocation in SVD-based LLM compression.
[351] Iterative Training of Physics-Informed Neural Networks with Fourier-enhanced Features
Yulun Wu, Miguel Aguiar, Karl H. Johansson, Matthieu Barreau
Main category: cs.LG
TL;DR: IFeF-PINN: An iterative training algorithm for physics-informed neural networks that uses Random Fourier Features to overcome spectral bias and better approximate high-frequency PDEs.
Details
Motivation: To address spectral bias in PINNs, where neural networks tend to learn low-frequency features first, limiting their ability to solve high-frequency PDE problems.Method: Proposes iterative training with Fourier-enhanced features using Random Fourier Features to enrich the latent space. Creates a two-stage training: (i) estimate basis in feature space, (ii) perform regression for coefficients of enhanced basis functions.
Result: Shows convexity for linear models and proves convergence of iterative training. Empirical evidence demonstrates enhanced expressive capacity for high-frequency PDE approximation. Extensive numerical evaluation shows superior performance over state-of-the-art methods.
Conclusion: IFeF-PINN effectively overcomes spectral bias in PINNs through Fourier feature enhancement, enabling accurate approximation of high-frequency PDEs with improved performance across the frequency domain.
Abstract: Spectral bias, the tendency of neural networks to learn low-frequency features first, is a well-known issue with many training algorithms for physics-informed neural networks (PINNs). To overcome this issue, we propose IFeF-PINN, an algorithm for iterative training of PINNs with Fourier-enhanced features. The key idea is to enrich the latent space using high-frequency components through Random Fourier Features. This creates a two-stage training problem: (i) estimate a basis in the feature space, and (ii) perform regression to determine the coefficients of the enhanced basis functions. For an underlying linear model, it is shown that the latter problem is convex, and we prove that the iterative training scheme converges. Furthermore, we empirically establish that Random Fourier Features enhance the expressive capacity of the network, enabling accurate approximation of high-frequency PDEs. Through extensive numerical evaluation on classical benchmark problems, the superior performance of our method over state-of-the-art algorithms is shown, and the improved approximation across the frequency domain is illustrated.
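A minimal sketch of the two-stage idea, under the assumption of a plain regression target rather than a PDE residual: Random Fourier Features supply the high-frequency basis, and stage (ii) reduces to a convex ridge problem for the outer coefficients. The bandwidth, feature count, and 1-D toy target are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# sample random frequencies and phases once; these define the Fourier basis
n_feat, bandwidth = 256, 10.0
W = rng.normal(scale=bandwidth, size=(1, n_feat))    # 1-D input here
b = rng.uniform(0.0, 2.0 * np.pi, size=n_feat)

def rff_features(x: np.ndarray) -> np.ndarray:
    """x: (n, 1) inputs -> (n, 2*n_feat) Random Fourier Features."""
    z = x @ W + b
    return np.concatenate([np.cos(z), np.sin(z)], axis=1)

def fit_coefficients(Phi: np.ndarray, y: np.ndarray, lam: float = 1e-6) -> np.ndarray:
    # stage (ii): a convex ridge problem for the coefficients of the enhanced basis
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

# toy usage: regress a high-frequency 1-D target that plain MLPs learn slowly
x = np.linspace(0.0, 1.0, 400)[:, None]
y = np.sin(40.0 * np.pi * x[:, 0])
coef = fit_coefficients(rff_features(x), y)
```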
[352] LLM Unlearning with LLM Beliefs
Kemou Li, Qizhou Wang, Yue Wang, Fengpeng Li, Jun Liu, Bo Han, Jiantao Zhou
Main category: cs.LG
TL;DR: The paper identifies a ‘squeezing effect’ in current unlearning methods where probability mass shifts to semantically related rephrasings of target content, leading to spurious unlearning. It proposes a bootstrapping framework that suppresses both target responses and model beliefs to achieve more thorough forgetting.
Details
Motivation: Current unlearning methods using gradient ascent have a critical side effect where they redistribute probability mass to semantically related rephrasings of target content, resulting in only spurious unlearning that automated metrics fail to detect properly.Method: Proposes a bootstrapping (BS) framework that links the squeezing effect with the model’s own high-confidence generations (model beliefs). BS-T suppresses high-probability tokens while BS-S removes entire high-confidence generations, jointly suppressing both target responses and model beliefs.
Result: Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of the approach in achieving more thorough forgetting while preserving utility.
Conclusion: The bootstrapping framework directly counters the squeezing effect by incorporating model beliefs into the unlearning objective, enabling more comprehensive removal of sensitive content without merely shifting it to semantically equivalent forms.
Abstract: Large language models trained on vast corpora inherently risk memorizing sensitive or harmful content, which may later resurface in their outputs. Prevailing unlearning methods generally rely on gradient ascent and its variants to lower the probability of specific target responses. However, we find that this strategy induces a critical side effect: probability mass is redistributed into high-likelihood regions, often corresponding to semantically related rephrasings of the targets. We refer to this as the squeezing effect, which explains why many methods yield merely spurious unlearning, a problem further obscured by automated metrics (e.g., ROUGE, truth ratio) that misreport actual success. To address this, we propose a bootstrapping (BS) framework that explicitly links the squeezing effect with the model’s own high-confidence generations, namely its model beliefs. Since model beliefs inherently capture the very high-likelihood regions where probability mass is squeezed, incorporating them into the unlearning objective directly counters the squeezing effect. By jointly suppressing both target responses and model beliefs, BS-T (token) attenuates high-probability tokens, whereas BS-S (sequence) removes entire high-confidence generations, together achieving more thorough forgetting while preserving utility. Extensive experiments across diverse benchmarks with various model families confirm the effectiveness of our approach.
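A hedged, token-level sketch of how a BS-T-style objective might look. The probability threshold for what counts as a "belief" and the equal weighting of the two terms are assumptions; the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def bs_t_loss(logits: torch.Tensor, target_ids: torch.Tensor,
              belief_prob_threshold: float = 0.3) -> torch.Tensor:
    """
    logits:     (seq_len, vocab) model outputs at the forget-prompt positions
    target_ids: (seq_len,) tokens of the response to be unlearned
    Minimizing the returned value lowers the probability of the target tokens
    and of any other token the model currently assigns high confidence to.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()

    # (1) suppress the target response (the usual gradient-ascent term)
    target_term = log_probs.gather(-1, target_ids[:, None]).squeeze(-1).mean()

    # (2) suppress high-confidence alternatives, i.e. the model's "beliefs",
    #     excluding the target token itself
    belief_mask = probs > belief_prob_threshold
    belief_mask[torch.arange(target_ids.numel()), target_ids] = False
    belief_term = (log_probs * belief_mask).sum() / belief_mask.sum().clamp(min=1)

    return target_term + belief_term
```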
[353] From Prototypes to Sparse ECG Explanations: SHAP-Driven Counterfactuals for Multivariate Time-Series Multi-class Classification
Maciej Mozolewski, Betül Bayrak, Kerstin Bach, Grzegorz J. Nalepa
Main category: cs.LG
TL;DR: A prototype-driven framework for generating sparse counterfactual explanations for 12-lead ECG classification models that modifies only 78% of original signals while maintaining 81.3% validity across classes and achieving 43% improvement in temporal stability.
Details
Motivation: Addressing explainability challenges of state-of-the-art models in time series XAI, particularly for actionable insights in healthcare domains like ECG classification.Method: Uses SHAP-based thresholds to identify critical signal segments, converts them to interval rules, employs DTW and medoid clustering for prototype extraction, and aligns prototypes to query R-peaks for coherence.
Result: Achieves 81.3% validity across all classes with class-specific performance ranging from 98.9% for MI to 13.2% for HYP detection. Enables near realtime generation (<1 second) of clinically valid counterfactuals.
Conclusion: Establishes design principles for physiologically-aware counterfactual explanations in AI-based diagnosis systems and outlines pathways toward user-controlled explanation interfaces for clinical deployment.
Abstract: In eXplainable Artificial Intelligence (XAI), instance-based explanations for time series have gained increasing attention due to their potential for actionable and interpretable insights in domains such as healthcare. Addressing the challenges of explainability of state-of-the-art models, we propose a prototype-driven framework for generating sparse counterfactual explanations tailored to 12-lead ECG classification models. Our method employs SHAP-based thresholds to identify critical signal segments and convert them into interval rules, uses Dynamic Time Warping (DTW) and medoid clustering to extract representative prototypes, and aligns these prototypes to query R-peaks for coherence with the sample being explained. The framework generates counterfactuals that modify only 78% of the original signal while maintaining 81.3% validity across all classes and achieving 43% improvement in temporal stability. We evaluate three variants of our approach, Original, Sparse, and Aligned Sparse, with class-specific performance ranging from 98.9% validity for myocardial infarction (MI) to challenges with hypertrophy (HYP) detection (13.2%). This approach supports near realtime generation (< 1 second) of clinically valid counterfactuals and provides a foundation for interactive explanation platforms. Our findings establish design principles for physiologically-aware counterfactual explanations in AI-based diagnosis systems and outline pathways toward user-controlled explanation interfaces for clinical deployment.
[354] Revisiting the Relation Between Robustness and Universality
M. Klabunde, L. Caspari, F. Lemmerich
Main category: cs.LG
TL;DR: The modified universality hypothesis suggests adversarially robust models become highly similar, but this study finds only partial universality in specific settings, not consistent across datasets.
Details
Motivation: To test the generality of the modified universality hypothesis proposed by Jones et al. (2022) which claims adversarially robust models trained for the same task become highly similar.Method: Revisiting the hypothesis by testing representational similarity across different datasets, examining predictive behavior convergence with increasing robustness, and analyzing where differing predictions originate.
Result: Verified high representational similarity in specific settings but found inconsistent results across datasets. Predictive behavior does not converge with increasing robustness. Differing predictions originate in classification layer, but more universal predictive behavior can be achieved with simple retraining of classifiers.
Conclusion: The work points towards partial universality of neural networks in specific settings and away from notions of strict universality, suggesting the universality hypothesis has limited generality.
Abstract: The modified universality hypothesis proposed by Jones et al. (2022) suggests that adversarially robust models trained for a given task are highly similar. We revisit the hypothesis and test its generality. While we verify Jones’ main claim of high representational similarity in specific settings, results are not consistent across different datasets. We also discover that predictive behavior does not converge with increasing robustness and thus is not universal. We find that differing predictions originate in the classification layer, but show that more universal predictive behavior can be achieved with simple retraining of the classifiers. Overall, our work points towards partial universality of neural networks in specific settings and away from notions of strict universality.
[355] Optimizing the Unknown: Black Box Bayesian Optimization with Energy-Based Model and Reinforcement Learning
Ruiyao Miao, Junren Xiao, Shiya Tsang, Hui Xiong, Yingnian Wu
Main category: cs.LG
TL;DR: REBMBO integrates Gaussian Processes with Energy-Based Models and uses PPO for adaptive multi-step lookahead to overcome one-step bias in Bayesian Optimization.
Details
Motivation: Traditional BO methods suffer from one-step bias leading to local optima convergence and poor performance in complex tasks, while BBO has shown success in costly function evaluation scenarios.Method: Combines GP for local guidance with EBM for global structure, defines BO iterations as MDP, and uses PPO for adaptive multi-step lookahead to dynamically adjust exploration depth and direction.
Result: Extensive experiments on synthetic and real-world benchmarks confirm superior performance, with additional analyses showing adaptability and robustness across various GP configurations.
Conclusion: REBMBO effectively overcomes limitations of traditional BO methods through its integrated approach and adaptive exploration strategy.
Abstract: Existing Bayesian Optimization (BO) methods typically balance exploration and exploitation to optimize costly objective functions. However, these methods often suffer from a significant one-step bias, which may lead to convergence towards local optima and poor performance in complex or high-dimensional tasks. Recently, Black-Box Optimization (BBO) has achieved success across various scientific and engineering domains, particularly when function evaluations are costly and gradients are unavailable. Motivated by this, we propose the Reinforced Energy-Based Model for Bayesian Optimization (REBMBO), which integrates Gaussian Processes (GP) for local guidance with an Energy-Based Model (EBM) to capture global structural information. Notably, we define each Bayesian Optimization iteration as a Markov Decision Process (MDP) and use Proximal Policy Optimization (PPO) for adaptive multi-step lookahead, dynamically adjusting the depth and direction of exploration to effectively overcome the limitations of traditional BO methods. We conduct extensive experiments on synthetic and real-world benchmarks, confirming the superior performance of REBMBO. Additional analyses across various GP configurations further highlight its adaptability and robustness.
[356] g-DPO: Scalable Preference Optimization for Protein Language Models
Constance Ferragu, Jonathan D. Ziegler, Nicolas Deutschmann, Arthur Lindoulsi, Eli Bixby, Cradle ML Team
Main category: cs.LG
TL;DR: g-DPO is a scalable framework that accelerates Direct Preference Optimization for protein language models by clustering sequences and using group-based approximations, achieving similar performance with 1.8-3.7x faster convergence.
Details
Motivation: Standard DPO faces scalability issues as training pairs grow quadratically with labeled sequences, leading to prohibitive training times for protein engineering datasets.Method: Uses sequence space clustering to prune redundant pairs and amortizes likelihood computations with group-based approximations.
Result: Maintains in-silico and in-vitro performance statistically indistinguishable from standard DPO while converging 1.8 to 3.7 times faster across three protein engineering tasks.
Conclusion: g-DPO provides an effective solution to DPO’s scalability bottleneck, with greater efficiency gains expected for larger datasets.
Abstract: Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with the number of labeled sequences, leading to prohibitive training times even for modestly sized datasets. We introduce g-DPO, a framework that (i) uses sequence space clustering to prune redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group-based approximations. Across three protein engineering tasks, g-DPO maintains in-silico and in-vitro performance that is statistically indistinguishable from standard DPO, while converging 1.8 to 3.7 times faster, with greater gains expected as the size of the dataset increases.
[357] Insights into the Unknown: Federated Data Diversity Analysis on Molecular Data
Markus Bujotzek, Evelyn Trautmann, Calum Hand, Ian Hales
Main category: cs.LG
TL;DR: The paper investigates federated clustering methods for analyzing distributed molecular data in pharmaceutical drug discovery, benchmarking three approaches against centralized methods using both standard and chemistry-informed metrics.
Details
Motivation: AI methods in drug discovery rely on public datasets lacking scale and diversity of proprietary pharmaceutical data. Federated learning enables privacy-preserving collaboration but complicates data-centric tasks like estimating dataset diversity and understanding chemical space structure.Method: Benchmarked three federated clustering approaches (Federated kMeans, Federated PCA + Federated kMeans, and Federated Locality-Sensitive Hashing) against centralized counterparts on eight diverse molecular datasets using both standard mathematical and novel chemistry-informed SF-ICF metrics.
Result: Large-scale benchmarking with in-depth explainability analysis showed the importance of incorporating domain knowledge through chemistry-informed metrics and on-client explainability analyses for federated diversity analysis on molecular data.
Conclusion: Federated clustering methods can effectively disentangle and represent distributed molecular data, with domain-specific metrics and explainability being crucial for meaningful analysis in pharmaceutical applications.
Abstract: AI methods are increasingly shaping pharmaceutical drug discovery. However, their translation to industrial applications remains limited due to their reliance on public datasets, which lack the scale and diversity of proprietary pharmaceutical data. Federated learning (FL) offers a promising approach to integrate private data into privacy-preserving, collaborative model training across data silos. This federated data access complicates important data-centric tasks such as estimating dataset diversity, performing informed data splits, and understanding the structure of the combined chemical space. To address this gap, we investigate how well federated clustering methods can disentangle and represent distributed molecular data. We benchmark three approaches, Federated kMeans (Fed-kMeans), Federated Principal Component Analysis combined with Fed-kMeans (Fed-PCA+Fed-kMeans), and Federated Locality-Sensitive Hashing (Fed-LSH), against their centralized counterparts on eight diverse molecular datasets. Our evaluation utilizes both standard mathematical metrics and a chemistry-informed evaluation metric, SF-ICF, which we introduce in this work. The large-scale benchmarking, combined with an in-depth explainability analysis, shows the importance of incorporating domain knowledge through chemistry-informed metrics and on-client explainability analyses for federated diversity analysis on molecular data.
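For orientation, one round of the generic federated k-means recipe looks roughly as follows. This is a sketch of the standard protocol, in which clients exchange only per-cluster sums and counts rather than raw molecular fingerprints; it is not the paper's exact implementation or its SF-ICF evaluation.

```python
import numpy as np

def client_update(local_X: np.ndarray, centroids: np.ndarray):
    """Assign local points to the shared centroids and return sufficient statistics."""
    assign = np.argmin(((local_X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
    k, d = centroids.shape
    sums, counts = np.zeros((k, d)), np.zeros(k)
    for j in range(k):
        members = local_X[assign == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)
    return sums, counts

def server_aggregate(client_stats, old_centroids: np.ndarray) -> np.ndarray:
    """Merge per-client statistics into new global centroids."""
    total_sums = sum(s for s, _ in client_stats)
    total_counts = sum(c for _, c in client_stats)
    new = old_centroids.copy()
    nonempty = total_counts > 0
    new[nonempty] = total_sums[nonempty] / total_counts[nonempty, None]
    return new

# usage: two clients, shared initial centroids
rng = np.random.default_rng(0)
centroids = rng.normal(size=(4, 16))
clients = [rng.normal(size=(100, 16)), rng.normal(loc=2.0, size=(80, 16))]
stats = [client_update(X, centroids) for X in clients]
centroids = server_aggregate(stats, centroids)
```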
[358] ELUTQ: Efficient LUT-Aware Quantization for Deploying Large Language Models on Edge Devices
Xin Nie, Liang Dong, HaiCheng Zhang, JiaWang Xiao, G. Sun
Main category: cs.LG
TL;DR: ELUTQ is an efficient quantization framework for deploying LLMs on CPU-based edge devices, featuring Hierarchical Linear Quantization (HLQ) that reduces memory usage and latency while eliminating dequantization overhead.
Details
Motivation: Enable LLM deployment on CPU-based edge devices with limited memory and computational resources by addressing the limitations of uniform quantization which poorly fits weight distributions and incurs high dequantization overhead at low bit widths.Method: Proposes Hierarchical Linear Quantization (HLQ) format that captures weight statistical characteristics without increasing computational cost of Bit-serial LUT-based GEMM operations. Provides optimized CPU kernels for end-to-end inference and integrates with existing quantization pipelines.
Result: For LLaMA3-8B, HLQ reduces perplexity by ~8% at 3-bit and 85% at 2-bit precision under post-training quantization (completed within 1 hour). With finetuning, further improves 2-bit performance within 2 hours. Achieves over 25 tokens/s on Apple M2 chip for 2-bit LLaMA2-7B.
Conclusion: ELUTQ with HLQ format effectively enables efficient LLM deployment on edge devices by significantly reducing memory consumption and latency while maintaining model quality, making on-device AI more accessible.
Abstract: The deployment of Large Language Models (LLMs) on CPU-based edge devices is crucial for enabling on-device intelligence and expanding AI accessibility. However, it remains challenging due to limited memory and computational resources. During edge inference, memory usage and latency are the primary bottlenecks. Although weight quantization can effectively reduce memory consumption, existing hardware-friendly approaches often rely on uniform quantization, which poorly fits weight distributions and incurs high dequantization overhead at low bit widths. To address these limitations, we propose ELUTQ, an efficient quantization framework introducing a novel quantization format, Hierarchical Linear Quantization (HLQ). HLQ better captures the statistical characteristics of weights without increasing the computational cost of Bit-serial LUT-based GEMM operations, thereby eliminating dequantization overhead. It is orthogonal to existing quantization algorithms and can be seamlessly integrated into various quantization pipelines. For efficient on-device deployment, ELUTQ provides optimized CPU kernels for end-to-end inference. Experiments show that for LLaMA3-8B, HLQ reduces perplexity by about 8% at 3-bit and 85% at 2-bit precision under post-training quantization, completing quantization within one hour. With efficient finetuning, HLQ further improves 2-bit performance within two hours. In terms of inference efficiency, our 2-bit LLaMA2-7B achieves over 25 tokens/s on an Apple M2 chip (4 threads, batch size = 1).
[359] Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation
Chenyu Wang, Zhanglu Yan, Zhi Zhou, Xu Chen, Weng-Fai Wong
Main category: cs.LG
TL;DR: SpikeQuant is a brain-inspired quantization method that converts LLM activations to binary spikes, enabling mixed-precision storage and energy-efficient computation by replacing MAC operations with temporal accumulation.
Details
Motivation: Address three key challenges in LLM quantization: (1) MAC operations still dominate energy consumption, (2) dequantization adds extra overhead, and (3) uniform bit widths clip salient values while mixed precision is impractical on current hardware.Method: Selectively applies mixed-precision quantization to activations with salient values, re-encodes them into binary spike counts, and embeds quantization scale into IF mechanism threshold to avoid explicit dequantization.
Result: Achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods.
Conclusion: SpikeQuant provides an effective solution for accurate and energy-efficient LLM deployment by leveraging SNN principles to overcome traditional quantization limitations.
Abstract: In the era of large language models (LLMs), weight-activation quantization helps fit models on edge devices by reducing memory and compute bit-widths. However, three challenges persist for energy-constrained hardware: (1) even after quantization, multiply-accumulate (MAC) operations remain unavoidable and continue to dominate energy consumption; (2) dequantization (or per-tensor/channel rescaling) introduces extra arithmetic and data movement, increasing latency and energy; (3) uniform parameter bit widths clip salient values, while intra-channel mixed precision is generally impractical on current matrix hardware and memory. In contrast, brain-inspired Spiking Neural Networks (SNNs), owing to their binary spike-based information representation and the Integrate-and-Fire (IF) paradigm, naturally support mixed-precision storage and energy-efficient computation by replacing complex MACs with temporal Accumulate (ACCs). Motivated by this property, we propose SpikeQuant, which selectively applies mixed-precision quantization to activations with salient values and re-encodes them into binary spike counts, thereby enabling dynamic mixed storage of different bitwidths. Furthermore, by embedding the quantization scale into the threshold of the IF mechanism, our approach performs energy-efficient linear transformations on weights and activations while avoiding explicit dequantization. Experimental results demonstrate that SpikeQuant consistently achieves near-FP16 perplexity under W4A4 quantization while reducing energy cost by up to 4.6 times compared to existing methods, highlighting its effectiveness for accurate and energy-efficient LLM deployment.
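A toy illustration of the threshold-as-scale idea: an activation is turned into a spike count by an integrate-and-fire rule whose threshold acts as the quantization scale, so a dot product can be recovered from accumulations alone. This is my reading of the mechanism with made-up values, not the released kernels.

```python
import numpy as np

def if_encode(activation: float, threshold: float, timesteps: int) -> int:
    """Integrate-and-fire: count how often the membrane crosses the threshold."""
    membrane, spikes = 0.0, 0
    for _ in range(timesteps):
        membrane += activation
        if membrane >= threshold:
            membrane -= threshold
            spikes += 1
    return spikes

def spiking_dot(weights: np.ndarray, activations: np.ndarray,
                threshold: float, timesteps: int) -> float:
    """Approximate w.a using spike counts: a_i is roughly spikes_i * threshold / T."""
    acc = 0.0
    for w_i, a_i in zip(weights, activations):
        spikes = if_encode(a_i, threshold, timesteps)
        acc += w_i * spikes          # in hardware: `spikes` repeated additions of w_i
    return acc * threshold / timesteps

a = np.array([0.8, 0.1, 0.5])
w = np.array([1.0, -2.0, 0.5])
print(spiking_dot(w, a, threshold=1.0, timesteps=16), "vs", float(w @ a))
```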
[360] Teaming LLMs to Detect and Mitigate Hallucinations
Demian Till, John Smeaton, Peter Haubrick, Gouse Saheb, Florian Graef, David Berman
Main category: cs.LG
TL;DR: Extending single-model consistency methods to combine responses from multiple LLMs with different training data, training schemes and model architectures improves hallucination detection and mitigation while reducing inference costs.
Details
Motivation: To address LLM hallucinations stemming from imperfect training data limitations like biases and under-representation, and to overcome the limitations of single-model consistency approaches.Method: Proposed consortium consistency approach that aggregates responses from multiple LLMs with different training data, training schemes, and model architectures, evaluated across teams from a pool of 15 LLMs.
Result: Substantial improvements in hallucination detection and mitigation capabilities beyond single-model consistency methods, with performance improvements often accompanied by reduced inference costs.
Conclusion: Multi-model consortium consistency is an effective approach for enhancing LLM hallucination detection and mitigation while addressing the inference cost drawback of single-model methods.
Abstract: Recent work has demonstrated state-of-the-art results in large language model (LLM) hallucination detection and mitigation through consistency-based approaches which involve aggregating multiple responses sampled from a single LLM for a given prompt. These approaches help offset limitations stemming from the imperfect data on which LLMs are trained, which includes biases and under-representation of information required at deployment time among other limitations which can lead to hallucinations. We show that extending these single-model consistency methods to combine responses from multiple LLMs with different training data, training schemes and model architectures can result in substantial further improvements in hallucination detection and mitigation capabilities beyond their single-model consistency counterparts. We evaluate this \emph{consortium consistency} approach across many model teams from a pool of 15 LLMs and explore under what conditions it is beneficial to team together different LLMs in this manner. Further, we show that these performance improvements often come with reduced inference costs, offsetting a significant drawback with single-model consistency methods.
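A schematic of the consortium idea with stub model callables (any real API client would be substituted): sample several answers from each LLM, score pairwise agreement, and treat low agreement as a hallucination flag. Exact-match agreement is a deliberate simplification; the paper's consistency measures may be richer.

```python
from collections import Counter
from itertools import combinations

def normalize(ans: str) -> str:
    return " ".join(ans.lower().split())

def consortium_consistency(prompt: str, models, samples_per_model: int = 3):
    """models: list of hypothetical callables prompt -> answer string."""
    answers = [normalize(m(prompt)) for m in models for _ in range(samples_per_model)]
    # pairwise exact-match agreement; embedding similarity would be a drop-in upgrade
    pairs = list(combinations(answers, 2))
    agreement = sum(a == b for a, b in pairs) / max(len(pairs), 1)
    consensus = Counter(answers).most_common(1)[0][0]
    return consensus, agreement      # low agreement -> likely hallucination

# usage with stub models
fake_models = [lambda p: "Paris", lambda p: "Paris", lambda p: "Lyon"]
answer, score = consortium_consistency("Capital of France?", fake_models, samples_per_model=1)
```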
[361] Bi-Level Decision-Focused Causal Learning for Large-Scale Marketing Optimization: Bridging Observational and Experimental Data
Shuli Zhang, Hao Zhou, Jiaqi Zheng, Guibin Jiang, Bing Cheng, Wei Lin, Guihai Chen
Main category: cs.LG
TL;DR: Bi-level Decision-Focused Causal Learning (Bi-DFCL) addresses prediction-decision misalignment and bias-variance dilemma in marketing optimization by jointly leveraging observational and experimental data through bi-level optimization.
Details
Motivation: Traditional two-stage ML-OR pipelines suffer from prediction-decision misalignment (ML focuses on accuracy without considering optimization objectives) and bias-variance dilemma (observational data has biases, experimental data is scarce/high-variance).Method: Develops unbiased estimator of OR decision quality using experimental data, creates surrogate loss functions to bridge discrete optimization gradients, and establishes bi-level optimization framework solved via implicit differentiation to jointly leverage both data types.
Result: Extensive evaluations on public benchmarks, industrial datasets, and large-scale online A/B tests show statistically significant improvements over state-of-the-art methods. Successfully deployed at Meituan.
Conclusion: Bi-DFCL effectively addresses fundamental challenges in marketing optimization by enabling unbiased OR estimators to correct learning from biased observational data, achieving optimal bias-variance tradeoff and improving decision quality.
Abstract: Online Internet platforms require sophisticated marketing strategies to optimize user retention and platform revenue – a classical resource allocation problem. Traditional solutions adopt a two-stage pipeline: machine learning (ML) for predicting individual treatment effects to marketing actions, followed by operations research (OR) optimization for decision-making. This paradigm presents two fundamental technical challenges. First, the prediction-decision misalignment: Conventional ML methods focus solely on prediction accuracy without considering downstream optimization objectives, leading to improved predictive metrics that fail to translate to better decisions. Second, the bias-variance dilemma: Observational data suffers from multiple biases (e.g., selection bias, position bias), while experimental data (e.g., randomized controlled trials), though unbiased, is typically scarce and costly – resulting in high-variance estimates. We propose Bi-level Decision-Focused Causal Learning (Bi-DFCL) that systematically addresses these challenges. First, we develop an unbiased estimator of OR decision quality using experimental data, which guides ML model training through surrogate loss functions that bridge discrete optimization gradients. Second, we establish a bi-level optimization framework that jointly leverages observational and experimental data, solved via implicit differentiation. This novel formulation enables our unbiased OR estimator to correct learning directions from biased observational data, achieving optimal bias-variance tradeoff. Extensive evaluations on public benchmarks, industrial marketing datasets, and large-scale online A/B tests demonstrate the effectiveness of Bi-DFCL, showing statistically significant improvements over state-of-the-art. Currently, Bi-DFCL has been deployed at Meituan, one of the largest online food delivery platforms in the world.
[362] The Confusing Instance Principle for Online Linear Quadratic Control
Waris Radji, Odalric-Ambrym Maillard
Main category: cs.LG
TL;DR: MED-LQ: A model-based reinforcement learning method for linear quadratic control that uses the Confusing Instance principle and extends Minimum Empirical Divergence algorithms to achieve competitive performance in various control scenarios.
Details
Motivation: Traditional methods like Optimism in the Face of Uncertainty and Thompson Sampling have practical limitations in controlling linear systems with quadratic cost under unknown dynamics. The authors aim to develop a more effective alternative.Method: Leverages the Confusing Instance principle and Minimum Empirical Divergence algorithms, combined with LQR policy structure, sensitivity analysis, and stability analysis to develop MED-LQ control strategy.
Result: MED-LQ achieves competitive performance across various control scenarios in comprehensive benchmarks and demonstrates potential for broader applications in large-scale Markov Decision Processes.
Conclusion: The MED-LQ framework successfully extends CI and MED principles beyond small-scale settings, offering a promising approach for model-based reinforcement learning in linear quadratic control problems.
Abstract: We revisit the problem of controlling linear systems with quadratic cost under unknown dynamics with model-based reinforcement learning. Traditional methods like Optimism in the Face of Uncertainty and Thompson Sampling, rooted in multi-armed bandits (MABs), face practical limitations. In contrast, we propose an alternative based on the Confusing Instance (CI) principle, which underpins regret lower bounds in MABs and discrete Markov Decision Processes (MDPs) and is central to the Minimum Empirical Divergence (MED) family of algorithms, known for their asymptotic optimality in various settings. By leveraging the structure of LQR policies along with sensitivity and stability analysis, we develop MED-LQ. This novel control strategy extends the principles of CI and MED beyond small-scale settings. Our benchmarks on a comprehensive control suite demonstrate that MED-LQ achieves competitive performance in various scenarios while highlighting its potential for broader applications in large-scale MDPs.
[363] Study of Training Dynamics for Memory-Constrained Fine-Tuning
Aël Quélennec, Nour Hezbri, Pavlo Mozharovskyi, Van-Tam Nguyen, Enzo Tartaglione
Main category: cs.LG
TL;DR: TraDy is a memory-efficient transfer learning method that uses dynamic stochastic channel selection in preselected layers to achieve high sparsity and computational savings.
Details
Motivation: Memory-efficient training is crucial as models grow larger while deployment environments have strict resource constraints.Method: Uses dynamic channel selection that stochastically resamples channels between epochs within preselected layers, leveraging architecture-dependent layer importance and superior gradient approximation.
Result: Achieves state-of-the-art performance across various tasks and architectures with up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
Conclusion: TraDy effectively enables memory-efficient training while maintaining performance through dynamic stochastic channel selection.
Abstract: Memory-efficient training of deep neural networks has become increasingly important as models grow larger while deployment environments impose strict resource constraints. We propose TraDy, a novel transfer learning scheme leveraging two key insights: layer importance for updates is architecture-dependent and determinable a priori, while dynamic stochastic channel selection provides superior gradient approximation compared to static approaches. We introduce a dynamic channel selection approach that stochastically resamples channels between epochs within preselected layers. Extensive experiments demonstrate TraDy achieves state-of-the-art performance across various downstream tasks and architectures while maintaining strict memory constraints, achieving up to 99% activation sparsity, 95% weight derivative sparsity, and 97% reduction in FLOPs for weight derivative computation.
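A small sketch of dynamic stochastic channel selection as described above, with hypothetical function names: a fresh random subset of output channels is drawn each epoch within the preselected layers, and gradients outside that subset are zeroed before the optimizer step.

```python
import torch

def resample_channel_masks(layers, keep_fraction: float = 0.05):
    """Draw a new random channel subset per preselected layer (call once per epoch)."""
    masks = []
    for layer in layers:                          # e.g. preselected nn.Conv2d layers
        n_out = layer.weight.shape[0]
        n_keep = max(1, int(keep_fraction * n_out))
        idx = torch.randperm(n_out)[:n_keep]
        mask = torch.zeros(n_out, dtype=torch.bool)
        mask[idx] = True
        masks.append(mask)
    return masks

def mask_gradients(layers, masks):
    """Call after loss.backward() and before optimizer.step()."""
    for layer, mask in zip(layers, masks):
        if layer.weight.grad is not None:
            layer.weight.grad[~mask] = 0.0        # only the sampled channels update
        if layer.bias is not None and layer.bias.grad is not None:
            layer.bias.grad[~mask] = 0.0
```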
[364] A Climate-Aware Deep Learning Framework for Generalizable Epidemic Forecasting
Jinpyo Hong, Rachel E. Baker
Main category: cs.LG
TL;DR: ForecastNet-XCL is a hybrid deep learning ensemble model that combines XGBoost, CNN, and BiLSTM to create accurate multi-week RSV forecasts up to 100 weeks ahead using climate and temporal data without real-time surveillance.
Details
Motivation: To address the gap in using ML models for endemic disease forecasting, particularly for RSV, as current methods are underexplored despite the success of ML in COVID-19 outbreak prediction.Method: A hybrid framework combining XGBoost+CNN+BiLSTM with high-resolution feature learning, long-range temporal dependency capturing, and an autoregressive module trained on climate-controlled lagged relations, using stochastic inference for probabilistic intervals.
Result: Outperformed statistical baselines, individual neural networks, and conventional ensemble methods across 34 U.S. states in both within- and cross-state scenarios, maintaining accuracy over extended forecast horizons and showing enhanced generalization in locations with irregular RSV patterns.
Conclusion: ForecastNet-XCL is an efficient, high-performing, uncertainty-aware early-warning tool suitable for deployment amid increasing climate pressures and limited surveillance resources.
Abstract: Precise outbreak forecasting of infectious diseases is essential for effective public health responses and epidemic control. The increased availability of machine learning (ML) methods for time-series forecasting presents an enticing avenue to enhance outbreak forecasting. Though the COVID-19 outbreak demonstrated the value of applying ML models to predict epidemic profiles, using ML models to forecast endemic diseases remains underexplored. In this work, we present ForecastNet-XCL (an ensemble model based on XGBoost+CNN+BiLSTM), a deep learning hybrid framework designed to address this gap by creating accurate multi-week RSV forecasts up to 100 weeks in advance based on climate and temporal data, without access to real-time surveillance on RSV. The framework combines high-resolution feature learning with long-range temporal dependency capturing mechanisms, bolstered by an autoregressive module trained on climate-controlled lagged relations. Stochastic inference returns probabilistic intervals to inform decision-making. Evaluated across 34 U.S. states, ForecastNet-XCL reliably outperformed statistical baselines, individual neural nets, and conventional ensemble methods in both within- and cross-state scenarios, sustaining accuracy over extended forecast horizons. Training on climatologically diverse datasets further enhanced generalization, particularly in locations with irregular or biennial RSV patterns. ForecastNet-XCL’s efficiency, performance, and uncertainty-aware design make it a deployable early-warning tool amid escalating climate pressures and constrained surveillance resources.
[365] Learning and Simulating Building Evacuation Patterns for Enhanced Safety Design Using Generative Models
Jin Han, Zhe Zheng, Yi Gu, Jia-Rui Lin, Xin-Zheng Lu
Main category: cs.LG
TL;DR: DiffEvac uses diffusion models to learn building evacuation patterns from simulated heatmaps, achieving faster and more accurate evacuation simulation than traditional methods.
Details
Motivation: Traditional evacuation simulation requires refined modeling with extensive parameters, making it unsuitable for rapid iteration in early design stages.Method: Proposes a diffusion model with decoupled feature representation of layouts and occupant density to learn evacuation patterns from simulated evacuation heatmaps.
Result: Achieves 37.6% improvement in SSIM, 142% in PSNR, and delivers results 16 times faster (2 minutes vs traditional methods).
Conclusion: The method lowers modeling burden, enables large-scale what-if exploration, and facilitates coupling with multi-objective design tools for intelligent building safety optimization.
Abstract: Evacuation simulation is essential for building safety design, ensuring properly planned evacuation routes. However, traditional evacuation simulation relies heavily on refined modeling with extensive parameters, making it challenging to adopt such methods in a rapid iteration process in early design stages. Thus, this study proposes DiffEvac, a novel method to learn building evacuation patterns based on Generative Models (GMs), for efficient evacuation simulation and enhanced safety design. Initially, a dataset of 399 diverse functional layouts and corresponding evacuation heatmaps of buildings was established. Then, a decoupled feature representation is proposed to embed physical features like layouts and occupant density for GMs. Finally, a diffusion model based on image prompts is proposed to learn evacuation patterns from simulated evacuation heatmaps. Compared to existing research using Conditional GANs with RGB representation, DiffEvac achieves up to a 37.6% improvement in SSIM, 142% in PSNR, and delivers results 16 times faster, thereby cutting simulation time to 2 minutes. Case studies further demonstrate that the proposed method not only significantly enhances the rapid design iteration and adjustment process with efficient evacuation simulation but also offers new insights and technical pathways for future safety optimization in intelligent building design. The research implication is that the approach lowers the modeling burden, enables large-scale what-if exploration, and facilitates coupling with multi-objective design tools.
[366] Matrix-Free Least Squares Solvers: Values, Gradients, and What to Do With Them
Hrittik Roy, Søren Hauberg, Nicholas Krämer
Main category: cs.LG
TL;DR: The paper demonstrates that least squares can be transformed into a differentiable operator for diverse ML applications beyond linear model fitting.
Details
Motivation: To unlock the unfulfilled potential of least squares in modern machine learning by making it a differentiable operator that can be integrated into neural networks.Method: Derive custom gradients to transform least squares solver into a differentiable operator, enabling its use as a neural network layer.
Result: Successfully applied to: (i) enforcing weight sparsity on 50M parameter models, (ii) imposing conservativeness in score-based generative models, and (iii) hyperparameter tuning of Gaussian processes.
Conclusion: This work advances differentiable linear-algebra tools and makes them accessible to ML practitioners, representing the next iteration in developing such computational building blocks.
Abstract: This paper argues that the method of least squares has significant unfulfilled potential in modern machine learning, far beyond merely being a tool for fitting linear models. To release its potential, we derive custom gradients that transform the solver into a differentiable operator, like a neural network layer, enabling many diverse applications. Empirically, we demonstrate: (i) scalability by enforcing weight sparsity on a 50 million parameter model; (ii) imposing conservativeness constraints in score-based generative models; and (iii) hyperparameter tuning of Gaussian processes based on predictive performance. By doing this, our work represents the next iteration in developing differentiable linear-algebra tools and making them widely accessible to machine learning practitioners.
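The "least squares as a differentiable operator" idea can be sketched with a custom autograd function whose backward applies the normal equations via implicit differentiation. For brevity this toy version differentiates only with respect to the right-hand side and uses a dense solver, whereas the paper targets matrix-free solvers; it is an illustration, not the authors' code.

```python
import torch

class LstsqLayer(torch.autograd.Function):
    @staticmethod
    def forward(ctx, A: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        x = torch.linalg.lstsq(A, b).solution     # x* = argmin ||Ax - b||^2
        ctx.save_for_backward(A)
        return x

    @staticmethod
    def backward(ctx, grad_x: torch.Tensor):
        (A,) = ctx.saved_tensors
        # implicit differentiation of the normal equations A^T A x = A^T b gives
        #   dL/db = A (A^T A)^{-1} dL/dx   (A is treated as a constant here)
        y = torch.linalg.solve(A.T @ A, grad_x)
        return None, A @ y

A = torch.randn(50, 8)
b = torch.randn(50, 1, requires_grad=True)
x = LstsqLayer.apply(A, b)
x.sum().backward()                                # gradients flow into b
```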
[367] Enabling Granular Subgroup Level Model Evaluations by Generating Synthetic Medical Time Series
Mahmoud Ibrahim, Bart Elen, Chang Sun, Gökhan Ertaylan, Michel Dumontier
Main category: cs.LG
TL;DR: A framework using synthetic ICU time-series data to train and evaluate predictive models, with Enhanced TimeAutoDiff reducing the gap between real-on-synthetic and real-on-real evaluation by over 70% while maintaining training utility.
Details
Motivation: To enable trustworthy, granular model evaluation in critical care without exposing sensitive EHR data, allowing robust performance analysis across diverse patient populations.Method: Enhanced TimeAutoDiff, which augments latent diffusion objective with distribution-alignment penalties, building on prior diffusion and VAE-based generators.
Result: Reduced TRTS gap by over 70% (Δ_TRTS ≤ 0.014 AUROC), preserved training utility (Δ_TSTR ≈ 0.01), and cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets.
Conclusion: Provides a practical, privacy-preserving roadmap for trustworthy model evaluation in critical care, contributing to Medical AI trustworthiness.
Abstract: We present a novel framework for leveraging synthetic ICU time-series data not only to train but also to rigorously and trustworthily evaluate predictive models, both at the population level and within fine-grained demographic subgroups. Building on prior diffusion and VAE-based generators (TimeDiff, HealthGen, TimeAutoDiff), we introduce \textit{Enhanced TimeAutoDiff}, which augments the latent diffusion objective with distribution-alignment penalties. We extensively benchmark all models on MIMIC-III and eICU, on 24-hour mortality and binary length-of-stay tasks. Our results show that Enhanced TimeAutoDiff reduces the gap between real-on-synthetic and real-on-real evaluation (the "TRTS gap") by over 70%, achieving $\Delta_{TRTS} \leq 0.014$ AUROC, while preserving training utility ($\Delta_{TSTR} \approx 0.01$). Crucially, for 32 intersectional subgroups, large synthetic cohorts cut subgroup-level AUROC estimation error by up to 50% relative to small real test sets, and outperform them in 72–84% of subgroups. This work provides a practical, privacy-preserving roadmap for trustworthy, granular model evaluation in critical care, enabling robust and reliable performance analysis across diverse patient populations without exposing sensitive EHR data, contributing to the overall trustworthiness of Medical AI.
[368] Latent Space Factorization in LoRA
Shashi Kumar, Yacouba Kaloga, John Mitros, Petr Motlicek, Ina Kodrasi
Main category: cs.LG
TL;DR: FVAE-LoRA improves standard LoRA by using a VAE to learn two factorized latent spaces - one for task-salient features and another for residual information, enhancing performance and robustness.
Details
Motivation: Existing LoRA variants lack mechanisms to explicitly disambiguate task-relevant information within the learned low-rank subspace, potentially limiting downstream performance.Method: Propose Factorized Variational Autoencoder LoRA (FVAE-LoRA) which leverages a VAE to learn two distinct latent spaces with a novel Evidence Lower Bound formulation that explicitly promotes factorization between latent spaces.
Result: Extensive experiments on text, audio, and image tasks demonstrate that FVAE-LoRA consistently outperforms standard LoRA. Spurious correlation evaluations confirm better isolation of task-relevant signals and improved robustness under distribution shifts.
Conclusion: FVAE-LoRA effectively addresses the limitation of standard LoRA by explicitly factorizing task-salient and residual information, leading to superior performance and robustness across multiple domains.
Abstract: Low-rank adaptation (LoRA) is a widely used method for parameter-efficient finetuning. However, existing LoRA variants lack mechanisms to explicitly disambiguate task-relevant information within the learned low-rank subspace, potentially limiting downstream performance. We propose Factorized Variational Autoencoder LoRA (FVAE-LoRA), which leverages a VAE to learn two distinct latent spaces. Our novel Evidence Lower Bound formulation explicitly promotes factorization between the latent spaces, dedicating one latent space to task-salient features and the other to residual information. Extensive experiments on text, audio, and image tasks demonstrate that FVAE-LoRA consistently outperforms standard LoRA. Moreover, spurious correlation evaluations confirm that FVAE-LoRA better isolates task-relevant signals, leading to improved robustness under distribution shifts. Our code is publicly available at: https://github.com/idiap/FVAE-LoRA
[369] Overlap-weighted orthogonal meta-learner for treatment effect estimation over time
Konstantin Hess, Dennis Frauen, Mihaela van der Schaar, Stefan Feuerriegel
Main category: cs.LG
TL;DR: A novel overlap-weighted orthogonal meta-learner for estimating heterogeneous treatment effects in time-varying settings that addresses severe overlap problems by targeting high-probability treatment sequences.
Details
Motivation: Existing meta-learners for time-varying HTE estimation suffer from exploding variance when treatment overlap is low, as probabilities of observing certain treatment sequences decrease exponentially with longer horizons.Method: Developed a Neyman-orthogonal population risk function that minimizes overlap-weighted oracle risk, targeting regions with high probability of receiving interventional treatment sequences. The method is model-agnostic and robust to nuisance function misspecification.
Result: The WO-learner demonstrates improved stability and reliability in HTE estimation compared to existing methods, as shown through extensive experiments with transformer and LSTM backbones.
Conclusion: The proposed overlap-weighted orthogonal meta-learner provides a fully data-driven approach that effectively counteracts instabilities in existing methods and offers more reliable heterogeneous treatment effect estimates in time-varying settings with low overlap.
Abstract: Estimating heterogeneous treatment effects (HTEs) in time-varying settings is particularly challenging, as the probability of observing certain treatment sequences decreases exponentially with longer prediction horizons. Thus, the observed data contain little support for many plausible treatment sequences, which creates severe overlap problems. Existing meta-learners for the time-varying setting typically assume adequate treatment overlap, and thus suffer from exploding estimation variance when the overlap is low. To address this problem, we introduce a novel overlap-weighted orthogonal (WO) meta-learner for estimating HTEs that targets regions in the observed data with high probability of receiving the interventional treatment sequences. This offers a fully data-driven approach through which our WO-learner can counteract instabilities as in existing meta-learners and thus obtain more reliable HTE estimates. Methodologically, we develop a novel Neyman-orthogonal population risk function that minimizes the overlap-weighted oracle risk. We show that our WO-learner has the favorable property of Neyman-orthogonality, meaning that it is robust against misspecification in the nuisance functions. Further, our WO-learner is fully model-agnostic and can be applied to any machine learning model. Through extensive experiments with both transformer and LSTM backbones, we demonstrate the benefits of our novel WO-learner.
[370] A Survey on Cache Methods in Diffusion Models: Toward Efficient Multi-Modal Generation
Jiacheng Liu, Xinyu Wang, Yuqi Lin, Zhikai Wang, Peiru Wang, Peiliang Cai, Qinming Zhou, Zhengan Yan, Zexuan Yan, Zhengyi Shi, Chang Zou, Yue Ma, Linfeng Zhang
Main category: cs.LG
TL;DR: Diffusion Caching is a training-free, architecture-agnostic acceleration technique that reduces computational overhead in diffusion models by reusing intrinsic computational redundancies through feature-level cross-step reuse and inter-layer scheduling.
Details
Motivation: Diffusion models suffer from prohibitive computational overhead and generation latency due to multi-step iterations and complex backbone networks, creating bottlenecks for real-time applications. Existing acceleration techniques face limitations in applicability, training costs, or quality degradation.Method: Diffusion Caching identifies and reuses intrinsic computational redundancies in the diffusion process through feature-level cross-step reuse and inter-layer scheduling, without modifying model parameters. It evolves from static reuse to dynamic prediction approaches.
Result: The technique reduces computation while maintaining quality, enhances caching flexibility across diverse tasks, and enables integration with other acceleration techniques like sampling optimization and model distillation.
Conclusion: Diffusion Caching represents a promising paradigm for real-time and efficient generative AI, with potential to become a key enabler for future multimodal and interactive applications, advancing both theory and practice of Efficient Generative Intelligence.
Abstract: Diffusion Models have become a cornerstone of modern generative AI for their exceptional generation quality and controllability. However, their inherent \textit{multi-step iterations} and \textit{complex backbone networks} lead to prohibitive computational overhead and generation latency, forming a major bottleneck for real-time applications. Although existing acceleration techniques have made progress, they still face challenges such as limited applicability, high training costs, or quality degradation. Against this backdrop, \textbf{Diffusion Caching} offers a promising training-free, architecture-agnostic, and efficient inference paradigm. Its core mechanism identifies and reuses intrinsic computational redundancies in the diffusion process. By enabling feature-level cross-step reuse and inter-layer scheduling, it reduces computation without modifying model parameters. This paper systematically reviews the theoretical foundations and evolution of Diffusion Caching and proposes a unified framework for its classification and analysis. Through comparative analysis of representative methods, we show that Diffusion Caching evolves from \textit{static reuse} to \textit{dynamic prediction}. This trend enhances caching flexibility across diverse tasks and enables integration with other acceleration techniques such as sampling optimization and model distillation, paving the way for a unified, efficient inference framework for future multimodal and interactive applications. We argue that this paradigm will become a key enabler of real-time and efficient generative AI, injecting new vitality into both theory and practice of \textit{Efficient Generative Intelligence}.
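The cross-step reuse principle reduces, in its simplest static form, to recomputing an expensive block only every few denoising steps and reusing its cached output in between, as in the toy wrapper below. This is a schematic of the general idea, not any particular method covered by the survey.

```python
import torch

class CachedBlock(torch.nn.Module):
    def __init__(self, block: torch.nn.Module, refresh_every: int = 4):
        super().__init__()
        self.block = block
        self.refresh_every = refresh_every
        self._cache = None

    def forward(self, x: torch.Tensor, step: int) -> torch.Tensor:
        if self._cache is None or step % self.refresh_every == 0:
            self._cache = self.block(x)           # full computation on refresh steps
        return self._cache                        # reused feature on cached steps

# usage inside a (heavily simplified) sampling loop
block = CachedBlock(torch.nn.Linear(64, 64), refresh_every=4)
x = torch.randn(1, 64)
for step in range(20):
    feats = block(x, step)                        # recomputed at steps 0, 4, 8, ...
```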
[371] Policy Learning with Abstention
Ayush Sawarni, Jikai Jin, Justin Whitehouse, Vasilis Syrgkanis
Main category: cs.LG
TL;DR: Policy learning with abstention allows deferring to safe defaults when uncertain, achieving O(1/n) regret with known propensities and extending to unknown propensities via doubly robust methods.
Details
Motivation: Most policy learning methods force decisions even when uncertain, which is risky in high-stakes settings like personalized medicine and advertising.Method: Two-stage learner: first identifies near-optimal policies, then constructs abstention rule from their disagreements. Uses doubly robust objective for unknown propensities.
Result: Establishes fast O(1/n)-type regret guarantees with known propensities, extends to unknown propensities via doubly robust methods.
Conclusion: Abstention is a versatile tool that improves guarantees under margin conditions, connects to distributionally robust learning, and enables safe policy improvement over baselines.
Abstract: Policy learning algorithms are widely used in areas such as personalized medicine and advertising to develop individualized treatment regimes. However, most methods force a decision even when predictions are uncertain, which is risky in high-stakes settings. We study policy learning with abstention, where a policy may defer to a safe default or an expert. When a policy abstains, it receives a small additive reward on top of the value of a random guess. We propose a two-stage learner that first identifies a set of near-optimal policies and then constructs an abstention rule from their disagreements. We establish fast O(1/n)-type regret guarantees when propensities are known, and extend these guarantees to the unknown-propensity case via a doubly robust (DR) objective. We further show that abstention is a versatile tool with direct applications to other core problems in policy learning: it yields improved guarantees under margin conditions without the common realizability assumption, connects to distributionally robust policy learning by hedging against small data shifts, and supports safe policy improvement by ensuring improvement over a baseline policy with high probability.
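A schematic of the two-stage construction with toy policies: fit several near-optimal policies, act where they agree, and defer to the safe default where they disagree. The policies and the default action here are placeholders, not part of the paper.

```python
import numpy as np

def abstention_policy(policies, default_action: int):
    """policies: list of callables mapping covariates x -> action (int)."""
    def decide(x):
        actions = [pi(x) for pi in policies]
        if len(set(actions)) == 1:
            return actions[0]          # near-optimal policies agree: act
        return default_action          # disagreement: abstain / defer
    return decide

# usage with two toy threshold policies that disagree near the decision boundary
pi_a = lambda x: int(x[0] > 0.50)
pi_b = lambda x: int(x[0] > 0.55)
decide = abstention_policy([pi_a, pi_b], default_action=-1)
print(decide(np.array([0.52])))        # -1: defer in the ambiguous region
```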
[372] Fast Inference via Hierarchical Speculative Decoding
Amir Globerson, Haim Kaplan, Yishay Mansour, Clara Mohri, Tal Schuster
Main category: cs.LG
TL;DR: Hierarchical Speculative Decoding (HSD) stacks multiple draft models in a hierarchy to reduce transformer inference latency, achieving up to 1.2x speed-up over single-draft baselines.
Details
Motivation: Transformer inference latency scales with output length, and while speculative decoding helps, there exists a spectrum of draft models with different speed-accuracy tradeoffs that could be better utilized.Method: HSD arranges draft models in a hierarchy where each model proposes tokens and the next larger model verifies them in parallel, with the target model providing final verification.
Result: The algorithm enables polynomial-time selection of optimal hierarchies and achieves up to 1.2x speed-up over the best single-draft baseline.
Conclusion: HSD provides a practical approach to further reduce generation latency beyond existing speculative decoding techniques by leveraging hierarchical model arrangements.
Abstract: Transformer language models generate text autoregressively, making inference latency proportional to the number of tokens generated. Speculative decoding reduces this latency without sacrificing output quality by leveraging a small draft model to propose tokens that the larger target model verifies in parallel. In practice, however, there may exist a set of potential draft models, ranging from faster but less accurate to slower yet more reliable. We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks these draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass, until finally the target model verifies tokens. We derive an expression for the expected latency of any such hierarchy and show that selecting the latency-optimal hierarchy can be done in polynomial time. Empirically, HSD gives up to 1.2x speed-up over the best single-draft baseline, demonstrating the practicality of our algorithm in reducing generation latency beyond previous techniques.
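A deliberately simplified, greedy sketch of the hierarchical verification loop with stub models (real implementations verify a whole block in one parallel forward pass and resample rejected positions): the smallest model drafts, each larger model keeps only the prefix it agrees with, and the target model has the final word.

```python
def hsd_step(models, prompt_tokens, block_size=8):
    """models: callables tokens -> next-token id, ordered smallest -> target."""
    draft, *verifiers = models
    # smallest model drafts a block autoregressively
    proposal, ctx = [], list(prompt_tokens)
    for _ in range(block_size):
        t = draft(ctx)
        proposal.append(t)
        ctx.append(t)
    # each larger model truncates the proposal at its first disagreement
    for verify in verifiers:
        kept, ctx = [], list(prompt_tokens)
        for t in proposal:
            if verify(ctx) != t:
                break
            kept.append(t)
            ctx.append(t)
        proposal = kept
        if not proposal:
            break
    # guarantee progress: if everything was rejected, emit one target-model token
    return proposal or [models[-1](list(prompt_tokens))]

# stub usage: three "models" that happen to always agree
small = lambda ctx: len(ctx) % 50
mid = lambda ctx: len(ctx) % 50
target = lambda ctx: len(ctx) % 50
print(hsd_step([small, mid, target], prompt_tokens=[1, 2, 3]))
```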
[373] SEMPO: Lightweight Foundation Models for Time Series Forecasting
Hui He, Kun Yi, Yuanchi Ma, Qi Zhang, Zhendong Niu, Guansong Pang
Main category: cs.LG
TL;DR: SEMPO is a lightweight foundation model for time series forecasting that achieves strong generalization with reduced pre-training data and model size through energy-aware spectral decomposition and mixture-of-prompts transformer.
Details
Motivation: Existing time series foundation models require massive architectures and large-scale pre-training, making them unsuitable for resource-constrained environments. There's a need for more affordable yet versatile models.Method: Two key modules: 1) Energy-aware spectral decomposition that models both high-energy and low-energy frequency signals for better data utilization; 2) Mixture-of-Prompts enabled Transformer that uses dataset-specific prompts and routes tokens to prompt-based experts for parameter-efficient adaptation.
Result: Extensive experiments on 16 datasets show superior performance in both zero-shot and few-shot forecasting compared to state-of-the-art methods, while significantly reducing pre-training data scale and model size.
Conclusion: SEMPO successfully addresses the tension between versatility and affordability in time series foundation models, achieving strong generalization with lightweight architecture and reduced pre-training requirements.
Abstract: The recent boom of large pre-trained models witnesses remarkable success in developing foundation models (FMs) for time series forecasting. Despite impressive performance across diverse downstream forecasting tasks, existing time series FMs possess massive network architectures and require substantial pre-training on large-scale datasets, which significantly hinders their deployment in resource-constrained environments. In response to this growing tension between versatility and affordability, we propose SEMPO, a novel lightweight foundation model that requires pretraining on relatively small-scale data, yet exhibits strong general time series forecasting. Concretely, SEMPO comprises two key modules: 1) energy-aware SpEctral decomposition module, that substantially improves the utilization of pre-training data by modeling not only the high-energy frequency signals but also the low-energy yet informative frequency signals that are ignored in current methods; and 2) Mixture-of-PrOmpts enabled Transformer, that learns heterogeneous temporal patterns through small dataset-specific prompts and adaptively routes time series tokens to prompt-based experts for parameter-efficient model adaptation across different datasets and domains. Equipped with these modules, SEMPO significantly reduces both pre-training data scale and model size, while achieving strong generalization. Extensive experiments on two large-scale benchmarks covering 16 datasets demonstrate the superior performance of SEMPO in both zero-shot and few-shot forecasting scenarios compared with state-of-the-art methods. Code and data are available at https://github.com/mala-lab/SEMPO.
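One plausible form of an energy-aware spectral split, shown as a sketch: FFT the series, rank frequencies by energy, and keep the high-energy and low-energy components as separate signals for downstream modeling. The split fraction and toy series are assumptions, not the paper's configuration.

```python
import numpy as np

def energy_split(series: np.ndarray, high_energy_fraction: float = 0.1):
    spec = np.fft.rfft(series)
    energy = np.abs(spec) ** 2
    k = max(1, int(high_energy_fraction * len(spec)))
    top = np.argsort(energy)[::-1][:k]            # indices of dominant frequencies

    high = np.zeros_like(spec)
    high[top] = spec[top]
    low = spec - high                             # low-energy yet informative part
    return np.fft.irfft(high, n=len(series)), np.fft.irfft(low, n=len(series))

# toy usage: a daily cycle, a weaker weekly cycle, and noise
t = np.arange(512)
x = np.sin(2 * np.pi * t / 24) + 0.1 * np.sin(2 * np.pi * t / 7) + 0.05 * np.random.randn(512)
x_high, x_low = energy_split(x)
```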
[374] Statistical Inference for Linear Functionals of Online Least-squares SGD when $t \gtrsim d^{1+\delta}$
Bhavya Agrawalla, Krishnakumar Balasubramanian, Promit Ghosal
Main category: cs.LG
TL;DR: Establishes non-asymptotic Berry-Esseen bounds for online least-squares SGD, providing Gaussian CLT in growing-dimensional regime with improved computational efficiency over existing methods.
Details
Motivation: Deploying SGD in high-stakes applications requires rigorous uncertainty quantification, but existing high-dimensional inference methods are computationally expensive and restrictive in dimensional scaling.Method: Develops online SGD-based procedure with non-asymptotic Berry-Esseen bounds for linear functionals, plus an online variance estimator for the asymptotic variance in CLT.
Result: Achieves CLT for SGD iterates when t ≳ d^(1+δ) for any δ>0, extending dimensional regime while operating in O(td) time and O(d) memory, compared to O(td²+d³) of existing methods.
Conclusion: Provides first fully online and data-driven framework for constructing confidence intervals for SGD iterates in near-optimal scaling regime t ≳ d^(1+δ).
Abstract: Stochastic Gradient Descent (SGD) has become a cornerstone method in modern data science. However, deploying SGD in high-stakes applications necessitates rigorous quantification of its inherent uncertainty. In this work, we establish \emph{non-asymptotic Berry–Esseen bounds} for linear functionals of online least-squares SGD, thereby providing a Gaussian Central Limit Theorem (CLT) in a \emph{growing-dimensional regime}. Existing approaches to high-dimensional inference for projection parameters, such as~\cite{chang2023inference}, rely on inverting empirical covariance matrices and require at least $t \gtrsim d^{3/2}$ iterations to achieve finite-sample Berry–Esseen guarantees, rendering them computationally expensive and restrictive in the allowable dimensional scaling. In contrast, we show that a CLT holds for SGD iterates when the number of iterations grows as $t \gtrsim d^{1+\delta}$ for any $\delta > 0$, significantly extending the dimensional regime permitted by prior works while improving computational efficiency. The proposed online SGD-based procedure operates in $\mathcal{O}(td)$ time and requires only $\mathcal{O}(d)$ memory, in contrast to the $\mathcal{O}(td^2 + d^3)$ runtime of covariance-inversion methods. To render the theory practically applicable, we further develop an \emph{online variance estimator} for the asymptotic variance appearing in the CLT and establish \emph{high-probability deviation bounds} for this estimator. Collectively, these results yield the first fully online and data-driven framework for constructing confidence intervals for SGD iterates in the near-optimal scaling regime $t \gtrsim d^{1+\delta}$.
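A minimal sketch of the O(td) time / O(d) memory regime the abstract refers to: one pass of online least-squares SGD with Polyak averaging over a data stream. The step-size schedule is an illustrative choice, and the paper's variance estimator and Berry–Esseen analysis are not reproduced here.

```python
import numpy as np

def online_lsq_sgd(stream, d, lr=0.5):
    """One pass of online least-squares SGD; only O(d) state is kept."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)                    # Polyak-averaged iterate
    for t, (x, y) in enumerate(stream, start=1):
        grad = (x @ theta - y) * x             # per-sample squared-loss gradient
        theta -= (lr / np.sqrt(t)) * grad
        theta_bar += (theta - theta_bar) / t   # running average of iterates
    return theta_bar

def make_stream(n, d, theta_star, rng):
    """Toy stream: y = <x, theta*> + noise, yielded one sample at a time."""
    for _ in range(n):
        x = rng.normal(size=d)
        yield x, x @ theta_star + 0.1 * rng.normal()

rng = np.random.default_rng(0)
d = 20
theta_star = rng.normal(size=d)
theta_hat = online_lsq_sgd(make_stream(50_000, d, theta_star, rng), d)
```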
[375] BATIS: Bayesian Approaches for Targeted Improvement of Species Distribution Models
Catherine Villeneuve, Benjamin Akera, Mélisande Teng, David Rolnick
Main category: cs.LG
TL;DR: BATIS is a Bayesian deep learning framework for species distribution models that iteratively updates prior predictions with limited observational data, improving reliability in data-scarce locations.
Details
Motivation: Current deep learning SDMs perform well on complex datasets but are limited by spatial biases in data, particularly in data-scarce locations.Method: Introduces BATIS framework using Bayesian deep learning to capture both aleatoric and epistemic uncertainty, combining fine-grained local insights with broader ecological patterns through iterative updating of prior predictions.
Result: Extensive benchmarking on eBird citizen science data shows Bayesian deep learning approaches greatly improve SDM reliability in data-scarce locations.
Conclusion: Bayesian deep learning can significantly enhance species distribution model reliability, contributing to better ecological understanding and conservation efforts.
Abstract: Species distribution models (SDMs), which aim to predict species occurrence based on environmental variables, are widely used to monitor and respond to biodiversity change. Recent deep learning advances for SDMs have been shown to perform well on complex and heterogeneous datasets, but their effectiveness remains limited by spatial biases in the data. In this paper, we revisit deep SDMs from a Bayesian perspective and introduce BATIS, a novel and practical framework wherein prior predictions are updated iteratively using limited observational data. Models must appropriately capture both aleatoric and epistemic uncertainty to effectively combine fine-grained local insights with broader ecological patterns. We benchmark an extensive set of uncertainty quantification approaches on a novel dataset including citizen science observations from the eBird platform. Our empirical study shows how Bayesian deep learning approaches can greatly improve the reliability of SDMs in data-scarce locations, which can contribute to ecological understanding and conservation efforts.
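The iterative prior-updating idea can be illustrated with a much simpler conjugate stand-in than the paper's Bayesian deep learning framework: treat the model's prior occurrence probability at a site as a Beta distribution and update it with a handful of local checklists. The prior-strength pseudo-count below is an assumed knob, not something from the paper.

```python
def update_occurrence(prior_prob, n_visits, n_detections, prior_strength=10.0):
    """Beta-Binomial update of a species-occurrence probability at one site.

    prior_prob      -- the SDM's prior prediction in [0, 1]
    n_visits        -- number of local checklists (observation events)
    n_detections    -- how many of those visits detected the species
    prior_strength  -- pseudo-count controlling how much the prior resists data
    """
    a = prior_prob * prior_strength + n_detections
    b = (1.0 - prior_prob) * prior_strength + (n_visits - n_detections)
    posterior_mean = a / (a + b)
    posterior_var = (a * b) / ((a + b) ** 2 * (a + b + 1))  # shrinks as data accumulate
    return posterior_mean, posterior_var

# A prior of 0.7 revised downward after 5 visits with no detections.
print(update_occurrence(0.7, n_visits=5, n_detections=0))
```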
[376] Semantic World Models
Jacob Berg, Chuning Zhu, Yanda Bao, Ishan Durugkar, Abhishek Gupta
Main category: cs.LG
TL;DR: The paper proposes using vision language models as semantic world models that predict task-relevant semantic information about the future instead of reconstructing pixels, enabling better planning for robotic control.
Details
Motivation: Conventional world models predict future pixels, but pixel reconstruction doesn't always correlate with good planning decisions. The authors argue world models only need to predict task-relevant semantic information.Method: Frame world modeling as visual question answering about semantic information in future frames. Train vision language models as semantic world models through supervised finetuning on image-action-text data.
Result: The semantic world model enables policy improvement on open-ended robotics tasks and shows significant generalization improvements over reconstruction-based world modeling approaches.
Conclusion: Vision language models can effectively serve as semantic world models for robotic planning, inheriting generalization and robustness properties while focusing on task-relevant information rather than pixel reconstruction.
Abstract: Planning with world models offers a powerful paradigm for robotic control. Conventional approaches train a model to predict future frames conditioned on current frames and actions, which can then be used for planning. However, the objective of predicting future pixels is often at odds with the actual planning objective; strong pixel reconstruction does not always correlate with good planning decisions. This paper posits that instead of reconstructing future frames as pixels, world models only need to predict task-relevant semantic information about the future. For such prediction the paper poses world modeling as a visual question answering problem about semantic information in future frames. This perspective allows world modeling to be approached with the same tools underlying vision language models. Thus vision language models can be trained as “semantic” world models through a supervised finetuning process on image-action-text data, enabling planning for decision-making while inheriting many of the generalization and robustness properties from the pretrained vision-language models. The paper demonstrates how such a semantic world model can be used for policy improvement on open-ended robotics tasks, leading to significant generalization improvements over typical paradigms of reconstruction-based action-conditional world modeling. Website available at https://weirdlabuw.github.io/swm.
[377] When Do Transformers Learn Heuristics for Graph Connectivity?
Qilin Ye, Deqing Fu, Robin Jia, Vatsal Sharan
Main category: cs.LG
TL;DR: Transformers fail to learn generalizable algorithms and rely on brittle heuristics. Using graph connectivity as a testbed, the paper shows that transformers can only solve graphs with diameters up to 3^L, implementing an adjacency matrix power algorithm. Training dynamics depend on whether graphs are within or beyond this capacity.
Details
Motivation: To explain why transformers often fail to learn generalizable algorithms and instead rely on brittle heuristics, using graph connectivity as a concrete testbed to analyze this phenomenon both theoretically and empirically.Method: Theoretical analysis of a simplified Transformer architecture (disentangled Transformer), proving that an L-layer model can solve graphs with diameters up to 3^L, implementing an algorithm equivalent to computing powers of the adjacency matrix. Analysis of training dynamics and empirical validation.
Result: Transformers learn correct algorithmic solutions for within-capacity graphs (diameter ≤ 3^L) but learn simple degree-based heuristics for beyond-capacity graphs. Restricting training data to within model capacity enables both standard and disentangled transformers to learn the exact algorithm.
Conclusion: The capacity limitations of transformers determine whether they learn generalizable algorithms or brittle heuristics. Training within model capacity boundaries is crucial for learning correct algorithmic solutions rather than simple heuristics.
Abstract: Transformers often fail to learn generalizable algorithms, instead relying on brittle heuristics. Using graph connectivity as a testbed, we explain this phenomenon both theoretically and empirically. We consider a simplified Transformer architecture, the disentangled Transformer, and prove that an $L$-layer model has capacity to solve for graphs with diameters up to exactly $3^L$, implementing an algorithm equivalent to computing powers of the adjacency matrix. We analyze the training-dynamics, and show that the learned strategy hinges on whether most training instances are within this model capacity. Within-capacity graphs (diameter $\leq 3^L$) drive the learning of a correct algorithmic solution while beyond-capacity graphs drive the learning of a simple heuristic based on node degrees. Finally, we empirically demonstrate that restricting training data within a model’s capacity leads to both standard and disentangled transformers learning the exact algorithm rather than the degree-based heuristic.
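The algorithm the theory attributes to an L-layer model, computing powers of the adjacency matrix so that each layer triples the reachable path length, can be written out directly. This is a plain NumPy rendering of that idea, not the transformer construction itself.

```python
import numpy as np

def reachable_within(adj, L):
    """Boolean reachability for all pairs connected by a path of length <= 3**L.

    Starting from paths of length <= 1 (edges plus self-loops), each of the L
    rounds cubes the reachability matrix, tripling the covered path length --
    mirroring the adjacency-power algorithm an L-layer model can implement.
    """
    n = len(adj)
    reach = adj.astype(bool) | np.eye(n, dtype=bool)
    for _ in range(L):
        m = reach.astype(np.uint8)
        reach = (m @ m @ m) > 0
    return reach

# A path graph on 5 nodes: the endpoints are at distance 4 > 3**1, so L=1 misses them.
A = np.diag(np.ones(4), 1)
A = (A + A.T) > 0
print(reachable_within(A, L=1)[0, 4], reachable_within(A, L=2)[0, 4])  # False True
```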
[378] CONFEX: Uncertainty-Aware Counterfactual Explanations with Conformal Guarantees
Aman Bilkhoo, Milad Kazemi, Nicola Paoletti, Mehran Hosseini
Main category: cs.LG
TL;DR: CONFEX is a novel method that generates uncertainty-aware counterfactual explanations using Conformal Prediction and Mixed-Integer Linear Programming, providing local coverage guarantees for robust and reliable explanations.
Details
Motivation: Existing counterfactual explanation methods often neglect predictive uncertainty, which can lead to misleading or inapplicable explanations in regions of high uncertainty.Method: CONFEX combines Conformal Prediction (CP) with Mixed-Integer Linear Programming (MILP) using a novel localized CP procedure that leverages tree-based partitioning of the input space for efficient MILP encoding.
Result: The method generates counterfactual explanations with rigorous guarantees on both predictive uncertainty and optimality, outperforming state-of-the-art methods across diverse benchmarks and metrics.
Conclusion: CONFEX provides robust and plausible uncertainty-aware counterfactual explanations with formal guarantees, addressing the limitations of existing methods that lack principled uncertainty incorporation.
Abstract: Counterfactual explanations (CFXs) provide human-understandable justifications for model predictions, enabling actionable recourse and enhancing interpretability. To be reliable, CFXs must avoid regions of high predictive uncertainty, where explanations may be misleading or inapplicable. However, existing methods often neglect uncertainty or lack principled mechanisms for incorporating it with formal guarantees. We propose CONFEX, a novel method for generating uncertainty-aware counterfactual explanations using Conformal Prediction (CP) and Mixed-Integer Linear Programming (MILP). CONFEX explanations are designed to provide local coverage guarantees, addressing the issue that CFX generation violates exchangeability. To do so, we develop a novel localised CP procedure that enjoys an efficient MILP encoding by leveraging an offline tree-based partitioning of the input space. This way, CONFEX generates CFXs with rigorous guarantees on both predictive uncertainty and optimality. We evaluate CONFEX against state-of-the-art methods across diverse benchmarks and metrics, demonstrating that our uncertainty-aware approach yields robust and plausible explanations.
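The localized conformal ingredient can be sketched independently of the MILP encoding: partition the input space with a tree and compute a per-leaf conformal quantile of calibration residuals. This is a simplified stand-in under standard split-conformal assumptions, not the paper's procedure or its coverage guarantee for counterfactuals.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def localized_conformal_radius(X_cal, residuals, X_query, alpha=0.1, max_leaf_nodes=8):
    """Per-leaf split-conformal quantile of calibration residuals."""
    tree = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    tree.fit(X_cal, residuals)
    cal_leaves = tree.apply(X_cal)
    q = {}
    for leaf in np.unique(cal_leaves):
        r = np.sort(residuals[cal_leaves == leaf])
        k = int(np.ceil((1 - alpha) * (len(r) + 1))) - 1   # finite-sample-corrected rank
        q[leaf] = r[min(k, len(r) - 1)]
    # Each query point inherits the conformal radius of the leaf it falls into.
    return np.array([q[leaf] for leaf in tree.apply(X_query)])
```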
[379] The Tail Tells All: Estimating Model-Level Membership Inference Vulnerability Without Reference Models
Euodia Dodd, Nataša Krčo, Igor Shilov, Yves-Alexandre de Montjoye
Main category: cs.LG
TL;DR: A novel method to estimate model vulnerability to membership inference attacks without requiring computationally expensive reference models, leveraging asymmetric loss distributions and using the absence of high-loss outliers as a predictor.
Details
Motivation: Current state-of-the-art membership inference attacks require training numerous reference models, which is computationally expensive and limits practical application. There's a need for more efficient vulnerability assessment methods.Method: Analyze asymmetric and heavy-tailed loss distributions, observe that vulnerable points shift from high-loss to low-loss regions after training, and use the absence of outliers in high-loss regions to estimate vulnerability. The method uses TNR of a simple loss attack from training and testing distributions alone.
Result: The method accurately estimates model-level vulnerability to state-of-the-art MIA attacks (LiRA) across various architectures and datasets, outperforms low-cost attacks like RMIA and other distribution difference measures, and shows promise for evaluating risk in large-language models.
Conclusion: The proposed approach provides an efficient and accurate way to estimate model vulnerability to membership inference attacks without the computational burden of reference models, with potential applications in large-language model risk assessment.
Abstract: Membership inference attacks (MIAs) have emerged as the standard tool for evaluating the privacy risks of AI models. However, state-of-the-art attacks require training numerous, often computationally expensive, reference models, limiting their practicality. We present a novel approach for estimating model-level vulnerability, the TPR at low FPR, to membership inference attacks without requiring reference models. Empirical analysis shows loss distributions to be asymmetric and heavy-tailed and suggests that most points at risk from MIAs have moved from the tail (high-loss region) to the head (low-loss region) of the distribution after training. We leverage this insight to propose a method to estimate model-level vulnerability from the training and testing distribution alone: using the absence of outliers from the high-loss region as a predictor of the risk. We evaluate our method, the TNR of a simple loss attack, across a wide range of architectures and datasets and show it to accurately estimate model-level vulnerability to the SOTA MIA attack (LiRA). We also show our method to outperform both low-cost (few reference models) attacks such as RMIA and other measures of distribution difference. We finally evaluate the use of non-linear functions to evaluate risk and show the approach to be promising to evaluate the risk in large-language models.
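A minimal version of the "simple loss attack" statistic is easy to write down: flag a point as a member when its loss sits below a threshold covering most training losses, then report the true negative rate on held-out points. The 95% threshold is an assumption for illustration; the paper's exact thresholding and its link to TPR at low FPR are not reproduced here.

```python
import numpy as np

def simple_loss_attack_tnr(train_losses, test_losses, member_quantile=0.95):
    """TNR of a threshold-based loss attack.

    A point is labeled 'member' when its loss is below a threshold set to cover
    most training losses; the TNR is the share of held-out (non-member) points
    whose loss exceeds that threshold and is thus correctly labeled non-member.
    """
    threshold = np.quantile(train_losses, member_quantile)
    return float(np.mean(test_losses > threshold))
```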
[380] GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters
Anand Choudhary, Yasser Sulaıman, Lukas Mauch, Ghouthi Boukli Hacene, Fabien Cardinaux, Antoine Bosselut
Main category: cs.LG
TL;DR: GaLLoP is a sparse fine-tuning technique that selectively tunes model parameters with large gradient magnitudes on downstream tasks and small pre-trained magnitudes, improving performance while mitigating catastrophic forgetting.
Details
Motivation: Sparse fine-tuning effectiveness depends on optimal parameter selection. Current methods need better ways to identify which parameters to tune to maximize task relevance while minimizing disruption to pre-trained knowledge.Method: GaLLoP fine-tunes parameters with the largest gradient magnitudes on downstream tasks and smallest pre-trained magnitudes, prioritizing task-relevant but minimally disruptive parameters.
Result: GaLLoP consistently improves or matches performance of leading PEFT techniques (LoRA, DoRA, SAFT) on LLaMA3 8B and Gemma 2B, with better in-distribution and out-of-distribution performance, reduced catastrophic forgetting, and stabilized generalization across random seeds.
Conclusion: GaLLoP provides an effective sparse fine-tuning approach that balances task adaptation with preservation of pre-trained knowledge, demonstrating robust performance improvements and stability compared to existing methods.
Abstract: Sparse fine-tuning techniques adapt LLMs to downstream tasks by only tuning a sparse subset of model parameters. However, the effectiveness of sparse adaptation depends on optimally selecting the model parameters to be fine-tuned. In this work, we introduce a novel sparse fine-tuning technique named GaLLoP: Gradient-based Sparse Learning on Low-Magnitude Parameters, which fine-tunes only those model parameters which have the largest gradient magnitudes on downstream tasks and the smallest pre-trained magnitudes, intuitively prioritizing parameters that are highly task-relevant, but minimally disruptive to pre-trained knowledge. Our experimentation with LLaMA3 8B and Gemma 2B as base models shows that GaLLoP consistently improves or matches the in-distribution as well as out-of-distribution performance obtained via the usage of other leading parameter-efficient fine-tuning techniques, including LoRA, DoRA, and SAFT. Our analysis demonstrates that GaLLoP mitigates catastrophic forgetting and memorization of task data, as important pre-trained parameters remain unchanged, and stabilizes performance relative to other fine-tuning techniques, robustly generalizing across most random seeds.
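One plausible reading of the selection rule, combining "largest task-gradient magnitude" and "smallest pre-trained magnitude" into a single ratio score, is sketched below; the paper may rank or combine the two criteria differently, so treat the scoring function and sparsity level as assumptions.

```python
import torch

def select_params_to_tune(param, grad, sparsity=0.01, eps=1e-12):
    """Return a boolean mask of entries to fine-tune.

    Scores each entry by |task gradient| / |pre-trained weight|, favoring
    parameters that matter for the downstream task but carry little
    pre-trained signal, then keeps the top `sparsity` fraction.
    """
    score = grad.abs() / (param.abs() + eps)
    k = max(1, int(sparsity * score.numel()))
    threshold = torch.topk(score.flatten(), k).values.min()
    return score >= threshold

# During fine-tuning, gradients of unselected entries would be zeroed before
# each optimizer step, e.g. param.grad.mul_(mask).
```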
[381] Blackbox Model Provenance via Palimpsestic Membership Inference
Rohith Kuditipudi, Jing Huang, Sally Zhu, Diyi Yang, Christopher Potts, Percy Liang
Main category: cs.LG
TL;DR: This paper proposes methods to detect if someone is using a derivative of Alice’s language model by testing for statistical dependencies between Bob’s model/text and the randomized training order of Alice’s model, leveraging memorization patterns in language models.
Details
Motivation: To enable model creators to prove when others are using derivatives of their models, addressing intellectual property concerns in open-weight language models.Method: Formulates the problem as independence testing using palimpsestic memorization - models memorize data seen later in training. Tests correlation between Bob’s model/text and Alice’s randomized training data order through two settings: query setting (direct model prompting) and observational setting (text analysis).
Result: In the query setting, p-values of at most 1e-8 were achieved for all but six of the 40+ fine-tuned models. In the observational setting, the second approach reliably detects usage from as little as a few hundred tokens, while the first requires hundreds of thousands of tokens.
Conclusion: The proposed methods provide statistically significant evidence to detect model derivative usage, with the observational retraining approach being particularly efficient for detection from small text samples.
Abstract: Suppose Alice trains an open-weight language model and Bob uses a blackbox derivative of Alice’s model to produce text. Can Alice prove that Bob is using her model, either by querying Bob’s derivative model (query setting) or from the text alone (observational setting)? We formulate this question as an independence testing problem–in which the null hypothesis is that Bob’s model or text is independent of Alice’s randomized training run–and investigate it through the lens of palimpsestic memorization in language models: models are more likely to memorize data seen later in training, so we can test whether Bob is using Alice’s model using test statistics that capture correlation between Bob’s model or text and the ordering of training examples in Alice’s training run. If Alice has randomly shuffled her training data, then any significant correlation amounts to exactly quantifiable statistical evidence against the null hypothesis, regardless of the composition of Alice’s training data. In the query setting, we directly estimate (via prompting) the likelihood Bob’s model gives to Alice’s training examples and order; we correlate the likelihoods of over 40 fine-tunes of various Pythia and OLMo base models ranging from 1B to 12B parameters with the base model’s training data order, achieving a p-value on the order of at most 1e-8 in all but six cases. In the observational setting, we try two approaches based on estimating 1) the likelihood of Bob’s text overlapping with spans of Alice’s training examples and 2) the likelihood of Bob’s text with respect to different versions of Alice’s model we obtain by repeating the last phase (e.g., 1%) of her training run on reshuffled data. The second approach can reliably distinguish Bob’s text from as little as a few hundred tokens; the first does not involve any retraining but requires many more tokens (several hundred thousand) to achieve high power.
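The core test statistic in the query setting, rank correlation between Bob-model likelihoods of Alice's training examples and their shuffled training positions, can be sketched with a permutation p-value as below. This is a generic rendering of the independence test, not the paper's exact estimator or its analytic p-values.

```python
import numpy as np
from scipy.stats import spearmanr

def order_correlation_test(loglik_under_bob, training_position, n_perm=10_000, seed=0):
    """Permutation test of independence between per-example log-likelihoods
    (under Bob's model) and the position of each example in Alice's randomly
    shuffled training run. Under the null, positions are exchangeable, so
    reshuffling them gives the reference distribution."""
    rng = np.random.default_rng(seed)
    rho, _ = spearmanr(loglik_under_bob, training_position)
    perm_stats = np.array([
        spearmanr(loglik_under_bob, rng.permutation(training_position))[0]
        for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(perm_stats >= rho)) / (1 + n_perm)
    return rho, p_value
```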
[382] Environment Inference for Learning Generalizable Dynamical System
Shixuan Liu, Yue He, Haotian Wang, Wenjing Yang, Yunfei Wang, Peng Cui, Zhong Liu
Main category: cs.LG
TL;DR: DynaInfer is a novel method that infers environment specifications from prediction errors of fixed neural networks, enabling environment assignments without requiring environment labels during training.
Details
Motivation: Current generalization techniques for dynamical systems depend on environment labels, which are often unavailable due to data acquisition challenges, privacy concerns, and environmental variability in large datasets.Method: DynaInfer analyzes prediction errors from fixed neural networks within each training round to infer environment specifications and assign environments directly from data.
Result: Extensive experiments show DynaInfer outperforms existing environment assignment techniques, converges rapidly to true labels, and achieves superior performance even when environment labels are available.
Conclusion: DynaInfer effectively solves the alternating optimization problem in unlabeled scenarios and provides a practical solution for environment assignment without requiring explicit environment labels.
Abstract: Data-driven methods offer efficient and robust solutions for analyzing complex dynamical systems but rely on the assumption of I.I.D. data, driving the development of generalization techniques for handling environmental differences. These techniques, however, are limited by their dependence on environment labels, which are often unavailable during training due to data acquisition challenges, privacy concerns, and environmental variability, particularly in large public datasets and privacy-sensitive domains. In response, we propose DynaInfer, a novel method that infers environment specifications by analyzing prediction errors from fixed neural networks within each training round, enabling environment assignments directly from data. We prove our algorithm effectively solves the alternating optimization problem in unlabeled scenarios and validate it through extensive experiments across diverse dynamical systems. Results show that DynaInfer outperforms existing environment assignment techniques, converges rapidly to true labels, and even achieves superior performance when environment labels are available.
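A toy stand-in for the error-driven environment assignment: hold the predictor fixed, collect each trajectory's prediction-error profile, and cluster those profiles, using the cluster ids as environment labels for the next round. K-means and a known number of environments are assumptions here, not the paper's inference procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_environments(model, trajectories, targets, n_envs):
    """Infer environment labels from per-trajectory prediction-error profiles."""
    errors = np.stack([
        np.abs(model(x) - y).mean(axis=-1)      # error profile over time steps
        for x, y in zip(trajectories, targets)
    ])
    return KMeans(n_clusters=n_envs, n_init=10, random_state=0).fit_predict(errors)

# Each training round: fit the dynamics model on the current assignments, then
# call assign_environments(...) again with the refreshed prediction errors.
```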
[383] Transformers are almost optimal metalearners for linear classification
Roey Magen, Gal Vardi
Main category: cs.LG
TL;DR: Transformers can function as near-optimal metalearners in linear classification, requiring only O(k/R^4) in-context examples to generalize to new tasks, significantly outperforming methods that only use in-context data.
Details
Motivation: To theoretically analyze whether transformers can serve as metalearners that adapt to new tasks using few in-context examples, addressing the gap in formal metalearning analysis for transformers.Method: Analyzed a simplified transformer architecture trained via gradient descent on tasks from class-conditional Gaussian mixture models with means in a shared k-dimensional subspace of R^d.
Result: After sufficient training, transformers generalize to new tasks with O(k/R^4) in-context examples, nearly matching optimal learners that know the shared subspace and outperforming methods using only in-context data.
Conclusion: Transformers can achieve near-optimal metalearning performance in linear classification settings, with training requirements independent of ambient dimension d.
Abstract: Transformers have demonstrated impressive in-context learning (ICL) capabilities, raising the question of whether they can serve as metalearners that adapt to new tasks using only a small number of in-context examples, without any further training. While recent theoretical work has studied transformers’ ability to perform ICL, most of these analyses do not address the formal metalearning setting, where the objective is to solve a collection of related tasks more efficiently than would be possible by solving each task individually. In this paper, we provide the first theoretical analysis showing that a simplified transformer architecture trained via gradient descent can act as a near-optimal metalearner in a linear classification setting. We consider a natural family of tasks where each task corresponds to a class-conditional Gaussian mixture model, with the mean vectors lying in a shared $k$-dimensional subspace of $R^d$. After training on a sufficient number of such tasks, we show that the transformer can generalize to a new task using only $O(k / R^4)$ in-context examples, where $R$ denotes the signal strength at test time. This performance (almost) matches that of an optimal learner that knows exactly the shared subspace and significantly outperforms any learner that only has access to the in-context data, which requires $\Omega(d / R^4)$ examples to generalize. Importantly, our bounds on the number of training tasks and examples per task needed to achieve this result are independent of the ambient dimension $d$.
[384] The Feasibility of Training Sovereign Language Models in the Global South: A Study of Brazil and Mexico
Sandra Malagon, Monica A. Ulloa Ruiz, Tatiana Elizabeth Sandoval Plaza, Gabriel Rafael Rosario Bolívar, Valentina García Mesa, Ivanna Alvarado Morales
Main category: cs.LG
TL;DR: This paper analyzes the feasibility of sovereign-scale language model training in Brazil and Mexico under hardware, energy, and fiscal constraints, finding that H100-based configurations are more cost-effective than A100 alternatives.
Details
Motivation: To address structural asymmetries in AI development between Global North and South by examining whether middle-income countries can achieve technological sovereignty in large-scale language model training despite resource constraints.Method: Used a dual-axis design varying accelerator generation (NVIDIA H100 vs. A100) and training duration (90 vs. 150 days) to estimate compute demand, energy consumption, capital expenditures, and regulatory compatibility for training a 10-trillion-token model.
Result: H100-based scenarios achieved training feasibility at 8-14 million USD total cost, while A100 deployments required 19-32 million USD due to higher energy and hardware demand. All configurations remained below export-control and electrical infrastructure thresholds.
Conclusion: Extending training timelines can serve as a policy lever to mitigate hardware constraints, enabling middle-income countries to develop locally aligned AI models without competing at the global frontier, contributing to sustainable technological sovereignty.
Abstract: The rapid escalation of computational requirements for training large-scale language models has reinforced structural asymmetries between high-capacity jurisdictions and countries in the Global South. This paper examines the technical and fiscal feasibility of sovereign-scale language model training in Brazil and Mexico under conditions of constrained hardware access, energy availability, and fiscal ceilings. Using a dual-axis design that varies accelerator generation (NVIDIA H100 vs. A100) and training duration (90 vs. 150 days), we estimate compute demand, energy consumption, capital expenditures, and regulatory compatibility for the training of a 10-trillion-token model. Our findings show that while all configurations remain below export-control and electrical infrastructure thresholds, fiscal viability is determined by hardware efficiency. H100-based scenarios achieve training feasibility at a total cost of 8-14 million USD, while A100 deployments require 19-32 million USD due to higher energy and hardware demand. We argue that extending training timelines should be treated as a policy lever to mitigate hardware constraints, enabling the production of usable, auditable, and locally aligned models without competing at the global frontier. This study contributes to the discourse on AI compute governance and technological sovereignty by highlighting context-sensitive strategies that allow middle-income countries to establish sustainable and strategically sufficient AI capabilities.
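The kind of back-of-envelope accounting the study performs can be reproduced with the usual 6·N·D FLOP rule of thumb; all numbers below (model size, utilization, GPU throughput, hourly price) are illustrative placeholders, not the paper's inputs or its 8-14 / 19-32 million USD results.

```python
import math

def training_budget(params_billion, tokens_trillion, days,
                    gpu_peak_tflops, mfu, gpu_hour_usd):
    """Rough training budget from the 6 * N * D FLOP rule of thumb."""
    total_flops = 6 * params_billion * 1e9 * tokens_trillion * 1e12
    flops_per_gpu_hour = gpu_peak_tflops * 1e12 * mfu * 3600
    gpu_hours = total_flops / flops_per_gpu_hour
    return {
        "gpu_hours": round(gpu_hours),
        "gpus_for_deadline": math.ceil(gpu_hours / (days * 24)),
        "compute_cost_usd": round(gpu_hours * gpu_hour_usd),
    }

# Illustrative only: a 70B-parameter model on 10T tokens over a 120-day window.
print(training_budget(params_billion=70, tokens_trillion=10, days=120,
                      gpu_peak_tflops=989, mfu=0.35, gpu_hour_usd=3.0))
```

Holding the deadline fixed, lowering the assumed per-GPU throughput (e.g., swapping H100-class for A100-class figures) directly inflates both the GPU count and the compute cost, which is the trade-off the paper's dual-axis design explores.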
[385] FP-IRL: Fokker-Planck Inverse Reinforcement Learning – A Physics-Constrained Approach to Markov Decision Processes
Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati
Main category: cs.LG
TL;DR: FP-IRL is a novel inverse reinforcement learning method that simultaneously infers both reward and transition functions from trajectory data without requiring access to sampled transitions, by leveraging connections between MDPs and Fokker-Planck dynamics.
Details
Motivation: Existing IRL methods require access to the transition function, which is challenging when dynamics are unknown, unobservable, or not easily sampled. This limitation motivates a physics-constrained approach that can infer both reward and transition functions directly from trajectory data.Method: FP-IRL uses a conjectured equivalence between MDPs and Fokker-Planck dynamics, linking reward maximization in MDPs with free energy minimization in FP dynamics. It employs variational system identification to infer the potential function, from which all MDP components (reward, transition, policy) can be recovered using analytic expressions.
Result: Experiments on synthetic benchmarks and a modified Mountain Car problem show that FP-IRL achieves accurate recovery of agent incentives while maintaining computational efficiency and physical interpretability.
Conclusion: FP-IRL provides an effective physics-constrained framework for IRL that eliminates the need for transition function access, enabling simultaneous inference of both reward and transition functions directly from trajectory data with preserved interpretability.
Abstract: Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision process (MDP). However, most existing IRL methods require access to the transition function, either prescribed or estimated \textit{a priori}, which poses significant challenges when the underlying dynamics are unknown, unobservable, or not easily sampled. We propose Fokker–Planck inverse reinforcement learning (FP-IRL), a novel physics-constrained IRL framework tailored for systems governed by Fokker–Planck (FP) dynamics. FP-IRL simultaneously infers both the reward and transition functions directly from trajectory data, without requiring access to sampled transitions. Our method leverages a conjectured equivalence between MDPs and the FP equation, linking reward maximization in MDPs with free energy minimization in FP dynamics. This connection enables inference of the potential function using our inference approach of variational system identification, from which the full set of MDP components – reward, transition, and policy – can be recovered using analytic expressions. We demonstrate the effectiveness of FP-IRL through experiments on synthetic benchmarks and a modified version of the Mountain Car problem. Our results show that FP-IRL achieves accurate recovery of agent incentives while preserving computational efficiency and physical interpretability.
[386] Towards Context-Aware Domain Generalization: Understanding the Benefits and Limits of Marginal Transfer Learning
Jens Müller, Lars Kühmichel, Martin Rohbeck, Stefan T. Radev, Ullrich Köthe
Main category: cs.LG
TL;DR: This paper analyzes when contextual information from same-domain data can improve deep learning model predictions in new domains, providing theoretical conditions and practical criteria for marginal transfer learning in domain generalization.
Details
Motivation: To understand when and how contextual information about input data's domain context can enhance model robustness and predictions in domain generalization scenarios, addressing the trade-off between predictive performance and robustness.Method: Formalizes context as permutation-invariant representations of same-domain data points, provides theoretical analysis of beneficial conditions, formulates two practical verification criteria, and empirically tests these criteria while demonstrating model selection methods.
Result: The proposed criteria effectively identify favorable and unfavorable scenarios for marginal transfer learning, enable reliable detection of unwarranted extrapolation cases in OOD domains, and allow selection between predictive and robust models without the typical performance-robustness trade-off.
Conclusion: Contextual information through marginal transfer learning can improve domain generalization when proper conditions are met, and the developed criteria provide practical tools to identify beneficial scenarios and select appropriate models while avoiding the robustness-performance trade-off.
Abstract: In this work, we analyze the conditions under which information about the context of an input $X$ can improve the predictions of deep learning models in new domains. Following work in marginal transfer learning in Domain Generalization (DG), we formalize the notion of context as a permutation-invariant representation of a set of data points that originate from the same domain as the input itself. We offer a theoretical analysis of the conditions under which this approach can, in principle, yield benefits, and formulate two necessary criteria that can be easily verified in practice. Additionally, we contribute insights into the kind of distribution shifts for which the marginal transfer learning approach promises robustness. Empirical analysis shows that our criteria are effective in discerning both favorable and unfavorable scenarios. Finally, we demonstrate that we can reliably detect scenarios where a model is tasked with unwarranted extrapolation in out-of-distribution (OOD) domains, identifying potential failure cases. Consequently, we showcase a method to select between the most predictive and the most robust model, circumventing the well-known trade-off between predictive performance and robustness.
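The "context" here is a permutation-invariant summary of other points from the same domain; a minimal PyTorch sketch of that construction, with mean pooling as the invariant aggregator and arbitrary layer sizes, is shown below.

```python
import torch
import torch.nn as nn

class ContextAugmentedPredictor(nn.Module):
    """Marginal-transfer setup: summarize a set of same-domain inputs with a
    permutation-invariant (mean-pooled) encoder and concatenate the summary
    to each input before prediction."""

    def __init__(self, d_in, d_ctx=32, d_hidden=64, d_out=1):
        super().__init__()
        self.set_encoder = nn.Sequential(nn.Linear(d_in, d_ctx), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(d_in + d_ctx, d_hidden), nn.ReLU(),
                                  nn.Linear(d_hidden, d_out))

    def forward(self, x, context_set):
        ctx = self.set_encoder(context_set).mean(dim=0)   # order-invariant domain summary
        ctx = ctx.expand(x.shape[0], -1)
        return self.head(torch.cat([x, ctx], dim=-1))

# x: (n, d_in) query inputs; context_set: (m, d_in) other points from the same domain.
model = ContextAugmentedPredictor(d_in=8)
out = model(torch.randn(4, 8), torch.randn(16, 8))
```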
[387] LICO: Large Language Models for In-Context Molecular Optimization
Tung Nguyen, Aditya Grover
Main category: cs.LG
TL;DR: LICO extends LLMs for black-box optimization by adding embedding and prediction layers, enabling competitive performance on molecular optimization benchmarks.
Details
Motivation: LLMs have strong pattern-matching capabilities but struggle with scientific domains due to scarce domain-specific data and difficulty articulating complex problems in natural language.Method: Add separate embedding and prediction layers to base LLMs, train on diverse functions for in-context predictions, then generalize to unseen properties via prompting.
Result: Competitive performance on PMO benchmark (23 objectives) and state-of-the-art on PMO-1K low-budget version.
Conclusion: LICO provides a general-purpose approach to extend LLMs for effective black-box optimization in scientific domains like molecular design.
Abstract: Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO performs competitively on PMO, a challenging molecular optimization benchmark comprising 23 objective functions, and achieves state-of-the-art performance on its low-budget version PMO-1K.
[388] Estimating Long-term Heterogeneous Dose-response Curve: Generalization Bound Leveraging Optimal Transport Weights
Zeqin Yang, Weilin Chen, Ruichu Cai, Yuguang Yan, Zhifeng Hao, Zhipeng Yu, Zhichao Zou, Jixing Xu, Zhen Peng, Jiecheng Guo
Main category: cs.LG
TL;DR: The paper proposes a method to estimate long-term Heterogeneous Dose-Response Curves (HDRC) that handles unobserved confounders and continuous treatment using optimal transport weighting and theoretical generalization bounds.
Details
Motivation: Existing methods for long-term treatment effect estimation rely on ideal assumptions like no unobserved confounders or binary treatment, which are often violated in real-world applications. Average treatment effects are also insufficient for personalized decision-making.Method: Introduces an optimal transport weighting framework to align long-term observational data with auxiliary short-term experimental data to remove unobserved confounders. Establishes a generalization bound on counterfactual prediction error using the reweighted distribution.
Result: Developed a long-term HDRC estimator based on the theoretical foundations. Extensive experiments on synthetic and semi-synthetic datasets demonstrate the effectiveness of the approach.
Conclusion: The proposed method successfully addresses the challenge of estimating long-term heterogeneous dose-response curves while accounting for unobserved confounders and continuous treatment, providing a more general solution than existing approaches.
Abstract: Long-term treatment effect estimation is a significant but challenging problem in many applications. Existing methods rely on ideal assumptions, such as no unobserved confounders or binary treatment, to estimate long-term average treatment effects. However, in numerous real-world applications, these assumptions could be violated, and average treatment effects are insufficient for personalized decision-making. In this paper, we address a more general problem of estimating long-term Heterogeneous Dose-Response Curve (HDRC) while accounting for unobserved confounders and continuous treatment. Specifically, to remove the unobserved confounders in the long-term observational data, we introduce an optimal transport weighting framework to align the long-term observational data to an auxiliary short-term experimental data. Furthermore, to accurately predict the heterogeneous effects of continuous treatment, we establish a generalization bound on counterfactual prediction error by leveraging the reweighted distribution induced by optimal transport. Finally, we develop a long-term HDRC estimator building upon the above theoretical foundations. Extensive experiments on synthetic and semi-synthetic datasets demonstrate the effectiveness of our approach.
[389] Unveiling Transformer Perception by Exploring Input Manifolds
Alessandro Benfenati, Alfio Ferrara, Alessio Marta, Davide Riva, Elisabetta Rocchetti
Main category: cs.LG
TL;DR: A method for exploring equivalence classes in Transformer input space using mathematical theory of manifold deformations and eigendecomposition of the Jacobian pullback metric.
Details
Motivation: To understand and explore the structure of equivalence classes in Transformer models' input space, enabling systematic navigation between inputs that produce similar or different outputs.Method: Uses eigendecomposition of the pullback metric through the model’s Jacobian to reconstruct equivalence classes, with two exploration procedures: finding inputs with same class distributions and navigating to different class distributions.
Result: The method successfully identifies equivalence classes in input space and enables navigation between them, with retrieved instances being interpretable through projection back to human-readable format.
Conclusion: The proposed approach provides a mathematically grounded framework for exploring Transformer input spaces and understanding their internal representations through equivalence class analysis.
Abstract: This paper introduces a general method for the exploration of equivalence classes in the input space of Transformer models. The proposed approach is based on sound mathematical theory which describes the internal layers of a Transformer architecture as sequential deformations of the input manifold. Using eigendecomposition of the pullback of the distance metric defined on the output space through the Jacobian of the model, we are able to reconstruct equivalence classes in the input space and navigate across them. Our method enables two complementary exploration procedures: the first retrieves input instances that produce the same class probability distribution as the original instance-thus identifying elements within the same equivalence class-while the second discovers instances that yield a different class probability distribution, effectively navigating toward distinct equivalence classes. Finally, we demonstrate how the retrieved instances can be meaningfully interpreted by projecting their embeddings back into a human-readable format.
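The central object, the pullback of the output-space metric through the model's Jacobian, is straightforward to compute for a small model and input: directions with near-zero eigenvalues are those along which the class distribution is locally unchanged. The sketch below assumes a model acting on a flat input vector and a Euclidean output metric, which simplifies the paper's setting.

```python
import torch
from torch.autograd.functional import jacobian

def pullback_metric_eig(f, x):
    """Eigendecomposition of G(x) = J(x)^T J(x), the pullback metric at x.

    Near-zero eigenvalues span directions that (locally) leave the model's
    output unchanged, i.e. moves within the same equivalence class.
    """
    J = jacobian(f, x)                 # shape (out_dim, in_dim) for a 1-D input
    G = J.T @ J
    return torch.linalg.eigh(G)        # eigenvalues ascending, with eigenvectors

# Toy model: 3-class probabilities from a 10-dimensional input.
model = torch.nn.Sequential(torch.nn.Linear(10, 16), torch.nn.Tanh(), torch.nn.Linear(16, 3))
x = torch.randn(10)
eigvals, eigvecs = pullback_metric_eig(lambda v: torch.softmax(model(v), dim=-1), x)
```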
[390] Learning Linear Attention in Polynomial Time
Morris Yau, Ekin Akyürek, Jiayuan Mao, Joshua B. Tenenbaum, Stefanie Jegelka, Jacob Andreas
Main category: cs.LG
TL;DR: First polynomial-time learnability results for single-layer Transformers with linear attention, showing they can be learned via linear predictors in RKHS and generalize correctly.
Details
Motivation: Bridge the gap between theoretical expressivity and practical learnability of Transformers, addressing whether simulators of Boolean circuits or Turing machines can be learned from observational data.Method: View linear attention as linear predictor in RKHS, convert learning problem to ordinary linear prediction in expanded feature space, and efficiently identify training datasets for generalization guarantees.
Result: Proved polynomial-time learnability (strong agnostic PAC learning) for linear Transformers, with examples including associative memories, finite automata, and bounded UTMs. Empirical validation on three tasks.
Conclusion: Flexible and general models of computation are efficiently learnable, bridging critical gap between expressivity and learnability of Transformers.
Abstract: Previous research has explored the computational expressivity of Transformer models in simulating Boolean circuits or Turing machines. However, the learnability of these simulators from observational data has remained an open question. Our study addresses this gap by providing the first polynomial-time learnability results (specifically strong, agnostic PAC learning) for single-layer Transformers with linear attention. We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS. As a consequence, the problem of learning any linear transformer may be converted into the problem of learning an ordinary linear predictor in an expanded feature space, and any such predictor may be converted back into a multiheaded linear transformer. Moving to generalization, we show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent (up to trivial symmetries) to the linear Transformer that generated the data, thereby guaranteeing the learned model will correctly generalize across all inputs. Finally, we provide examples of computations expressible via linear attention and therefore polynomial-time learnable, including associative memories, finite automata, and a class of Universal Turing Machine (UTMs) with polynomially bounded computation histories. We empirically validate our theoretical findings on three tasks: learning random linear attention networks, key–value associations, and learning to execute finite automata. Our findings bridge a critical gap between theoretical expressivity and learnability of Transformers, and show that flexible and general models of computation are efficiently learnable.
[391] Model-based Large Language Model Customization as Service
Zhaomin Wu, Jizhou Guo, Junyi Hou, Bingsheng He, Lixin Fan, Qiang Yang
Main category: cs.LG
TL;DR: Llamdex is a privacy-preserving LLM customization framework that allows clients to upload pre-trained domain-specific models instead of sensitive data, achieving better accuracy than DP data synthesis methods while maintaining inference efficiency.
Details
Motivation: Current LLM customization services require uploading sensitive data for fine-tuning, creating privacy risks. DP data synthesis alternatives suffer from low effectiveness due to excessive noise.Method: Clients upload pre-trained domain-specific models (optionally DP-protected with lower noise) that are inserted into base LLMs via connection modules trained without sensitive domain data.
Result: Llamdex improves domain-specific accuracy by up to 26% over state-of-the-art private data synthesis methods under identical privacy constraints, while maintaining inference efficiency comparable to original LLM services.
Conclusion: Llamdex enables effective LLM customization while preserving data privacy, eliminating the need for users to provide domain context in queries and outperforming existing privacy-preserving methods.
Abstract: Prominent Large Language Model (LLM) services from providers like OpenAI and Google excel at general tasks but often underperform on domain-specific applications. Current customization services for these LLMs typically require users to upload data for fine-tuning, posing significant privacy risks. While differentially private (DP) data synthesis presents a potential alternative, its application commonly results in low effectiveness due to the introduction of excessive noise on data for DP. To overcome this, we introduce Llamdex, a novel framework that facilitates LLM customization as a service, where the client uploads pre-trained domain-specific models rather than data. This client-uploaded model, optionally protected by DP with much lower noise, is inserted into the base LLM via connection modules. Significantly, these connecting modules are trained without requiring sensitive domain data, enabling clients to customize LLM services while preserving data privacy. Experiments demonstrate that Llamdex improves domain-specific accuracy by up to 26% over state-of-the-art private data synthesis methods under identical privacy constraints and, by obviating the need for users to provide domain context within queries, maintains inference efficiency comparable to the original LLM service.
[392] Benchmarking Large Language Models with Integer Sequence Generation Tasks
Daniel O’Malley, Manish Bhattarai, Javier Santos, Nishath Rajiv Ranasinghe, Erick Draayer
Main category: cs.LG
TL;DR: A new benchmark using OEIS integer sequences to evaluate LLMs’ mathematical reasoning and code synthesis capabilities, with cheating detection to prevent lookup table usage.
Details
Motivation: To rigorously test LLMs' true algorithmic reasoning abilities in mathematical tasks without relying on memorization or lookup tables.Method: Uses 1000 OEIS sequences (500 classical, 500 recent) categorized as easy/hard, with automated cheating detection to flag lookup table usage, evaluated across major LLM providers.
Result: Reasoning-specialized models (OpenAI o-series, Gemini 2.5-pro) show significant improvements over non-reasoning models, but overall performance on hard sequences remains poor.
Conclusion: Current LLMs still struggle with complex mathematical reasoning tasks, highlighting the need for further advancements in algorithmic reasoning capabilities.
Abstract: We present a novel benchmark designed to rigorously evaluate the capabilities of large language models (LLMs) in mathematical reasoning and algorithmic code synthesis tasks. The benchmark comprises integer sequence generation tasks sourced from the Online Encyclopedia of Integer Sequences (OEIS), testing LLMs’ abilities to accurately and efficiently generate Python code to compute these sequences without using lookup tables. Our comprehensive evaluation includes leading models from OpenAI (including the specialized reasoning-focused o-series), Anthropic, Meta, and Google across a carefully selected set of 1000 OEIS sequences categorized as “easy” or “hard.” Half of these sequences are classical sequences from the early days of OEIS and half were recently added to avoid contamination with the models’ training data. To prevent models from exploiting memorized sequence values, we introduce an automated cheating detection mechanism that flags usage of lookup tables, validated by comparison with human expert evaluations. Experimental results demonstrate that reasoning-specialized models (o3, o3-mini, o4-mini from OpenAI, and Gemini 2.5-pro from Google) achieve substantial improvements in accuracy over non-reasoning models, especially on more complex tasks. However, overall model performance on the hard sequences is poor, highlighting persistent challenges in algorithmic reasoning. Our benchmark provides important insights into the strengths and limitations of state-of-the-art LLMs, particularly emphasizing the necessity for further advancements to reliably solve complex mathematical reasoning tasks algorithmically.
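A crude heuristic in the spirit of the cheating detector, flagging generated code that embeds a long literal list of integers instead of computing them, is sketched below; the threshold and the AST-based rule are assumptions, not the paper's validated mechanism.

```python
import ast

def flags_lookup_table(source: str, max_literal_ints: int = 20) -> bool:
    """Heuristically flag generated Python code that hard-codes sequence values.

    Returns True if the code contains a literal list or tuple with more than
    `max_literal_ints` integer constants, which usually indicates the model
    pasted memorized sequence terms rather than computing them.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, (ast.List, ast.Tuple)):
            ints = [e for e in node.elts
                    if isinstance(e, ast.Constant) and isinstance(e.value, int)]
            if len(ints) > max_literal_ints:
                return True
    return False
```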
[393] Explainable fault and severity classification for rolling element bearings using Kolmogorov-Arnold networks
Spyros Rigas, Michalis Papachristou, Ioannis Sotiropoulos, Georgios Alexandridis
Main category: cs.LG
TL;DR: A methodology using Kolmogorov-Arnold Networks for bearing fault diagnosis that achieves automatic feature selection, hyperparameter tuning, and interpretable analysis in a unified framework, delivering lightweight models with perfect F1-Scores.
Details
Motivation: Bearing faults are a leading cause of machinery failures causing costly downtime and reduced productivity, necessitating efficient and reliable fault diagnosis methods.Method: Utilizes Kolmogorov-Arnold Networks with shallow architectures for automatic feature selection and hyperparameter tuning, producing lightweight models with explainable results through feature attribution and symbolic representations.
Result: Achieved perfect F1-Scores for fault detection and high performance in fault/severity classification (100% F1-Scores in most cases), demonstrated adaptability to diverse fault types including imbalance and misalignment.
Conclusion: The framework shows strong potential for practical real-time machinery monitoring and scientific research requiring efficient and explainable models.
Abstract: Rolling element bearings are critical components of rotating machinery, with their performance directly influencing the efficiency and reliability of industrial systems. At the same time, bearing faults are a leading cause of machinery failures, often resulting in costly downtime, reduced productivity, and, in extreme cases, catastrophic damage. This study presents a methodology that utilizes Kolmogorov-Arnold Networks to address these challenges through automatic feature selection, hyperparameter tuning and interpretable fault analysis within a unified framework. By training shallow network architectures and minimizing the number of selected features, the framework produces lightweight models that deliver explainable results through feature attribution and symbolic representations of their activation functions. Validated on two widely recognized datasets for bearing fault diagnosis, the framework achieved perfect F1-Scores for fault detection and high performance in fault and severity classification tasks, including 100% F1-Scores in most cases. Notably, it demonstrated adaptability by handling diverse fault types, such as imbalance and misalignment, within the same dataset. The symbolic representations enhanced model interpretability, while feature attribution offered insights into the optimal feature types or signals for each studied task. These results highlight the framework’s potential for practical applications, such as real-time machinery monitoring, and for scientific research requiring efficient and explainable models.
[394] DR-VIDAL – Doubly Robust Variational Information-theoretic Deep Adversarial Learning for Counterfactual Prediction and Treatment Effect Estimation on Real World Data
Shantanu Ghosh, Zheng Feng, Jiang Bian, Kevin Butler, Mattia Prosperi
Main category: cs.LG
TL;DR: DR-VIDAL is a novel causal deep learning framework that combines variational autoencoders, information-theoretic GANs, and doubly robust estimation to provide unbiased individualized treatment effect estimation from observational data.
Details
Motivation: Estimating causal effects from observational data is challenging due to bias, and existing methods need improvement for accurate individualized treatment effect estimation in real-world applications like treatment repurposing using electronic health records.Method: DR-VIDAL integrates: (1) VAE to factorize confounders into latent variables based on causal assumptions, (2) Info-GAN to generate counterfactuals, and (3) doubly robust block with treatment propensities for outcome predictions, ensuring unbiased estimation even when one model is misspecified.
Result: DR-VIDAL achieves better performance than other non-generative and generative methods on synthetic and real-world datasets including Infant Health and Development Program, Twin Birth Registry, and National Supported Work Program.
Conclusion: DR-VIDAL uniquely combines causal assumptions, VAE, Info-GAN, and doubly robust estimation into a comprehensive framework that provides performant and unbiased individualized treatment effect estimation from observational data.
Abstract: Determining causal effects of interventions onto outcomes from real-world, observational (non-randomized) data, e.g., treatment repurposing using electronic health records, is challenging due to underlying bias. Causal deep learning has improved over traditional techniques for estimating individualized treatment effects (ITE). We present the Doubly Robust Variational Information-theoretic Deep Adversarial Learning (DR-VIDAL), a novel generative framework that combines two joint models of treatment and outcome, ensuring an unbiased ITE estimation even when one of the two is misspecified. DR-VIDAL integrates: (i) a variational autoencoder (VAE) to factorize confounders into latent variables according to causal assumptions; (ii) an information-theoretic generative adversarial network (Info-GAN) to generate counterfactuals; (iii) a doubly robust block incorporating treatment propensities for outcome predictions. On synthetic and real-world datasets (Infant Health and Development Program, Twin Birth Registry, and National Supported Work Program), DR-VIDAL achieves better performance than other non-generative and generative methods. In conclusion, DR-VIDAL uniquely fuses causal assumptions, VAE, Info-GAN, and doubly robustness into a comprehensive, performant framework. Code is available at: https://github.com/Shantanu48114860/DR-VIDAL-AMIA-22 under MIT license.
[395] Graph Representation Learning with Diffusion Generative Models
Daniel Wesego
Main category: cs.LG
TL;DR: This paper proposes using discrete diffusion models for graph representation learning, addressing the challenge of applying diffusion models to discrete graph-structured data through an autoencoder framework.
Details
Motivation: Diffusion models have shown strong generative capabilities in continuous domains like images and videos, but their application to discrete graph data remains underexplored due to the need for discrete diffusion processes.Method: The authors train a discrete diffusion model within an autoencoder framework, combining encoder outputs with decoder’s first time step hidden embeddings to learn graph representations.
Result: The approach demonstrates that discrete diffusion models can effectively learn meaningful embeddings for graph-structured data while enabling both autoencoding and representation learning.
Conclusion: Discrete diffusion models show promising potential for graph representation learning, offering a novel approach that leverages their representational capabilities for graph-structured data.
Abstract: Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Not only that, diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. We extract the representation from the combination of the encoder’s output and the decoder’s first time step hidden embedding. Our approach demonstrates the potential of discrete diffusion models to be used for graph representation learning. The code can be found at https://github.com/DanielMitiku/Graph-Representation-Learning-with-Diffusion-Generative-Models
[396] Source-Free Domain Adaptation for SSVEP-based Brain-Computer Interfaces
Osman Berke Guney, Deniz Kucukahmetler, Huseyin Ozkan
Main category: cs.LG
TL;DR: A novel domain adaptation method for SSVEP-based BCI spellers that eliminates the need for extensive calibration by adapting pre-trained DNNs to new users using only unlabeled data, achieving high ITRs without user discomfort.
Details
Motivation: To address the discomfort caused by extensive calibration periods in SSVEP-based BCI spellers for new users, while maintaining high information transfer rates.Method: Adapts pre-trained DNNs to new users using unlabeled target data by minimizing a custom loss function with self-adaptation (pseudo-label strategy) and local-regularity terms that enforce similar labels for adjacent instances.
Result: Achieves excellent ITRs of 201.15 bits/min and 145.02 bits/min on benchmark and BETA datasets, outperforming state-of-the-art alternatives.
Conclusion: The method prioritizes user comfort by removing calibration burden while maintaining high accuracy and ITR, potentially accelerating BCI adoption in everyday life.
Abstract: Objective: SSVEP-based BCI spellers assist individuals experiencing speech difficulties by enabling them to communicate at a fast rate. However, achieving a high information transfer rate (ITR) in most prominent methods requires an extensive calibration period before using the system, leading to discomfort for new users. We address this issue by proposing a novel method that adapts a powerful deep neural network (DNN) pre-trained on data from source domains (data from former users or participants of previous experiments), to the new user (target domain) using only unlabeled target data. Approach: Our method adapts the pre-trained DNN to the new user by minimizing our proposed custom loss function composed of self-adaptation and local-regularity terms. The self-adaptation term uses the pseudo-label strategy, while the novel local-regularity term exploits the data structure and forces the DNN to assign similar labels to adjacent instances. Main results: Our method achieves excellent ITRs of 201.15 bits/min and 145.02 bits/min on the benchmark and BETA datasets, respectively, and outperforms the state-of-the-art alternatives. Our code is available at https://github.com/osmanberke/SFDA-SSVEP-BCI Significance: The proposed method prioritizes user comfort by removing the burden of calibration while maintaining an excellent character identification accuracy and ITR. Because of these attributes, our approach could significantly accelerate the adoption of BCI systems into everyday life.
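A rough sketch of the kind of loss described above, combining a pseudo-label self-adaptation term with a local-regularity term that pushes adjacent (nearest-neighbor) instances toward similar predictions. The neighbor definition, the KL form of the regularity term, and the weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def sfda_loss(logits, features, k=5, lam=1.0):
    """Self-adaptation (pseudo-label) + local-regularity loss on unlabeled target data.

    logits   : (n, c) model outputs for a batch of unlabeled target instances
    features : (n, d) embeddings used to define adjacency (nearest neighbors)
    k        : number of neighbors for the local-regularity term
    lam      : weight of the local-regularity term (illustrative value)
    """
    probs = logits.softmax(dim=1)

    # Self-adaptation: cross-entropy against the model's own hard pseudo-labels.
    pseudo = probs.argmax(dim=1).detach()
    loss_self = F.cross_entropy(logits, pseudo)

    # Local regularity: adjacent instances should receive similar predictions.
    dist = torch.cdist(features.detach(), features.detach())   # (n, n) distances
    dist.fill_diagonal_(float("inf"))                           # exclude self-matches
    nn_idx = dist.topk(k, largest=False).indices                # (n, k) neighbors
    neighbor_probs = probs[nn_idx].mean(dim=1).detach()         # average neighbor prediction
    loss_local = F.kl_div(probs.log(), neighbor_probs, reduction="batchmean")

    return loss_self + lam * loss_local

# Toy usage with random tensors standing in for a pre-trained DNN's outputs.
logits = torch.randn(32, 4)
features = torch.randn(32, 16)
print(sfda_loss(logits, features).item())
```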
[397] AtomSurf : Surface Representation for Learning on Protein Structures
Vincent Mallet, Souhaib Attaiki, Yangyang Miao, Bruno Correia, Maks Ovsjanikov
Main category: cs.LG
TL;DR: This paper addresses the gap in comparing surface-based learning methods with graph representations for protein data. It adapts a surface encoder, performs fair comparisons showing limitations of pure surface learning, and proposes an integrated approach that shares features between graphs and surfaces.
Details
Motivation: There is a lack of direct and fair benchmark comparison between surface-based learning methods and alternative representations like graphs for protein data. Existing approaches either use surface information in isolation or perform basic global pooling.Method: The authors adapt a state-of-the-art surface encoder, perform fair comparisons within Atom3D benchmark, and propose an integrated approach that enables learned feature sharing between graphs and surface representations at the node/vertex level across all layers.
Result: The integrated architecture achieves state-of-the-art results on all tasks in the Atom3D benchmark, as well as on binding site identification and binding pocket classification. The approach is optimized for efficiency with competitive training and inference times.
Conclusion: The proposed integrated approach that shares features between graphs and surface representations outperforms pure surface-based learning and achieves state-of-the-art performance across multiple protein learning tasks while maintaining efficiency.
Abstract: While there has been significant progress in evaluating and comparing different representations for learning on protein data, the role of surface-based learning approaches remains not well-understood. In particular, there is a lack of direct and fair benchmark comparison between the best available surface-based learning methods against alternative representations such as graphs. Moreover, the few existing surface-based approaches either use surface information in isolation or, at best, perform global pooling between surface and graph-based architectures. In this work, we fill this gap by first adapting a state-of-the-art surface encoder for protein learning tasks. We then perform a direct and fair comparison of the resulting method against alternative approaches within the Atom3D benchmark, highlighting the limitations of pure surface-based learning. Finally, we propose an integrated approach, which allows learned feature sharing between graphs and surface representations on the level of nodes and vertices across all layers. We demonstrate that the resulting architecture achieves state-of-the-art results on all tasks in the Atom3D benchmark, while adhering to the strict benchmark protocol, as well as more broadly on binding site identification and binding pocket classification. Furthermore, we use coarsened surfaces and optimize our approach for efficiency, making our tool competitive in training and inference time with existing techniques. Code can be found online: https://github.com/Vincentx15/atomsurf
[398] An Efficient Local Search Approach for Polarized Community Discovery in Signed Networks
Linus Aronsson, Morteza Haghir Chehreghani
Main category: cs.LG
TL;DR: A method for detecting k polarized communities in signed networks that addresses size imbalance issues and scales to large networks with neutral vertices.
Details
Motivation: Signed networks with positive/negative edges are crucial for analyzing polarization and conflict in social systems, but existing methods produce size-imbalanced communities and struggle with neutral vertices.Method: Novel optimization objective to avoid size imbalance, combined with a local search algorithm that extends to networks with neutral vertices and connects to block-coordinate Frank-Wolfe optimization.
Result: Proven linear convergence rate and experiments show consistent outperformance of state-of-the-art baselines in solution quality while maintaining computational efficiency.
Conclusion: The proposed method effectively addresses limitations of prior approaches for polarized community detection in signed networks, providing balanced solutions that scale to large networks with neutral vertices.
Abstract: Signed networks, where edges are labeled as positive or negative to represent friendly or antagonistic interactions, provide a natural framework for analyzing polarization, trust, and conflict in social systems. Detecting meaningful group structures in such networks is crucial for understanding online discourse, political divisions, and trust dynamics. A key challenge is to identify communities that are internally cohesive and externally antagonistic, while allowing for neutral or unaligned vertices. In this paper, we propose a method for identifying $k$ polarized communities that addresses a major limitation of prior methods: their tendency to produce highly size-imbalanced solutions. We introduce a novel optimization objective that avoids such imbalance. In addition, it is well known that approximation algorithms based on local search are highly effective for clustering signed networks when neutral vertices are not allowed. We build on this idea and design the first local search algorithm that extends to the setting with neutral vertices while scaling to large networks. By connecting our approach to block-coordinate Frank-Wolfe optimization, we prove a linear convergence rate, enabled by the structure of our objective. Experiments on real-world and synthetic datasets demonstrate that our method consistently outperforms state-of-the-art baselines in solution quality, while remaining competitive in computational efficiency.
[399] Phase-driven Domain Generalizable Learning for Nonstationary Time Series
Payal Mohapatra, Lixu Wang, Qi Zhu
Main category: cs.LG
TL;DR: PhASER is a phase-driven framework for learning generalizable time-series representations that addresses distribution shifts and nonstationarity by using phase information as a proxy for nonstationarity through Hilbert transform-based augmentation, separate magnitude-phase encoding, and phase-residual feature broadcasting.
Details
Motivation: Real-world time-series data often experience distribution shifts and inherent nonstationarity (variations in statistical/spectral properties over time), making it challenging to learn generalizable representations for classification tasks.Method: 1) Hilbert transform-based augmentation to diversify nonstationarity while preserving task semantics; 2) Separate magnitude-phase encoding treating them as independent modalities; 3) Phase-residual feature broadcasting integrating 2D phase features with residual connections to 1D signal representations.
Result: PhASER consistently outperforms 13 state-of-the-art baselines by an average of 5% and up to 11% on five datasets from sleep-stage classification, human activity recognition, and gesture recognition.
Conclusion: Phase information serves as an effective proxy for nonstationarity, and the PhASER framework provides a principled approach to improve generalizability in time-series representation learning that can be broadly applied to existing models.
Abstract: Pattern recognition is a fundamental task in continuous sensing applications, but real-world scenarios often experience distribution shifts that necessitate learning generalizable representations for such tasks. This challenge is exacerbated with time-series data, which also exhibit inherent nonstationarity–variations in statistical and spectral properties over time. In this work, we offer a fresh perspective on learning generalizable representations for time-series classification by considering the phase information of a signal as an approximate proxy for nonstationarity and propose a phase-driven generalizable representation learning framework for time-series classification, PhASER. It consists of three key elements: 1) Hilbert transform-based augmentation, which diversifies nonstationarity while preserving task-specific discriminatory semantics, 2) separate magnitude-phase encoding, viewing time-varying magnitude and phase as independent modalities, and 3) phase-residual feature broadcasting, integrating 2D phase features with a residual connection to the 1D signal representation, providing inherent regularization to improve distribution-invariant learning. Extensive evaluations on five datasets from sleep-stage classification, human activity recognition, and gesture recognition against 13 state-of-the-art baseline methods demonstrate that PhASER consistently outperforms the best baselines by an average of 5% and up to 11% in some cases. Additionally, the principles of PhASER can be broadly applied to enhance the generalizability of existing time-series representation learning models.
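The magnitude-phase decomposition that PhASER treats as separate modalities can be obtained from the analytic signal. Below is a minimal numpy/scipy sketch; the resynthesis-based augmentation at the end is an assumption standing in for the paper's Hilbert transform-based augmentation.

```python
import numpy as np
from scipy.signal import hilbert

def magnitude_phase(x):
    """Split a real 1-D signal into instantaneous magnitude and phase
    via its analytic signal (Hilbert transform)."""
    analytic = hilbert(x)                   # complex analytic signal
    magnitude = np.abs(analytic)            # instantaneous amplitude envelope
    phase = np.unwrap(np.angle(analytic))   # unwrapped instantaneous phase
    return magnitude, phase

# Toy nonstationary signal: a chirp with a drifting amplitude.
t = np.linspace(0, 1, 1000)
x = (1 + 0.5 * t) * np.sin(2 * np.pi * (5 + 10 * t) * t)
mag, ph = magnitude_phase(x)

# One simple phase-driven augmentation (an assumption, not the paper's exact recipe):
# perturb the phase while keeping the magnitude, then resynthesize the signal.
ph_aug = ph + np.pi / 8
x_aug = mag * np.cos(ph_aug)
print(mag.shape, ph.shape, x_aug.shape)
```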
[400] LoRA vs Full Fine-tuning: An Illusion of Equivalence
Reece Shuttleworth, Jacob Andreas, Antonio Torralba, Pratyusha Sharma
Main category: cs.LG
TL;DR: LoRA fine-tuning creates new high-ranking singular vectors (intruder dimensions) that cause forgetting of pre-training knowledge, while full fine-tuning preserves spectral structure better.
Details
Motivation: To understand if LoRA and full fine-tuning produce equivalent solutions by analyzing their spectral properties and impact on model behavior.Method: Analyzed weight matrices using singular value decomposition, identified intruder dimensions in LoRA-trained models, and performed causal interventions by scaling singular values.
Result: LoRA creates intruder dimensions that cause forgetting; scaling them down improves pre-training distribution modeling with minimal task performance drop. LoRA accumulates intruder dimensions in continual learning, leading to worse performance.
Conclusion: LoRA and full fine-tuning produce fundamentally different solutions due to intruder dimensions, which have practical implications for continual learning scenarios.
Abstract: Fine-tuning is a crucial paradigm for adapting pre-trained large language models to downstream tasks. Recently, methods like Low-Rank Adaptation (LoRA) have been shown to effectively fine-tune LLMs with an extreme reduction in trainable parameters. But, \emph{are their learned solutions really equivalent?} We study how LoRA and full fine-tuning change pre-trained models by analyzing the model’s weight matrices through the lens of their spectral properties. We find that LoRA and full fine-tuning yield weight matrices whose singular value decompositions exhibit very different structure: weight matrices trained with LoRA have new, high-ranking singular vectors, which we call \emph{intruder dimensions}, while those trained with full fine-tuning do not. Further, we extend the finding that LoRA forgets less than full fine-tuning and find its forgetting is largely localized to the intruder dimensions: by causally intervening on the intruder dimensions, changing their associated singular values post-fine-tuning, we show that they cause forgetting. Moreover, scaling them down significantly improves modeling of the pre-training distribution with a minimal drop in downstream task performance. Given this, we should expect accumulating intruder dimensions to be harmful and lead to more forgetting. This effect is amplified during continual learning because of sequential fine-tuning, and we show that LoRA models do accumulate intruder dimensions in this setting and tend to perform worse, emphasizing the practicality of our findings.
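A minimal numpy sketch of how one might flag intruder dimensions: take the top singular vectors of the fine-tuned weight matrix and mark those with low maximum cosine similarity to every pre-trained singular vector. The rank cutoff, similarity threshold, and toy matrices are illustrative choices, not the paper's.

```python
import numpy as np

def intruder_dimensions(w_pre, w_ft, top_k=10, sim_threshold=0.5):
    """Flag high-ranking singular vectors of the fine-tuned weight matrix that
    have no close counterpart among the pre-trained singular vectors.

    w_pre, w_ft   : (out, in) weight matrices before / after fine-tuning
    top_k         : how many top singular vectors of w_ft to inspect
    sim_threshold : cosine-similarity cutoff below which a vector is an "intruder"
    (both top_k and sim_threshold are illustrative choices)
    """
    u_pre, _, _ = np.linalg.svd(w_pre, full_matrices=False)
    u_ft, s_ft, _ = np.linalg.svd(w_ft, full_matrices=False)
    intruders = []
    for i in range(top_k):
        # Maximum |cosine similarity| with any pre-trained left singular vector.
        max_sim = np.max(np.abs(u_pre.T @ u_ft[:, i]))
        if max_sim < sim_threshold:
            intruders.append((i, float(s_ft[i]), float(max_sim)))
    return intruders

# Toy example: add a strong rank-1 "LoRA-like" update along a fresh random direction;
# that direction typically surfaces as an intruder dimension.
rng = np.random.default_rng(0)
w0 = 0.05 * rng.normal(size=(64, 64))
u_new = rng.normal(size=(64, 1)); v_new = rng.normal(size=(1, 64))
w1 = w0 + 5.0 * (u_new / np.linalg.norm(u_new)) @ (v_new / np.linalg.norm(v_new))
print(intruder_dimensions(w0, w1))
```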
[401] FairGen: Controlling Sensitive Attributes for Fair Generations in Diffusion Models via Adaptive Latent Guidance
Mintong Kang, Vinayshekhar Bannihatti Kumar, Shamik Roy, Abhishek Kumar, Sopan Khosla, Balakrishnan Murali Narayanaswamy, Rashmi Gangadharaiah
Main category: cs.LG
TL;DR: FairGen is an adaptive latent guidance mechanism that mitigates demographic biases in text-to-image diffusion models while preserving generation quality, achieving 68.5% gender bias reduction on Stable Diffusion 2.
Details
Motivation: Text-to-image diffusion models exhibit biases toward specific demographic groups (e.g., generating more males than females for engineers), raising ethical concerns and limiting adoption.Method: Proposes FairGen with two modules: latent guidance dynamically adjusts diffusion process to enforce specific attributes, and memory module tracks generation statistics to steer guidance toward fair distribution.
Result: Extensive evaluations show FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction (68.5% gender bias reduction on SD2) and introduces Holistic Bias Evaluation benchmark.
Conclusion: FairGen effectively mitigates generation bias while preserving quality, offers flexible control over output distribution at user-specified granularity, and enables adaptive targeted bias mitigation.
Abstract: Text-to-image diffusion models often exhibit biases toward specific demographic groups, such as generating more males than females when prompted to generate images of engineers, raising ethical concerns and limiting their adoption. In this paper, we tackle the challenge of mitigating generation bias towards any target attribute value (e.g., “male” for “gender”) in diffusion models while preserving generation quality. We propose FairGen, an adaptive latent guidance mechanism which controls the generation distribution during inference. In FairGen, a latent guidance module dynamically adjusts the diffusion process to enforce specific attributes, while a memory module tracks the generation statistics and steers latent guidance to align with the targeted fair distribution of the attribute values. Furthermore, we address the limitations of existing datasets by introducing the Holistic Bias Evaluation (HBE) benchmark, which covers diverse domains and incorporates complex prompts to assess bias more comprehensively. Extensive evaluations on HBE and Stable Bias datasets demonstrate that FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction (e.g., 68.5% gender bias reduction on Stable Diffusion 2). Ablation studies highlight FairGen’s ability to flexibly control the output distribution at any user-specified granularity, ensuring adaptive and targeted bias mitigation.
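A toy sketch of the memory-module idea: track how often each attribute value has been generated and pick the value with the largest shortfall relative to the target distribution as the next guidance target. The class and its update rule are hypothetical, and the latent-guidance step that actually enforces the attribute during diffusion is not shown.

```python
from collections import Counter

class FairnessMemory:
    """Track generation statistics and pick the attribute value to enforce next
    so the running distribution approaches the target distribution.
    (Illustrative sketch; the actual latent-guidance mechanism is separate.)"""

    def __init__(self, target):
        self.target = target              # e.g. {"male": 0.5, "female": 0.5}
        self.counts = Counter()

    def next_target(self):
        total = sum(self.counts.values())
        if total == 0:
            return max(self.target, key=self.target.get)
        # Choose the attribute value with the largest shortfall vs. its target share.
        deficits = {a: p - self.counts[a] / total for a, p in self.target.items()}
        return max(deficits, key=deficits.get)

    def record(self, attribute_value):
        self.counts[attribute_value] += 1

# Toy usage: the memory alternates targets to keep the output distribution balanced.
mem = FairnessMemory({"male": 0.5, "female": 0.5})
for _ in range(6):
    a = mem.next_target()     # attribute the guidance module should enforce next
    mem.record(a)             # pretend the generated image realized that attribute
print(dict(mem.counts))
```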
[402] Survey of Graph Neural Network for Internet of Things and NextG Networks
Sabarish Krishna Moorthy, Jithin Jagannath
Main category: cs.LG
TL;DR: This survey paper provides a comprehensive review of Graph Neural Networks (GNNs) applications in IoT and NextG networks, covering data fusion, intrusion detection, spectrum awareness, networking, and tactical systems, while discussing challenges and future research directions.
Details
Motivation: The exponential growth of IoT devices and 6G networks has created massive data that requires efficient machine learning approaches. GNNs are promising for modeling complex network structures in wireless systems due to their high performance, scalability, and resource efficiency.Method: The survey systematically reviews GNN applications by first explaining GNN terminologies, architecture, and types, then comprehensively examining GNN advancements in IoT data fusion, intrusion detection, spectrum awareness, networking, and tactical systems.
Result: The paper provides a detailed resource showing GNN’s state-of-the-art applications in wireless networks, contrasting GNN approaches with other machine learning methods, and identifying current applications across various IoT and NextG domains.
Conclusion: GNNs show significant potential for IoT and NextG networks, but challenges remain that require further research. The survey serves as a comprehensive resource to motivate continued development and application of GNNs in wireless communication systems.
Abstract: The exponential increase in Internet of Things (IoT) devices coupled with 6G pushing towards higher data rates and connected devices has sparked a surge in data. Consequently, harnessing the full potential of data-driven machine learning has become one of the important thrusts. In addition to the advancement in wireless technology, it is important to efficiently use the resources available and meet the users’ requirements. Graph Neural Networks (GNNs) have emerged as a promising paradigm for effectively modeling systems that inherently exhibit complex network structures and extracting insights from them, owing to their high performance and accuracy, scalability, adaptability, and resource efficiency. However, there is a lack of a comprehensive survey focusing on the applications and advances GNNs have made in the context of IoT and Next Generation (NextG) networks. To bridge that gap, this survey starts by providing a detailed description of GNN’s terminologies, architecture, and the different types of GNNs. Then we provide a comprehensive survey of the advancements in applying GNNs for IoT from the perspective of data fusion and intrusion detection. Thereafter, we survey the impact GNNs have made in improving spectrum awareness. Next, we provide a detailed account of how GNNs have been leveraged for networking and tactical systems. Through this survey, we aim to provide a comprehensive resource for researchers to learn more about GNNs in the context of wireless networks and understand their state-of-the-art use cases, contrasting them with other machine learning approaches. Finally, we also discuss the challenges and a wide range of future research directions that further motivate the use of GNNs for IoT and NextG networks.
[403] Merge then Realign: Simple and Effective Modality-Incremental Continual Learning for Multimodal LLMs
Dingkun Zhang, Shuhan Qi, Xinyu Xiao, Kehai Chen, Xuan Wang
Main category: cs.LG
TL;DR: The paper proposes MERA, a simple Modality-incremental Continual Learning paradigm that addresses forgetting and misalignment issues in Multimodal Large Language Models, achieving nearly lossless performance when extending to new modalities.
Details
Motivation: To efficiently extend existing MLLMs to more modalities through continual learning, avoiding the heavy cost of retraining from scratch while addressing performance degradation issues.Method: MErge then ReAlign (MERA) - a simple paradigm that addresses both forgetting and misalignment between modality-agnostic and modality-specific components without heavy model budgets or architectural changes.
Result: Achieves 99.84% Backward Relative Gain when extending to four modalities, demonstrating nearly lossless MCL performance and high reusability.
Conclusion: The work identifies misalignment as a key issue in MCL and provides an effective solution that showcases how to adjust different MLLM components during continual learning.
Abstract: Recent advances in Multimodal Large Language Models (MLLMs) have enhanced their versatility as they integrate a growing number of modalities. Considering the heavy cost of training MLLMs, it is efficient to reuse the existing ones and extend them to more modalities through Modality-incremental Continual Learning (MCL). The exploration of MCL is in its early stages. In this work, we dive into the causes of performance degradation in MCL. We uncover that it suffers not only from forgetting as in traditional continual learning, but also from misalignment between the modality-agnostic and modality-specific components. To this end, we propose an elegantly simple MCL paradigm called “MErge then ReAlign” (MERA) to address both forgetting and misalignment. MERA avoids introducing heavy model budgets or modifying model architectures, hence is easy to deploy and highly reusable in the MLLM community. Extensive experiments demonstrate the impressive performance of MERA, holding an average of 99.84% Backward Relative Gain when extending to four modalities, achieving nearly lossless MCL performance. Our findings underscore the misalignment issue in MCL. More broadly, our work showcases how to adjust different components of MLLMs during continual learning.
[404] Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance
Stanisław Kaźmierczak, Jacek Mańdziuk
Main category: cs.LG
TL;DR: This paper challenges conventional wisdom by showing that using bootstrap rates (BR) greater than 1.0 (sampling more observations than the original dataset size) in random forests can significantly improve classification accuracy across 36 diverse datasets.
Details
Motivation: Previous research has generally considered sampling more than N observations (BR > 1.0) in random forests to be ineffective, but this assumption has been explored only to a limited extent.Method: The authors evaluated bootstrap rates ranging from 1.2 to 5.0 across 36 diverse datasets, analyzing how BR affects decision tree leaf structure and investigating factors influencing optimal BR.
Result: Contrary to previous findings, higher BR values led to statistically significant improvements in classification accuracy compared to standard settings (BR ≤ 1.0). The optimal BR was found to be primarily determined by dataset characteristics rather than RF hyperparameters.
Conclusion: Bootstrap rates greater than 1.0 can be beneficial for random forests, and the optimal BR depends more on dataset characteristics than on random forest hyperparameters.
Abstract: Random forests (RFs) utilize bootstrap sampling to generate individual training sets for each component tree by sampling with replacement, with the sample size typically equal to that of the original training set ($N$). Previous research indicates that drawing fewer than $N$ observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is referred to as the bootstrap rate (BR). Sampling more than $N$ observations (BR $>$ 1.0) has been explored only to a limited extent and has generally been considered ineffective. In this paper, we revisit this setup using 36 diverse datasets, evaluating BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that higher BR values can lead to statistically significant improvements in classification accuracy compared to standard settings (BR $\leq$ 1.0). Furthermore, we analyze how BR affects the leaf structure of decision trees within the RF and investigate factors influencing the optimal BR. Our results indicate that the optimal BR is primarily determined by the characteristics of the data set rather than the RF hyperparameters.
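Because common random-forest implementations typically cap the per-tree sample size at N, a bootstrap rate above 1.0 is easiest to test with a manual bagging loop. The sketch below, assuming scikit-learn's DecisionTreeClassifier and a toy dataset, illustrates the setting studied in the paper; the specific hyperparameters are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

def bagged_forest(X, y, n_trees=100, bootstrap_rate=2.0, seed=0):
    """Random-forest-style ensemble with a bootstrap rate possibly above 1.0.

    Each tree is fit on int(bootstrap_rate * N) observations drawn with
    replacement; max_features='sqrt' mimics the usual RF feature subsampling.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample_size = int(round(bootstrap_rate * n))
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=sample_size)        # sampling with replacement
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=int(rng.integers(1 << 31)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict(trees, X):
    votes = np.stack([t.predict(X) for t in trees])        # (n_trees, n_samples)
    return np.round(votes.mean(axis=0)).astype(int)        # majority vote (binary labels)

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
for br in (1.0, 2.0, 3.0):
    trees = bagged_forest(Xtr, ytr, bootstrap_rate=br)
    print(br, (predict(trees, Xte) == yte).mean())
```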
[405] Using (Not-so) Large Language Models to Generate Simulation Models in a Formal DSL: A Study on Reaction Networks
Justin N. Kreikemeyer, Miłosz Jankowski, Pia Wilsdorf, Adelinde M. Uhrmacher
Main category: cs.LG
TL;DR: Fine-tuning a small 7B-parameter Mistral LLM enables translation of natural language to formal simulation models, achieving 84.5% accuracy in recovering ground truth models while offering a self-hostable alternative to large commercial LLMs.
Details
Motivation: Natural language is the most human-accessible way to express models but not easily interpretable by computers, creating a need to bridge this gap using LLMs for formalizing natural language into simulation models.Method: Fine-tuned an open-weights 7B-parameter Mistral model using synthetic data generation to translate natural language descriptions to reaction network models in a domain-specific language.
Result: The fine-tuned Mistral model recovered ground truth simulation models in up to 84.5% of cases and demonstrated practical potential in user studies for one-time generation and interactive modeling.
Conclusion: While promising, the fine-tuned small LLM cannot yet match large LLMs, requiring higher-quality training data, but future small open-source LLMs offer new opportunities for efficient formal language translation.
Abstract: Formal languages are an integral part of modeling and simulation. They allow the distillation of knowledge into concise simulation models amenable to automatic execution, interpretation, and analysis. However, the arguably most humanly accessible means of expressing models is through natural language, which is not easily interpretable by computers. Here, we evaluate how a Large Language Model (LLM) might be used for formalizing natural language into simulation models. Existing studies only explored using very large LLMs, like the commercial GPT models, without fine-tuning model weights. To close this gap, we show how an open-weights, 7B-parameter Mistral model can be fine-tuned to translate natural language descriptions to reaction network models in a domain-specific language, offering a self-hostable, compute-efficient, and memory efficient alternative. To this end, we develop a synthetic data generator to serve as the basis for fine-tuning and evaluation. Our quantitative evaluation shows that our fine-tuned Mistral model can recover the ground truth simulation model in up to 84.5% of cases. In addition, our small-scale user study demonstrates the model’s practical potential for one-time generation as well as interactive modeling in various domains. While promising, in its current form, the fine-tuned small LLM cannot catch up with large LLMs. We conclude that higher-quality training data are required, and expect future small and open-source LLMs to offer new opportunities.
[406] Learning to Learn with Contrastive Meta-Objective
Shiguang Wu, Yaqing Wang, Yatao Bian, Quanming Yao
Main category: cs.LG
TL;DR: ConML improves meta-learning by using task identity as additional supervision through contrastive learning of model representations, enhancing performance across various meta-learners with minimal implementation cost.
Details
Motivation: Current meta-learning approaches use mini-batch episodic training which naturally provides task identity information. This can serve as additional supervision to improve generalizability, inspired by human's alignment and discrimination ability in fast learning.Method: Proposes ConML framework that contrasts model representations using task identity as supervision. It evaluates and optimizes contrastive meta-objective under a problem- and learner-agnostic meta-training framework.
Result: ConML integrates seamlessly with existing meta-learners and in-context learning models, bringing significant performance boost with small implementation cost.
Conclusion: Exploiting task identity as additional supervision through contrastive learning effectively improves meta-learning performance across various models and frameworks.
Abstract: Meta-learning enables learning systems to adapt quickly to new tasks, similar to humans. Different meta-learning approaches all work within the mini-batch episodic training framework, which naturally provides information about task identity; this can serve as additional supervision for meta-training to improve generalizability. We propose to exploit task identity as additional supervision in meta-training, inspired by the alignment and discrimination ability that is intrinsic to humans’ fast learning. This is achieved by contrasting what meta-learners learn, i.e., model representations. The proposed ConML evaluates and optimizes the contrastive meta-objective under a problem- and learner-agnostic meta-training framework. We demonstrate that ConML integrates seamlessly with existing meta-learners, as well as in-context learning models, and brings a significant boost in performance with small implementation cost.
[407] Memorization-Compression Cycles Improve Generalization
Fangyuan Yu
Main category: cs.LG
TL;DR: The paper introduces IBLM objective and GAPT algorithm that alternates between memorization and compression phases during LLM pretraining, improving generalization and reducing catastrophic forgetting.
Details
Motivation: To improve language model generalization by compressing internal representations, inspired by the biological alternation between awake learning and sleep consolidation.Method: Proposed Information Bottleneck Language Modeling (IBLM) objective and Gated Phase Transition (GAPT) algorithm that adaptively switches between memorization and compression phases during training.
Result: GAPT reduces Matrix-Based Entropy by 50%, improves cross-entropy by 4.8%, enhances OOD generalization by 35% on arithmetic tasks, and reduces interference by 97% in catastrophic forgetting scenarios.
Conclusion: The emergent memorization-compression cycle during pretraining mirrors biological learning patterns, and explicitly managing this cycle through GAPT significantly improves model generalization and robustness.
Abstract: We prove theoretically that generalization improves not only through data scaling but also by compressing internal representations. To operationalize this insight, we introduce the Information Bottleneck Language Modeling (IBLM) objective, which reframes language modeling as a constrained optimization problem: minimizing representation entropy subject to optimal prediction performance. Empirically, we observe an emergent memorization-compression cycle during LLM pretraining, evidenced by oscillating positive/negative gradient alignment between cross-entropy and Matrix-Based Entropy (MBE), a measure of representation entropy. This pattern closely mirrors the predictive-compressive trade-off prescribed by IBLM and also parallels the biological alternation between awake learning and sleep consolidation. Motivated by this observation, we propose Gated Phase Transition (GAPT), a training algorithm that adaptively switches between memorization and compression phases. When applied to GPT-2 pretraining on the FineWeb dataset, GAPT reduces MBE by 50% and improves cross-entropy by 4.8%. GAPT improves OOD generalization by 35% in a pretraining task on arithmetic multiplication. In a setting designed to simulate catastrophic forgetting, GAPT reduces interference by compressing and separating representations, achieving a 97% improvement in separation, paralleling the functional role of sleep consolidation.
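As a concrete reference point for the MBE quantity mentioned above, the numpy sketch below computes an entropy of a batch of representations from the eigenvalues of a trace-normalized Gaussian Gram matrix (the alpha-to-1 limit of matrix-based Renyi entropy). The kernel bandwidth and normalization are assumptions; the paper's precise MBE definition may differ.

```python
import numpy as np

def matrix_based_entropy(H, sigma=1.0):
    """Entropy of a batch of representations from a normalized Gram matrix.

    H     : (n, d) hidden representations for a batch of n examples
    sigma : bandwidth of the Gaussian kernel (illustrative default)

    Builds a Gaussian kernel matrix, normalizes its trace to 1, and returns the
    von Neumann entropy of its eigenvalue spectrum.
    """
    sq_dists = np.sum((H[:, None, :] - H[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2 * sigma ** 2))
    K = K / np.trace(K)                     # trace-normalized, density-like matrix
    eigvals = np.linalg.eigvalsh(K)
    eigvals = eigvals[eigvals > 1e-12]
    return float(-np.sum(eigvals * np.log(eigvals)))

# Compression lowers this entropy: tightly clustered representations score lower
# than spread-out ones.
rng = np.random.default_rng(0)
spread = rng.normal(size=(128, 32))
compressed = 0.05 * spread
print(matrix_based_entropy(spread), matrix_based_entropy(compressed))
```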
[408] TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research
Abir Harrasse, Philip Quirke, Clement Neo, Dhruv Nathawani, Luke Marks, Amir Abdullah
Main category: cs.LG
TL;DR: The paper proposes text-to-SQL generation as an ideal task for mechanistic interpretability research, bridging simple toy tasks and complex real-world models. It introduces TinySQL dataset and applies multiple interpretability techniques to analyze SQL generation circuits.
Details
Motivation: To bridge the gap between analyzing simple circuits in toy tasks and discovering features in large models by using text-to-SQL generation as an ideal task that combines formal structure with real-world complexity.Method: Created TinySQL synthetic dataset with progressive SQL complexity, trained models from 33M to 1B parameters, applied Edge Attribution Patching and Sparse Autoencoders to identify circuits, compared circuits for different SQL subskills, and conducted layerwise logit lens analysis.
Result: Identified minimal circuits and components supporting SQL generation, evaluated circuit minimality, reliability, and identifiability, and revealed how models compose SQL queries across layers from intent recognition to schema resolution to structured generation.
Conclusion: Provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting using text-to-SQL generation as an ideal testbed.
Abstract: Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset, progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including Edge Attribution Patching and Sparse Autoencoders, to identify minimal circuits and components supporting SQL generation. We compare circuits for different SQL subskills, evaluating their minimality, reliability, and identifiability. Finally, we conduct a layerwise logit lens analysis to reveal how models compose SQL queries across layers: from intent recognition to schema resolution to structured generation. Our work provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting.
[409] AdaptGrad: Adaptive Sampling to Reduce Noise
Linjiang Zhou, Chao Ma, Zepeng Wang, Libing Wu, Xiaochuan Shi
Main category: cs.LG
TL;DR: The paper proposes AdaptGrad, an adaptive gradient smoothing method that automatically determines the optimal noise variance parameter for gradient-based explanation methods, outperforming manual parameter selection in noise reduction.
Details
Motivation: SmoothGrad uses Gaussian noise to reduce noise in gradient-based explanations but requires manual setting of the variance parameter σ, which still leaves residual noise in the smoothed gradients.Method: Reinterpret SmoothGrad through convolution theory, analyze gradient noise and σ’s role from a confidence perspective, then develop AdaptGrad as an adaptive gradient smoothing method that automatically optimizes the noise variance parameter.
Result: Comprehensive experiments show AdaptGrad effectively reduces almost all noise in vanilla gradients compared to baseline methods, with both qualitative and quantitative improvements.
Conclusion: AdaptGrad is a simple, universal method that enhances gradient-based interpretability methods for better visualization by adaptively determining the optimal smoothing parameters.
Abstract: Gradient Smoothing is an efficient approach to reducing noise in gradient-based model explanation methods. SmoothGrad adds Gaussian noise to mitigate much of this noise. However, the crucial hyper-parameter in this method, the variance $\sigma$ of the Gaussian noise, is set manually or with a heuristic approach, which leaves the smoothed gradients still containing a certain amount of noise. In this paper, we aim to interpret SmoothGrad as a corollary of convolution, thereby re-understanding the gradient noise and the role of $\sigma$ from the perspective of the confidence level. Furthermore, we propose an adaptive gradient smoothing method, AdaptGrad, based on these insights. Through comprehensive experiments, both qualitative and quantitative results demonstrate that AdaptGrad can effectively reduce almost all the noise in vanilla gradients compared with baseline methods. AdaptGrad is simple and universal, making it applicable for enhancing gradient-based interpretability methods for better visualization.
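For context, the baseline procedure that AdaptGrad improves on is plain SmoothGrad: average input gradients over several Gaussian-perturbed copies of the input. A minimal PyTorch sketch follows; the fixed `sigma` here is exactly the manually chosen parameter the paper replaces with an adaptive choice, and the toy model is hypothetical.

```python
import torch

def smoothgrad(model, x, target_class, sigma=0.15, n_samples=25):
    """Vanilla SmoothGrad: average input gradients over Gaussian-perturbed copies.

    model        : a differentiable classifier mapping (B, ...) -> (B, C) logits
    x            : a single input of shape (1, ...)
    target_class : index of the class whose score is explained
    sigma        : noise std, set manually here (AdaptGrad chooses it adaptively)
    """
    grads = torch.zeros_like(x)
    for _ in range(n_samples):
        noisy = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        score = model(noisy)[0, target_class]
        score.backward()
        grads += noisy.grad
    return grads / n_samples

# Toy usage with a small random MLP standing in for the explained model.
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(),
                            torch.nn.Linear(32, 3))
x = torch.randn(1, 10)
saliency = smoothgrad(model, x, target_class=1)
print(saliency.shape)
```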
[410] InfiFPO: Implicit Model Fusion via Preference Optimization in Large Language Models
Yanggan Gu, Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Fei Wu, Hongxia Yang
Main category: cs.LG
TL;DR: InfiFPO is a preference optimization method for implicit model fusion that addresses limitations in existing preference alignment approaches by preserving probability information from source models and using sequence-level fusion strategies.
Details
Motivation: Existing model fusion methods focus mainly on supervised fine-tuning and neglect preference alignment, with current PA fusion methods discarding valuable probability information from source models, limiting their effectiveness.Method: InfiFPO replaces the reference model in DPO with a fused source model that synthesizes multi-source probabilities at sequence level, using probability clipping and max-margin fusion strategies to avoid vocabulary alignment challenges.
Result: Comprehensive experiments on 11 benchmarks show InfiFPO consistently outperforms existing methods, improving Phi-4’s average performance from 79.95 to 83.33, with significant gains in mathematics, coding, and reasoning tasks.
Conclusion: InfiFPO effectively addresses the gap in preference alignment for model fusion by preserving probability information and enabling better knowledge distillation from source models, leading to substantial performance improvements across diverse tasks.
Abstract: Model fusion combines multiple Large Language Models (LLMs) with different strengths into a more powerful, integrated model through lightweight training methods. Existing works on model fusion focus primarily on supervised fine-tuning (SFT), leaving preference alignment (PA) –a critical phase for enhancing LLM performance–largely unexplored. The few existing fusion methods for the PA phase, like WRPO, simplify the process by utilizing only response outputs from source models while discarding their probability information. To address this limitation, we propose InfiFPO, a preference optimization method for implicit model fusion. InfiFPO replaces the reference model in Direct Preference Optimization (DPO) with a fused source model that synthesizes multi-source probabilities at the sequence level, circumventing complex vocabulary alignment challenges in previous works while maintaining the probability information. By introducing probability clipping and max-margin fusion strategies, InfiFPO enables the pivot model to align with human preferences while effectively distilling knowledge from source models. Comprehensive experiments on 11 widely-used benchmarks demonstrate that InfiFPO consistently outperforms existing model fusion and preference optimization methods. When using Phi-4 as the pivot model, InfiFPO improves its average performance from 79.95 to 83.33 on 11 benchmarks, significantly improving its capabilities in mathematics, coding, and reasoning tasks.
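The sketch below illustrates the general shape of a DPO-style loss whose reference log-probabilities come from a fused source model. Taking the maximum over source-model sequence log-probs and symmetrically clipping them are assumptions standing in for the paper's max-margin fusion and probability clipping; all numbers are toy values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fused_dpo_loss(pi_chosen, pi_rejected, src_chosen, src_rejected,
                   beta=0.1, clip=50.0):
    """DPO-style preference loss with a fused multi-source reference model.

    pi_chosen, pi_rejected   : sequence log-probs of the pivot (policy) model
    src_chosen, src_rejected : (n_sources,) sequence log-probs from source models
    The max-over-sources fusion and the clipping below are illustrative stand-ins
    for the paper's max-margin fusion and probability clipping strategies.
    """
    ref_chosen = np.clip(np.max(src_chosen), -clip, clip)
    ref_rejected = np.clip(np.max(src_rejected), -clip, clip)
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return float(-np.log(sigmoid(margin)))

# Toy usage: three source models scoring one preference pair.
print(fused_dpo_loss(pi_chosen=-12.0, pi_rejected=-20.0,
                     src_chosen=np.array([-15.0, -13.5, -14.2]),
                     src_rejected=np.array([-16.0, -18.0, -17.1])))
```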
[411] Provably Efficient Reward Transfer in Reinforcement Learning with Discrete Markov Decision Processes
Kevin Vora, Yu Zhang
Main category: cs.LG
TL;DR: Q-Manipulation (Q-M) is a new method for reward adaptation in RL that uses Q-function bounds and iterative tightening to enable action pruning before learning starts, improving efficiency when adapting to new reward functions.
Details
Motivation: Learning target behaviors from scratch in reinforcement learning is inefficient when source behaviors already exist under the same domain dynamics but different reward functions.Method: Manipulate Q-functions by computing bounds and using an iterative process (similar to value iteration) to tighten these bounds, enabling action pruning before learning begins. Assumes target reward is a known function of source rewards and uses a lite-model.
Result: Q-M is proven to not affect optimality of returned policy in discrete domains and is provably efficient in sample complexity. Evaluations in synthetic and simulation domains show effectiveness, generalizability, and practicality.
Conclusion: Q-Manipulation provides an efficient approach to reward adaptation by leveraging existing source behaviors and Q-function manipulation, enabling action pruning and improved sample efficiency.
Abstract: In this paper, we propose a new solution to reward adaptation (RA) in reinforcement learning, where the agent adapts to a target reward function based on one or more existing source behaviors learned a priori under the same domain dynamics but different reward functions. While learning the target behavior from scratch is possible, it is often inefficient given the available source behaviors. Our work introduces a new approach to RA through the manipulation of Q-functions. Assuming the target reward function is a known function of the source reward functions, we compute bounds on the Q-function and present an iterative process (akin to value iteration) to tighten these bounds. Such bounds enable action pruning in the target domain before learning even starts. We refer to this method as “Q-Manipulation” (Q-M). The iteration process assumes access to a lite-model, which is easy to provide or learn. We formally prove that Q-M, under discrete domains, does not affect the optimality of the returned policy and show that it is provably efficient in terms of sample complexity in a probabilistic sense. Q-M is evaluated in a variety of synthetic and simulation domains to demonstrate its effectiveness, generalizability, and practicality.
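A generic numpy sketch of the bound-tightening-then-pruning idea: run value-iteration-style updates on upper and lower Q-bounds, then prune any action whose optimistic value falls below the best pessimistic value in its state. In Q-Manipulation the reward bounds come from the known relation between target and source rewards; here they are simply given as inputs, so this is an illustration of the mechanism rather than the paper's construction.

```python
import numpy as np

def prune_actions(P, r_lo, r_hi, gamma=0.95, n_iter=200):
    """Tighten Q-value bounds by value-iteration-style updates, then prune actions.

    P          : (S, A, S) transition probabilities of the (lite) model
    r_lo, r_hi : (S, A) lower / upper bounds on the target reward
    Returns a boolean (S, A) mask of actions that can be pruned before learning:
    an action is pruned if its upper bound falls below the best lower bound in
    that state (the optimal action is never pruned under a strict inequality).
    """
    S, A, _ = P.shape
    q_lo, q_hi = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(n_iter):
        v_lo, v_hi = q_lo.max(axis=1), q_hi.max(axis=1)
        q_lo = r_lo + gamma * P @ v_lo
        q_hi = r_hi + gamma * P @ v_hi
    return q_hi < q_lo.max(axis=1, keepdims=True)

# Toy 3-state, 2-action MDP with a loose reward interval around a true reward.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(3), size=(3, 2))     # (S, A, S), rows sum to 1
r = rng.uniform(0, 1, size=(3, 2))
print(prune_actions(P, r - 0.05, r + 0.05))
```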
[412] Deep Linear Probe Generators for Weight Space Learning
Jonathan Kahana, Eliahu Horwitz, Imri Shuval, Yedid Hoshen
Main category: cs.LG
TL;DR: ProbeGen introduces a deep linear generator module to improve probing approaches for weight space learning, achieving state-of-the-art performance with significantly reduced computational cost.
Details
Motivation: Current weight space learning methods face challenges with high-dimensional weights and permutation symmetries, while existing probing strategies are ineffective despite showing initial promise.Method: ProbeGen adds a shared generator module with deep linear architecture to standard probing approaches, providing inductive bias towards structured probes to reduce overfitting.
Result: ProbeGen significantly outperforms state-of-the-art methods while being very efficient, requiring 30 to 1000 times fewer FLOPs than other top approaches.
Conclusion: The proposed ProbeGen method demonstrates that simple modifications to probing approaches can yield substantial improvements in both performance and efficiency for weight space learning tasks.
Abstract: Weight space learning aims to extract information about a neural network, such as its training dataset or generalization error. Recent approaches learn directly from model weights, but this presents many challenges as weights are high-dimensional and include permutation symmetries between neurons. An alternative approach, Probing, represents a model by passing a set of learned inputs (probes) through the model, and training a predictor on top of the corresponding outputs. Although probing is typically not used as a stand alone approach, our preliminary experiment found that a vanilla probing baseline worked surprisingly well. However, we discover that current probe learning strategies are ineffective. We therefore propose Deep Linear Probe Generators (ProbeGen), a simple and effective modification to probing approaches. ProbeGen adds a shared generator module with a deep linear architecture, providing an inductive bias towards structured probes thus reducing overfitting. While simple, ProbeGen performs significantly better than the state-of-the-art and is very efficient, requiring between 30 to 1000 times fewer FLOPs than other top approaches.
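A minimal PyTorch sketch of the probing setup with a deep linear generator: a stack of Linear layers without nonlinearities (still a linear map overall, but a factorization that biases toward structured probes) produces the probes, which are passed through the frozen analyzed network, and a small predictor reads the concatenated responses. Layer sizes and the predictor head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProbeGenSketch(nn.Module):
    """Deep linear probe generator + predictor (a minimal sketch of the idea)."""

    def __init__(self, n_probes=16, probe_dim=10, out_dim=3, n_classes=2):
        super().__init__()
        self.latent = nn.Parameter(torch.randn(n_probes, 32))
        self.generator = nn.Sequential(               # deep *linear* generator
            nn.Linear(32, 64), nn.Linear(64, 64), nn.Linear(64, probe_dim))
        self.predictor = nn.Sequential(
            nn.Linear(n_probes * out_dim, 64), nn.ReLU(), nn.Linear(64, n_classes))

    def forward(self, analyzed_model):
        probes = self.generator(self.latent)          # (n_probes, probe_dim)
        # The analyzed model's parameters are frozen, but gradients still flow
        # back to the probes, so the generator trains end to end.
        responses = analyzed_model(probes)            # (n_probes, out_dim)
        return self.predictor(responses.flatten().unsqueeze(0))

# Toy usage: predict a property of a small frozen MLP from its probe responses.
analyzed = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))
for p in analyzed.parameters():
    p.requires_grad_(False)
probegen = ProbeGenSketch()
print(probegen(analyzed).shape)                       # (1, n_classes)
```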
[413] MLR-Bench: Evaluating AI Agents on Open-Ended Machine Learning Research
Hui Chen, Miao Xiong, Yujie Lu, Wei Han, Ailin Deng, Yufei He, Jiaying Wu, Yibo Li, Yue Liu, Bryan Hooi
Main category: cs.LG
TL;DR: MLR-Bench is a comprehensive benchmark for evaluating AI agents on open-ended machine learning research, featuring 201 tasks, automated evaluation framework, and modular agent scaffold.
Details
Motivation: To address the growing potential of AI agents in scientific discovery and provide a systematic way to evaluate their research capabilities, particularly in machine learning.Method: Developed MLR-Bench with three components: 201 research tasks from NeurIPS/ICLR/ICML workshops, MLR-Judge automated evaluation framework using LLM-based reviewers with review rubrics, and MLR-Agent modular scaffold for completing research through four stages (idea generation, proposal formulation, experimentation, paper writing).
Result: Evaluation of six frontier LLMs and an advanced coding agent showed LLMs are effective at generating coherent ideas and structured papers, but coding agents frequently (80% cases) produce fabricated/invalidated experimental results. MLR-Judge achieved high agreement with human expert reviewers.
Conclusion: MLR-Bench provides a scalable tool for benchmarking AI research agents, revealing current limitations in experimental reliability while supporting trustworthy scientific discovery. The framework is open-sourced for community use.
Abstract: Recent advancements in AI agents have demonstrated their growing potential to drive and support scientific discovery. In this work, we introduce MLR-Bench, a comprehensive benchmark for evaluating AI agents on open-ended machine learning research. MLR-Bench includes three key components: (1) 201 research tasks sourced from NeurIPS, ICLR, and ICML workshops covering diverse ML topics; (2) MLR-Judge, an automated evaluation framework combining LLM-based reviewers with carefully designed review rubrics to assess research quality; and (3) MLR-Agent, a modular agent scaffold capable of completing research tasks through four stages: idea generation, proposal formulation, experimentation, and paper writing. Our framework supports both stepwise assessment across these distinct research stages, and end-to-end evaluation of the final research paper. We then use MLR-Bench to evaluate six frontier LLMs and an advanced coding agent, finding that while LLMs are effective at generating coherent ideas and well-structured papers, current coding agents frequently (e.g., in 80% of the cases) produce fabricated or invalidated experimental results–posing a major barrier to scientific reliability. We validate MLR-Judge through human evaluation, showing high agreement with expert reviewers, supporting its potential as a scalable tool for research evaluation. We open-source MLR-Bench to help the community benchmark, diagnose, and improve AI research agents toward trustworthy and transparent scientific discovery.
[414] Merging Embedded Topics with Optimal Transport for Online Topic Modeling on Data Streams
Federica Granese, Benjamin Navet, Serena Villata, Charles Bouveyron
Main category: cs.LG
TL;DR: StreamETM is a novel online topic modeling approach that extends the Embedded Topic Model (ETM) to handle streaming text data by merging models from consecutive document batches using unbalanced optimal transport and detecting topic shifts over time.
Details
Motivation: The rapid growth of social media generates massive volumes of streaming text data that require efficient online topic modeling methods to handle continuous data streams arriving over time.Method: Extends Embedded Topic Model (ETM) to streaming data by merging models from consecutive document batches using unbalanced optimal transport, and employs online change point detection to identify topic shifts over time.
Result: Numerical experiments on simulated and real-world data demonstrate that StreamETM outperforms competing methods.
Conclusion: StreamETM provides an effective solution for online topic modeling in streaming text data, with publicly available code for implementation.
Abstract: Topic modeling is a key component in unsupervised learning, employed to identify topics within a corpus of textual data. The rapid growth of social media generates an ever-growing volume of textual data daily, making online topic modeling methods essential for managing these data streams that continuously arrive over time. This paper introduces a novel approach to online topic modeling named StreamETM. This approach builds on the Embedded Topic Model (ETM) to handle data streams by merging models learned on consecutive partial document batches using unbalanced optimal transport. Additionally, an online change point detection algorithm is employed to identify shifts in topics over time, enabling the identification of significant changes in the dynamics of text streams. Numerical experiments on simulated and real-world data show StreamETM outperforming competitors. We provide the code publicly available at https://github.com/fgranese/StreamETM.
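A rough sketch of matching topics across consecutive batches with optimal transport. The paper uses unbalanced optimal transport; the sketch below uses plain balanced entropic Sinkhorn on a cosine-distance cost and a simple averaging merge rule, both of which are simplifying assumptions.

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.05, n_iter=500):
    """Entropic-regularized optimal transport plan between histograms a and b
    with cost matrix M (balanced Sinkhorn; the paper uses an unbalanced variant)."""
    K = np.exp(-M / reg)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def merge_topics(T_old, T_new, reg=0.05):
    """Soft-match topic embeddings of a new batch to the running model and merge
    matched pairs by averaging (an illustrative merge rule)."""
    A = T_old / np.linalg.norm(T_old, axis=1, keepdims=True)
    B = T_new / np.linalg.norm(T_new, axis=1, keepdims=True)
    M = 1.0 - A @ B.T                              # cosine-distance cost
    a = np.full(len(T_old), 1.0 / len(T_old))
    b = np.full(len(T_new), 1.0 / len(T_new))
    plan = sinkhorn(a, b, M, reg)
    match = plan.argmax(axis=1)                    # best new topic for each old topic
    return (T_old + T_new[match]) / 2.0

rng = np.random.default_rng(0)
T_old = rng.normal(size=(5, 16))                   # 5 topics in a 16-dim embedding space
T_new = T_old[rng.permutation(5)] + 0.1 * rng.normal(size=(5, 16))
print(merge_topics(T_old, T_new).shape)
```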
[415] DNN Modularization via Activation-Driven Training
Tuan Ngo, Abid Hassan, Saad Shafiq, Nenad Medvidovic
Main category: cs.LG
TL;DR: MODA is an activation-driven modular training approach that decomposes DNNs into modules by regulating activation outputs based on three objectives: intra-class affinity, inter-class dispersion, and compactness, achieving faster training, fewer weights, and better accuracy preservation.
Details
Motivation: Deep Neural Networks accrue technical debt and have high retraining costs when adapting to evolving requirements. Existing modularization techniques suffer from weight overlaps, accuracy losses, limited focus on convolutional layers, and added complexity.Method: MODA promotes inherent modularity by directly regulating activation outputs of layers based on three modular objectives: intra-class affinity (similar activations within same class), inter-class dispersion (different activations across classes), and compactness (sparse activations).
Result: MODA achieves 22% less training time, modules with up to 24x fewer weights and 37x less weight overlap, preserves original model accuracy without fine-tuning, and improves target class accuracy by 12% in module replacement scenarios with minimal impact on other classes.
Conclusion: MODA provides an effective activation-driven modular training approach that addresses limitations of existing methods, offering faster training, more compact modules with less overlap, and better accuracy preservation for DNN modularization.
Abstract: Deep Neural Networks (DNNs) tend to accrue technical debt and suffer from significant retraining costs when adapting to evolving requirements. Modularizing DNNs offers the promise of improving their reusability. Previous work has proposed techniques to decompose DNN models into modules both during and after training. However, these strategies yield several shortcomings, including significant weight overlaps and accuracy losses across modules, restricted focus on convolutional layers only, and added complexity and training time by introducing auxiliary masks to control modularity. In this work, we propose MODA, an activation-driven modular training approach. MODA promotes inherent modularity within a DNN model by directly regulating the activation outputs of its layers based on three modular objectives: intra-class affinity, inter-class dispersion, and compactness. MODA is evaluated using three well-known DNN models and five datasets with varying sizes. This evaluation indicates that, compared to the existing state-of-the-art, using MODA yields several advantages: (1) MODA accomplishes modularization with 22% less training time; (2) the resultant modules generated by MODA comprise up to 24x fewer weights and 37x less weight overlap while (3) preserving the original model’s accuracy without additional fine-tuning; in module replacement scenarios, (4) MODA improves the accuracy of a target class by 12% on average while ensuring minimal impact on the accuracy of other classes.
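The three modular objectives can be written as a single activation regularizer, sketched below in PyTorch: intra-class affinity (same-class activations aligned), inter-class dispersion (different-class activations pushed apart), and compactness (sparse activations). The cosine-similarity formulation and the weights are illustrative, not MODA's exact terms.

```python
import torch
import torch.nn.functional as F

def moda_style_loss(acts, labels, w_affinity=1.0, w_dispersion=1.0, w_compact=0.01):
    """Regulate a layer's activations toward modularity (illustrative weights).

    acts   : (n, d) activation outputs of a layer for a batch
    labels : (n,) class labels
    """
    a = F.normalize(acts, dim=1)
    sim = a @ a.t()                                    # (n, n) cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # (n, n) same-class mask
    off_diag = ~torch.eye(len(labels), dtype=torch.bool)

    intra = sim[same & off_diag].mean()                # want high: same class aligned
    inter = sim[~same].mean()                          # want low: classes dispersed
    compact = acts.abs().mean()                        # want low: sparse activations

    return -w_affinity * intra + w_dispersion * inter + w_compact * compact

# Toy usage on random activations for a 5-class batch.
acts = torch.randn(64, 128, requires_grad=True)
labels = torch.randint(0, 5, (64,))
loss = moda_style_loss(acts, labels)
loss.backward()
print(loss.item())
```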
[416] Unlearned but Not Forgotten: Data Extraction after Exact Unlearning in LLM
Xiaoyu Wu, Yifei Pang, Terrance Liu, Zhiwei Steven Wu
Main category: cs.LG
TL;DR: Exact unlearning, considered the gold standard for privacy protection, may paradoxically increase privacy risks when both pre- and post-unlearning models are accessible, enabling novel data extraction attacks.
Details
Motivation: To challenge the assumption that exact unlearning effectively mitigates privacy risks in practical deployment scenarios where both pre- and post-unlearning models are available.Method: A novel data extraction attack that uses signals from the pre-unlearning model to guide the post-unlearning model, combined with a token filtering strategy, to uncover patterns reflecting removed data distribution.
Result: The attack significantly improves extraction success rates (doubling performance in some cases) across benchmarks like MUSE, TOFU, and WMDP, and demonstrates effectiveness on a simulated medical diagnosis dataset.
Conclusion: Unlearning may increase privacy leakage risks in real-world deployments, requiring broader threat models that consider adversarial access to both pre- and post-unlearning models.
Abstract: Large Language Models are typically trained on datasets collected from the web, which may inadvertently contain harmful or sensitive personal information. To address growing privacy concerns, unlearning methods have been proposed to remove the influence of specific data from trained models. Of these, exact unlearning – which retrains the model from scratch without the target data – is widely regarded the gold standard for mitigating privacy risks in deployment. In this paper, we revisit this assumption in a practical deployment setting where both the pre- and post-unlearning logits API are exposed, such as in open-weight scenarios. Targeting this setting, we introduce a novel data extraction attack that leverages signals from the pre-unlearning model to guide the post-unlearning model, uncovering patterns that reflect the removed data distribution. Combining model guidance with a token filtering strategy, our attack significantly improves extraction success rates – doubling performance in some cases – across common benchmarks such as MUSE, TOFU, and WMDP. Furthermore, we demonstrate our attack’s effectiveness on a simulated medical diagnosis dataset to highlight real-world privacy risks associated with exact unlearning. In light of our findings, which suggest that unlearning may, in a contradictory way, increase the risk of privacy leakage during real-world deployments, we advocate for evaluation of unlearning methods to consider broader threat models that account not only for post-unlearning models but also for adversarial access to prior checkpoints. Code is publicly available at: https://github.com/Nicholas0228/unlearned_data_extraction_llm.
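To make the attack setting concrete, here is a hedged sketch of one guided decoding step that contrasts pre- and post-unlearning logits and filters tokens; the interpolation rule, `alpha`, and the HuggingFace-style model interface are assumptions rather than the authors' released attack.

```python
import torch

@torch.no_grad()
def guided_decode_step(pre_model, post_model, input_ids, alpha=1.0, top_k=50):
    """One sampling step guided by the pre-unlearning model (illustrative only).

    Assumes HuggingFace-style causal LMs whose forward pass returns .logits.
    """
    logits_pre = pre_model(input_ids).logits[:, -1, :]
    logits_post = post_model(input_ids).logits[:, -1, :]
    # assumed guidance rule: push the post-unlearning model toward tokens the
    # pre-unlearning model still prefers
    guided = logits_post + alpha * (logits_pre - logits_post)
    # token filtering: keep only the top-k candidates under the guided scores
    vals, idx = guided.topk(top_k, dim=-1)
    next_token = idx.gather(-1, torch.multinomial(torch.softmax(vals, -1), 1))
    return torch.cat([input_ids, next_token], dim=-1)
```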
[417] A Multi-Task Foundation Model for Wireless Channel Representation Using Contrastive and Masked Autoencoder Learning
Berkay Guler, Giovanni Geraci, Hamid Jafarkhani
Main category: cs.LG
TL;DR: ContraWiMAE is a transformer-based foundation model that unifies masked reconstruction and masked contrastive learning for wireless channel representation, using wireless characteristics as natural augmentation.
Details
Motivation: Current self-supervised learning approaches for wireless channels borrow from text/image processing without addressing unique wireless constraints like noise, fading, and partial observability.
Method: Introduces wireless-inspired contrastive objective that exploits noise, fading, and partial observability as natural augmentation, combining masked reconstruction with masked contrastive learning.
Result: Demonstrates effectiveness in cross-frequency beam selection, line-of-sight detection, and channel estimation with superior linear separability and adaptability in diverse wireless environments.
Conclusion: ContraWiMAE shows exceptional data efficiency and competitive performance compared to supervised baselines, establishing a powerful baseline for self-supervised wireless channel representation learning.
Abstract: Current applications of self-supervised learning to wireless channel representation often borrow paradigms developed for text and image processing, without fully addressing the unique characteristics and constraints of wireless communications. To bridge this gap, we introduce ContraWiMAE, Wireless Contrastive Masked Autoencoder, a transformer-based foundation model that unifies masked reconstruction and masked contrastive learning for wireless channel representation. Our key innovation is a new wireless-inspired contrastive objective that exploits the inherent characteristics of wireless environment, including noise, fading, and partial observability, as natural augmentation. Through extensive evaluation on unseen scenarios and conditions, we demonstrate our method’s effectiveness in multiple downstream tasks, including cross-frequency beam selection, line-of-sight detection, and channel estimation. ContraWiMAE exhibits superior linear separability and adaptability in diverse wireless environments, demonstrating exceptional data efficiency and competitive performance compared with supervised baselines under challenging conditions. Comparative evaluations against a state-of-the-art wireless channel foundation model confirm the superior performance and data efficiency of our approach, highlighting its potential as a powerful baseline for future research in self-supervised wireless channel representation learning. To foster further work in this direction, we release the model weights and training pipeline for ContraWiMAE.
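A rough sketch of how a masked-reconstruction term and a contrastive term over two "natural" wireless augmentations (e.g., different noise or fading realizations of the same channel) could be combined; the masking scheme, projection embeddings, and loss weights here are assumptions.

```python
import torch
import torch.nn.functional as F

def wireless_mae_contrastive_loss(recon, target, mask, z_a, z_b, tau=0.1, w_con=1.0):
    """recon/target: (B, T, D) reconstructed vs. true channel patches;
    mask: (B, T) with 1 on masked patches; z_a, z_b: (B, P) embeddings of two
    naturally augmented views of the same channel (illustrative objective)."""
    rec = ((recon - target) ** 2).mean(dim=-1)
    rec = (rec * mask).sum() / mask.sum().clamp(min=1.0)   # loss only on masked patches

    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / tau                           # positives on the diagonal
    labels = torch.arange(z_a.size(0), device=z_a.device)
    con = F.cross_entropy(logits, labels)                  # InfoNCE-style term
    return rec + w_con * con
```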
[418] Learn More by Using Less: Distributed Learning with Energy-Constrained Devices
Roberto Pereira, Cristian J. Vaca-Rubio, Luis Blanco
Main category: cs.LG
TL;DR: LeanFed is an energy-aware federated learning framework that optimizes client selection and training workloads on battery-constrained devices by dynamically adjusting local data usage to prevent battery depletion and improve participation.
Details
Motivation: Federated Learning faces challenges in real-world deployments due to system heterogeneity and energy limitations of devices, which reduce model accuracy and increase dropout rates, impacting convergence.
Method: LeanFed uses adaptive data usage by dynamically adjusting the fraction of local data each device utilizes during training, ensuring devices don’t run out of battery while maximizing participation across communication rounds.
Result: Evaluation on CIFAR-10 and CIFAR-100 datasets shows LeanFed consistently enhances model accuracy and stability, particularly in high data heterogeneity and limited battery scenarios, by mitigating client dropout and extending device availability.
Conclusion: LeanFed demonstrates the potential for energy-efficient, privacy-preserving FL in real-world applications, providing a foundation for robust and sustainable pervasive AI on resource-constrained networks.
Abstract: Federated Learning (FL) has emerged as a solution for distributed model training across decentralized, privacy-preserving devices, but the different energy capacities of participating devices (system heterogeneity) constrain real-world implementations. These energy limitations not only reduce model accuracy but also increase dropout rates, impacting on convergence in practical FL deployments. In this work, we propose LeanFed, an energy-aware FL framework designed to optimize client selection and training workloads on battery-constrained devices. LeanFed leverages adaptive data usage by dynamically adjusting the fraction of local data each device utilizes during training, thereby maximizing device participation across communication rounds while ensuring they do not run out of battery during the process. We rigorously evaluate LeanFed against traditional FedAvg on CIFAR-10 and CIFAR-100 datasets, simulating various levels of data heterogeneity and device participation rates. Results show that LeanFed consistently enhances model accuracy and stability, particularly in settings with high data heterogeneity and limited battery life, by mitigating client dropout and extending device availability. This approach demonstrates the potential of energy-efficient, privacy-preserving FL in real-world, large-scale applications, setting a foundation for robust and sustainable pervasive AI on resource-constrained networks.
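The adaptive data usage idea can be pictured with a small helper that caps each client's training set by its remaining energy budget; the energy model and the minimum-fraction rule below are illustrative assumptions, not LeanFed's exact allocation.

```python
def local_data_fraction(battery_joules, joules_per_sample, n_local, min_frac=0.05):
    """Return the fraction of local data a client should train on this round so
    that it does not exhaust its battery (illustrative energy model)."""
    affordable = battery_joules / max(joules_per_sample, 1e-9)
    if affordable < min_frac * n_local:
        return 0.0                       # sit this round out instead of dropping mid-round
    return min(1.0, affordable / max(n_local, 1))
```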
[419] GUARD: Guided Unlearning and Retention via Data Attribution for Large Language Models
Peizhi Niu, Evelyn Ma, Huiting Zhou, Duo Zhou, Huan Zhang, S. Rasoul Etesami, Olgica Milenkovic
Main category: cs.LG
TL;DR: GUARD is a novel framework for LLM unlearning that uses data attribution to mitigate unintended forgetting while maintaining model utility.
Details
Motivation: Address unintended forgetting in LLM unlearning where removing specific data impairs model utility and retention of desired information, especially when dealing with high-impact data.
Method: Proposes GUARD framework with lightweight proxy data attribution metric to quantify alignment between Forget and Retain sets, and adaptive unlearning weights inversely proportional to attribution scores.
Result: Reduces utility sacrifice by up to 194.92% in Truth Ratio on TOFU benchmark and improves knowledge retention by 16.20% on MUSE NEWS benchmark, with minimal privacy loss increase.
Conclusion: GUARD effectively improves retention while maintaining comparable forgetting performance, addressing key limitations of existing LLM unlearning methods.
Abstract: Unlearning in large language models is becoming increasingly important due to regulatory compliance, copyright protection, and privacy concerns. However, a key challenge in LLM unlearning is unintended forgetting, where the removal of specific data inadvertently impairs the utility of the model and its retention of valuable, desired information. While prior work has primarily focused on architectural innovations, the influence of data-level factors on unlearning performance remains underexplored. As a result, existing methods often suffer from degraded retention when forgetting high-impact data. To address this problem, we propose GUARD, a novel framework for Guided Unlearning And Retention via Data attribution. At its core, GUARD introduces a lightweight proxy data attribution metric tailored for LLM unlearning, which quantifies the alignment between the Forget and Retain sets while remaining computationally efficient. Building on this, we design a novel unlearning objective that assigns adaptive, nonuniform unlearning weights to samples, inversely proportional to their proxy attribution scores. Through such a reallocation of unlearning power, GUARD mitigates unintended retention loss. We also provide rigorous theoretical guarantees that GUARD significantly improves retention while maintaining forgetting metrics comparable to prior methods. Extensive experiments on the TOFU and MUSE benchmarks across multiple LLM architectures demonstrate that GUARD reduces utility sacrifice on the TOFU Retain Set by up to 194.92 percent in terms of Truth Ratio when forgetting 10 percent of the training data, and improves knowledge retention on the MUSE NEWS Retain Set by 16.20 percent, with comparable or very moderate increases in privacy loss compared to state-of-the-art methods.
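The reweighting step can be summarized in a few lines: samples whose proxy attribution indicates strong alignment with the Retain set receive less unlearning power. The inverse-proportional form follows the summary above; the normalization and constants are assumptions.

```python
import torch

def adaptive_unlearning_weights(attribution_scores, eps=1e-6):
    """Per-sample unlearning weights inversely proportional to proxy attribution
    scores, rescaled so the weights average to 1 (illustrative sketch)."""
    scores = torch.as_tensor(attribution_scores, dtype=torch.float32)
    inv = 1.0 / (scores + eps)
    return inv * len(inv) / inv.sum()
```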
[420] Concept-Guided Interpretability via Neural Chunking
Shuchen Wu, Stephan Alaniz, Shyamgopal Karthik, Peter Dayan, Eric Schulz, Zeynep Akata
Main category: cs.LG
TL;DR: The paper challenges the ‘black box’ view of neural networks by proposing the Reflection Hypothesis, which states that neural activity patterns mirror training data regularities. It introduces three chunking methods (DSC, PA, UCD) to extract interpretable concept-encoding entities from neural dynamics, showing these chunks causally affect network behavior.
Details
Motivation: To challenge the prevailing view of neural networks as black boxes and demonstrate that their internal activity patterns reflect regularities in training data, enabling interpretability through cognitive chunking principles.
Method: Proposed three complementary chunking methods: Discrete Sequence Chunking (DSC) learns entity dictionaries in lower-dimensional space; Population Averaging (PA) extracts labeled entities; Unsupervised Chunk Discovery (UCD) works without labels. Methods extract concept-encoding entities from neural population dynamics.
Result: Successfully extracted concept-encoding entities (concrete words, abstract POS tags, structural narrative schema) agnostic to model architectures. Demonstrated causal role of chunks through grafting experiments that produced controlled, predictable behavior changes in models.
Conclusion: The work provides a new interpretability direction that combines cognitive principles with naturalistic data structure to reveal hidden computations in neural networks, transforming them from black boxes to understandable systems.
Abstract: Neural networks are often described as black boxes, reflecting the significant challenge of understanding their internal workings and interactions. We propose a different perspective that challenges the prevailing view: rather than being inscrutable, neural networks exhibit patterns in their raw population activity that mirror regularities in the training data. We refer to this as the Reflection Hypothesis and provide evidence for this phenomenon in both simple recurrent neural networks (RNNs) and complex large language models (LLMs). Building on this insight, we propose to leverage our cognitive tendency of chunking to segment high-dimensional neural population dynamics into interpretable units that reflect underlying concepts. We propose three methods to extract recurring chunks on a neural population level, complementing each other based on label availability and neural data dimensionality. Discrete sequence chunking (DSC) learns a dictionary of entities in a lower-dimensional neural space; population averaging (PA) extracts recurring entities that correspond to known labels; and unsupervised chunk discovery (UCD) can be used when labels are absent. We demonstrate the effectiveness of these methods in extracting concept-encoding entities agnostic to model architectures. These concepts can be both concrete (words), abstract (POS tags), or structural (narrative schema). Additionally, we show that extracted chunks play a causal role in network behavior, as grafting them leads to controlled and predictable changes in the model’s behavior. Our work points to a new direction for interpretability, one that harnesses both cognitive principles and the structure of naturalistic data to reveal the hidden computations of complex learning systems, gradually transforming them from black boxes into systems we can begin to understand.
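Of the three methods, Population Averaging is the simplest to sketch: average the population activity over all occurrences of a known label to obtain a chunk template. The array layout and the use of a plain mean are assumptions for illustration.

```python
import numpy as np

def population_averaging(activations, labels):
    """activations: (timesteps, units) neural population activity;
    labels: (timesteps,) known label per timestep.
    Returns one averaged 'chunk' template per label (illustrative sketch)."""
    labels = np.asarray(labels)
    return {lab: activations[labels == lab].mean(axis=0) for lab in np.unique(labels)}
```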
[421] Joint Hierarchical Representation Learning of Samples and Features via Informed Tree-Wasserstein Distance
Ya-Wei Eileen Lin, Ronald R. Coifman, Gal Mishne, Ronen Talmon
Main category: cs.LG
TL;DR: Proposes an unsupervised method for jointly learning hierarchical representations of samples and features using Tree-Wasserstein Distance, with alternating optimization between data modes.
Details
Motivation: High-dimensional data often has hierarchical structures in both samples and features, but existing methods only consider one mode at a time, missing the joint hierarchical relationships.
Method: Alternates between constructing a tree for one data mode, computing Tree-Wasserstein Distance for the other mode using that tree, then using the resulting distance to build the second mode’s tree, with iterative refinement.
Result: Method converges theoretically and outperforms baselines in sparse approximation and unsupervised Wasserstein distance learning. When integrated with hyperbolic graph convolutional networks, improves link prediction and node classification performance.
Conclusion: The proposed joint hierarchical representation learning approach effectively captures meaningful hierarchical structures in both data modes and demonstrates practical utility across multiple applications.
Abstract: High-dimensional data often exhibit hierarchical structures in both modes: samples and features. Yet, most existing approaches for hierarchical representation learning consider only one mode at a time. In this work, we propose an unsupervised method for jointly learning hierarchical representations of samples and features via Tree-Wasserstein Distance (TWD). Our method alternates between the two data modes. It first constructs a tree for one mode, then computes a TWD for the other mode based on that tree, and finally uses the resulting TWD to build the second mode’s tree. By repeatedly alternating through these steps, the method gradually refines both trees and the corresponding TWDs, capturing meaningful hierarchical representations of the data. We provide a theoretical analysis showing that our method converges. We show that our method can be integrated into hyperbolic graph convolutional networks as a pre-processing technique, improving performance in link prediction and node classification tasks. In addition, our method outperforms baselines in sparse approximation and unsupervised Wasserstein distance learning tasks on word-document and single-cell RNA-sequencing datasets.
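For readers unfamiliar with Tree-Wasserstein Distance, a minimal implementation is shown below: given a rooted tree, the distance between two histograms is the weighted sum, over edges, of the absolute difference in subtree mass. The parent-array tree encoding is an assumption for illustration.

```python
import numpy as np

def tree_wasserstein(mu, nu, parent, edge_w):
    """mu, nu: histograms over the n tree nodes; parent[i]: parent of node i
    (-1 for the root); edge_w[i]: weight of the edge from i to its parent."""
    n = len(parent)
    depth = [0] * n
    for i in range(n):                         # depth of each node
        j = i
        while parent[j] >= 0:
            depth[i] += 1
            j = parent[j]
    sub = np.asarray(mu, float) - np.asarray(nu, float)
    for i in sorted(range(n), key=lambda k: -depth[k]):   # children before parents
        if parent[i] >= 0:
            sub[parent[i]] += sub[i]           # sub[i] now holds the subtree mass difference
    return float(sum(edge_w[i] * abs(sub[i]) for i in range(n) if parent[i] >= 0))
```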
[422] Flexible-length Text Infilling for Discrete Diffusion Models
Andrew Zhang, Anushka Sivakumar, Chiawei Tang, Chris Thomas
Main category: cs.LG
TL;DR: DDOT is a discrete diffusion model that enables flexible text infilling by jointly denoising token values and positions using Optimal Transport coupling, overcoming previous limitations of requiring ground-truth positional data.
Details
Motivation: Discrete diffusion models have advantages over autoregressive models but cannot perform flexible-length or flexible-position text infilling without access to ground-truth positional data, which limits their practical applications.
Method: DDOT jointly denoises token values and token positions using a novel sample-level Optimal Transport coupling that preserves relative token ordering while dynamically adjusting positions and length of infilled segments.
Result: Extensive experiments on text infilling benchmarks (One-Billion-Word and Yelp) show DDOT outperforms naive diffusion baselines and achieves performance on par with state-of-the-art non-autoregressive models.
Conclusion: DDOT enables significant improvements in training efficiency and flexibility for text infilling tasks, making it the first discrete diffusion model to overcome the positional data limitation while being compatible with various pretrained text denoisers.
Abstract: Discrete diffusion models are a new class of text generators that offer advantages such as bidirectional context use, parallelizable generation, and flexible prompting compared to autoregressive models. However, a critical limitation of discrete diffusion models is their inability to perform flexible-length or flexible-position text infilling without access to ground-truth positional data. We introduce \textbf{DDOT} (\textbf{D}iscrete \textbf{D}iffusion with \textbf{O}ptimal \textbf{T}ransport Position Coupling), the first discrete diffusion model to overcome this challenge. DDOT jointly denoises token values and token positions, employing a novel sample-level Optimal Transport (OT) coupling. This coupling preserves relative token ordering while dynamically adjusting the positions and length of infilled segments, a capability previously missing in text diffusion. Our method is orthogonal to existing discrete text diffusion methods and is compatible with various pretrained text denoisers. Extensive experiments on text infilling benchmarks such as One-Billion-Word and Yelp demonstrate that DDOT outperforms naive diffusion baselines. Furthermore, DDOT achieves performance on par with state-of-the-art non-autoregressive models and enables significant improvements in training efficiency and flexibility.
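A small sketch of a sample-level position coupling: matching noisy token positions to target positions under a squared-distance cost is a one-dimensional assignment problem whose optimum is monotone, so relative token order is preserved. The paper's coupling is only summarized above, so the cost and solver here are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def position_coupling(noisy_pos, target_pos):
    """Return (noisy index, target index) pairs from an optimal 1-D assignment."""
    cost = (np.asarray(noisy_pos, float)[:, None] - np.asarray(target_pos, float)[None, :]) ** 2
    rows, cols = linear_sum_assignment(cost)   # monotone matching for 1-D squared costs
    return list(zip(rows.tolist(), cols.tolist()))
```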
[423] Improving planning and MBRL with temporally-extended actions
Palash Chatterjee, Roni Khardon
Main category: cs.LG
TL;DR: The paper proposes controlling continuous decision timescale directly by treating action duration as an optimization variable, which speeds up simulation, enables deep horizon search with shallow planning depth, and reduces model errors in MBRL.
Details
Motivation: Discrete time dynamics require small simulation steps for accuracy, leading to large planning horizons, computationally demanding problems, and reduced performance. Existing action repeat methods only partially address this issue.
Method: Use temporally-extended actions where the planner optimizes both action variables and their duration. Integrate action duration selection using multi-armed bandit formulation within MBRL framework.
Result: Yields faster planning and better solutions, and enables solving problems that the standard formulation cannot handle; it also reduces compounding errors from model learning and shortens model training time.
Conclusion: Controlling continuous decision timescale by optimizing action duration is effective for improving planning efficiency and performance in both planning and model-based reinforcement learning settings.
Abstract: Continuous time systems are often modeled using discrete time dynamics but this requires a small simulation step to maintain accuracy. In turn, this requires a large planning horizon which leads to computationally demanding planning problems and reduced performance. Previous work in model-free reinforcement learning has partially addressed this issue using action repeats where a policy is learned to determine a discrete action duration. Instead we propose to control the continuous decision timescale directly by using temporally-extended actions and letting the planner treat the duration of the action as an additional optimization variable along with the standard action variables. This additional structure has multiple advantages. It speeds up simulation time of trajectories and, importantly, it allows for deep horizon search in terms of primitive actions while using a shallow search depth in the planner. In addition, in the model-based reinforcement learning (MBRL) setting, it reduces compounding errors from model learning and improves training time for models. We show that this idea is effective and that the range for action durations can be automatically selected using a multi-armed bandit formulation and integrated into the MBRL framework. An extensive experimental evaluation both in planning and in MBRL, shows that our approach yields faster planning, better solutions, and that it enables solutions to problems that are not solved in the standard formulation.
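The core idea, letting the planner choose a duration alongside each action, can be illustrated with a rollout helper in which each plan entry is an (action, duration) pair; `env_step` is an assumed one-step simulator, not part of the paper.

```python
def rollout_extended(env_step, state, plan):
    """Evaluate a plan of (action, duration) pairs: each action is held for its
    duration, so a shallow plan covers a deep horizon in primitive steps."""
    total_reward, steps = 0.0, 0
    for action, duration in plan:
        for _ in range(int(duration)):
            state, reward = env_step(state, action)
            total_reward += reward
            steps += 1
    return total_reward, state, steps
```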
[424] Finite Sample Identification of Partially Observed Bilinear Dynamical Systems
Yahya Sattar, Yassir Jedra, Maryam Fazel, Sarah Dean
Main category: cs.LG
TL;DR: This paper presents a finite-time analysis for learning partially observed bilinear dynamical systems from noisy input-output data, addressing challenges like nonlinear regression and system stability dependencies.
Details
Motivation: To develop a reliable system identification algorithm for bilinear dynamical systems that can handle noisy data and provide theoretical guarantees on learning accuracy.
Method: The algorithm learns Markov-like parameters by regressing outputs to highly correlated, nonlinear, and heavy-tailed covariates, with analysis under a uniform stability assumption.
Result: The paper provides high probability error bounds on the identification algorithm and insights into system theoretic quantities affecting learning accuracy and sample complexity.
Conclusion: Numerical experiments with synthetic data validate the theoretical insights, demonstrating the effectiveness of the proposed bilinear system identification approach.
Abstract: We consider the problem of learning a realization of a partially observed bilinear dynamical system (BLDS) from noisy input-output data. Given a single trajectory of input-output samples, we provide a finite time analysis for learning the system’s Markov-like parameters, from which a balanced realization of the bilinear system can be obtained. Our bilinear system identification algorithm learns the system’s Markov-like parameters by regressing the outputs to highly correlated, nonlinear, and heavy-tailed covariates. Moreover, the stability of BLDS depends on the sequence of inputs used to excite the system. These properties, unique to partially observed bilinear dynamical systems, pose significant challenges to the analysis of our algorithm for learning the unknown dynamics. We address these challenges and provide high probability error bounds on our identification algorithm under a uniform stability assumption. Our analysis provides insights into system theoretic quantities that affect learning accuracy and sample complexity. Lastly, we perform numerical experiments with synthetic data to reinforce these insights.
[425] An Analysis of Concept Bottleneck Models: Measuring, Understanding, and Mitigating the Impact of Noisy Annotations
Seonghwan Park, Jueun Mun, Donghyun Oh, Namhoon Lee
Main category: cs.LG
TL;DR: This paper presents the first systematic study of noise in concept bottleneck models (CBMs), showing that even moderate corruption impairs performance, interpretability, and intervention effectiveness. The authors propose a two-stage framework using sharpness-aware minimization during training and uncertainty-based concept correction during inference to mitigate these vulnerabilities.
Details
Motivation: Concept bottleneck models ensure interpretability through human-interpretable concepts, but the annotations used for training are often noisy. The impact of such corruption on CBMs is not well understood, motivating the need for a systematic study and robust solutions.
Method: The authors propose a two-stage framework: 1) During training, use sharpness-aware minimization to stabilize learning of noise-sensitive concepts. 2) During inference, rank concepts by predictive entropy and correct only the most uncertain ones, using uncertainty as a proxy for susceptibility.
Result: The study shows that even moderate corruption simultaneously impairs prediction performance, interpretability, and intervention effectiveness. The analysis identifies a susceptible subset of concepts whose accuracy declines far more than average and whose corruption accounts for most performance loss.
Conclusion: The proposed framework preserves both interpretability and resilience in the presence of noise, with theoretical analysis and extensive ablations explaining why sharpness-aware training confers robustness and why uncertainty reliably identifies susceptible concepts.
Abstract: Concept bottleneck models (CBMs) ensure interpretability by decomposing predictions into human interpretable concepts. Yet the annotations used for training CBMs that enable this transparency are often noisy, and the impact of such corruption is not well understood. In this study, we present the first systematic study of noise in CBMs and show that even moderate corruption simultaneously impairs prediction performance, interpretability, and the intervention effectiveness. Our analysis identifies a susceptible subset of concepts whose accuracy declines far more than the average gap between noisy and clean supervision and whose corruption accounts for most performance loss. To mitigate this vulnerability we propose a two-stage framework. During training, sharpness-aware minimization stabilizes the learning of noise-sensitive concepts. During inference, where clean labels are unavailable, we rank concepts by predictive entropy and correct only the most uncertain ones, using uncertainty as a proxy for susceptibility. Theoretical analysis and extensive ablations elucidate why sharpness-aware training confers robustness and why uncertainty reliably identifies susceptible concepts, providing a principled basis that preserves both interpretability and resilience in the presence of noise.
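The inference-time stage is easy to sketch: rank predicted concepts by Bernoulli entropy and hand only the k most uncertain ones to an oracle (e.g., a human annotator). The `oracle` interface is an assumption used for illustration.

```python
import torch

def correct_most_uncertain(concept_probs, k, oracle):
    """concept_probs: 1-D tensor of predicted concept probabilities;
    oracle(i) returns the true value of concept i (assumed interface)."""
    p = concept_probs.clamp(1e-6, 1 - 1e-6)
    entropy = -(p * p.log() + (1 - p) * (1 - p).log())    # per-concept Bernoulli entropy
    corrected = concept_probs.clone()
    for i in entropy.topk(k).indices.tolist():
        corrected[i] = oracle(i)                           # intervene only where uncertain
    return corrected
```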
[426] A recursive Bayesian neural network for constitutive modeling of sands under monotonic and cyclic loading
Toiba Noor, Soban Nasir Lone, G. V. Ramana, Rajdip Nayek
Main category: cs.LG
TL;DR: A recursive Bayesian neural network (rBNN) framework is developed for soil constitutive modeling that combines temporal sequence learning with Bayesian inference to achieve both accurate predictions and reliable uncertainty quantification across various loading conditions.
Details
Motivation: Traditional constitutive models in geotechnical engineering need to capture complex soil behavior, and while deep learning approaches show promise, they require both accuracy and proper uncertainty quantification for practical deployment.
Method: The study introduces a recursive Bayesian neural network with sliding window recursive structure to capture path-dependent soil responses. It treats network parameters as random variables and infers posterior distributions via generalized variational inference.
Result: The rBNN framework was validated on four datasets covering monotonic and cyclic loading conditions, showing competitive predictive accuracy compared to LSTM, Encoder-Decoder, and GRU models while providing well-calibrated confidence intervals.
Conclusion: The proposed rBNN approach demonstrates adaptability across varying data fidelity and complexity levels, providing both accurate predictions and reliable uncertainty quantification for soil constitutive modeling.
Abstract: In geotechnical engineering, constitutive models are central to capturing soil behavior across diverse drainage conditions, stress paths, and loading histories. While data-driven deep learning (DL) approaches have shown promise as alternatives to traditional constitutive formulations, their deployment requires models that are both accurate and capable of quantifying predictive uncertainty. This study introduces a recursive Bayesian neural network (rBNN) framework that unifies temporal sequence learning with generalized Bayesian inference to achieve both predictive accuracy and rigorous uncertainty quantification. A key innovation is the incorporation of a sliding window recursive structure that enables the model to effectively capture path-dependent soil responses under monotonic and cyclic loading. By treating network parameters as random variables and inferring their posterior distributions via generalized variational inference, the rBNN produces well-calibrated confidence intervals alongside point predictions. The framework is validated against four datasets spanning both simulated and experimental triaxial tests: monotonic loading using a Hardening Soil model simulation and 28 CD tests on Baskarp sand, and cyclic loading using an exponential constitutive simulation of CD CU tests and 37 experimental cyclic CU tests on Ottawa F65 sand. This progression from monotonic to cyclic and from simulated to experimental data demonstrates the adaptability of the proposed approach across varying levels of data fidelity and complexity. Comparative analyses with LSTM, Encoder-Decoder, and GRU architectures highlight that rBNN not only achieves competitive predictive accuracy but also provides reliable confidence intervals.
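The sliding-window recursive structure amounts to feeding the network the recent loading history rather than a single state; a minimal data-preparation sketch (with illustrative variable names and 1-D arrays) is shown below.

```python
import numpy as np

def sliding_windows(strain, stress, window=10):
    """Build recursive training pairs: the last `window` strains and stresses
    predict the next stress (strain, stress: 1-D arrays of equal length)."""
    X, y = [], []
    for t in range(window, len(strain)):
        X.append(np.concatenate([strain[t - window:t], stress[t - window:t]]))
        y.append(stress[t])
    return np.array(X), np.array(y)
```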
[427] Learning Reward Machines from Partially Observed Policies
Mohamad Louai Shehab, Antoine Aspeel, Necmiye Ozay
Main category: cs.LG
TL;DR: The paper proposes a SAT-based algorithm for inverse reinforcement learning that identifies reward machines from optimal policies or demonstrations, with theoretical guarantees of exact recovery up to an equivalence class given sufficient finite information.
Details
Motivation: To solve the inverse reinforcement learning problem of inferring reward functions from expert demonstrations, particularly when rewards are expressed as reward machines with transitions dependent on atomic propositions in Markov Decision Processes.
Method: Introduces prefix tree policies that map state-proposition sequences to action distributions, characterizes equivalence classes of identifiable reward machines, and develops a SAT-based algorithm that extracts information from prefix tree policies to solve for reward machines.
Result: Proves that with sufficient finite depth of prefix tree policy knowledge, the algorithm recovers the exact reward machine up to the equivalence class. The sufficient depth depends on MDP states and reward machine state count bounds.
Conclusion: The approach is effective and general, demonstrated through discrete grid worlds, block worlds, robotic arm tasks, and real mouse experiment data, extending to cases with only optimal policy demonstrations.
Abstract: Inverse reinforcement learning is the problem of inferring a reward function from an optimal policy or demonstrations by an expert. In this work, it is assumed that the reward is expressed as a reward machine whose transitions depend on atomic propositions associated with the state of a Markov Decision Process (MDP). Our goal is to identify the true reward machine using finite information. To this end, we first introduce the notion of a prefix tree policy which associates a distribution of actions to each state of the MDP and each attainable finite sequence of atomic propositions. Then, we characterize an equivalence class of reward machines that can be identified given the prefix tree policy. Finally, we propose a SAT-based algorithm that uses information extracted from the prefix tree policy to solve for a reward machine. It is proved that if the prefix tree policy is known up to a sufficient (but finite) depth, our algorithm recovers the exact reward machine up to the equivalence class. This sufficient depth is derived as a function of the number of MDP states and (an upper bound on) the number of states of the reward machine. These results are further extended to the case where we only have access to demonstrations from an optimal policy. Several examples, including discrete grid and block worlds, a continuous state-space robotic arm, and real data from experiments with mice, are used to demonstrate the effectiveness and generality of the approach.
[428] Antislop: A Comprehensive Framework for Identifying and Eliminating Repetitive Patterns in Language Models
Samuel Paech, Allen Roush, Judah Goldfeder, Ravid Shwartz-Ziv
Main category: cs.LG
TL;DR: Antislop is a framework that detects and eliminates repetitive “slop” patterns in LLM outputs using three innovations: a backtracking sampler, automated slop profiling, and a novel fine-tuning method called FTPO.
Details
Motivation: Widespread LLM adoption has introduced characteristic repetitive phraseology ("slop") that degrades output quality and makes AI-generated text immediately recognizable.
Method: Combines three approaches: (1) Antislop Sampler using backtracking to suppress unwanted strings, (2) automated pipeline for profiling slop patterns and generating training data, (3) Final Token Preference Optimization (FTPO) - a novel fine-tuning method that surgically adjusts logits.
Result: Successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000 patterns. FTPO achieves 90% slop reduction while maintaining or improving performance on GSM8K, MMLU, and creative writing tasks.
Conclusion: Antislop framework effectively reduces repetitive patterns in LLM outputs while preserving model quality, outperforming alternatives like DPO which suffer degradation in writing quality and lexical diversity.
Abstract: Widespread LLM adoption has introduced characteristic repetitive phraseology, termed “slop,” which degrades output quality and makes AI-generated text immediately recognizable. We present Antislop, a comprehensive framework providing tools to both detect and eliminate these overused patterns. Our approach combines three innovations: (1) The Antislop Sampler, which uses backtracking to suppress unwanted strings at inference time without destroying vocabulary; (2) An automated pipeline that profiles model-specific slop against human baselines and generates training data; (3) Final Token Preference Optimization (FTPO), a novel fine-tuning method that operates on individual tokens, surgically adjusting logits wherever a banned pattern has appeared in an inference trace. We demonstrate that some slop patterns appear over 1,000x more frequently in LLM output than human text. The Antislop Sampler successfully suppresses 8,000+ patterns while maintaining quality, whereas token banning becomes unusable at just 2,000. Most importantly, FTPO achieves 90% slop reduction while maintaining or improving performance in cross-domain evals including GSM8K, MMLU, and creative writing tasks. In contrast, DPO suffers significant degradation in writing quality and lexical diversity despite achieving weaker suppression. We release all code and results under MIT license: https://github.com/sam-paech/auto-antislop.
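The sampler's backtracking loop can be pictured as follows; `generate_step` and `find_slop` are assumed interfaces, and the released Antislop Sampler differs in how it bans continuations and bounds retries.

```python
def backtracking_sample(generate_step, find_slop, max_len, max_retries=5):
    """generate_step(tokens, banned) appends one token, excluding ids in `banned`;
    find_slop(tokens) returns the start index of a banned pattern ending at the
    tail, or -1 if none (illustrative sketch)."""
    tokens, banned_at = [], {}
    while len(tokens) < max_len:
        tokens = generate_step(tokens, banned_at.get(len(tokens), set()))
        start = find_slop(tokens)
        if start >= 0:
            banned_at.setdefault(start, set()).add(tokens[start])  # ban the offending start token
            if len(banned_at[start]) > max_retries:
                break                                              # give up at this position
            tokens = tokens[:start]                                # rewind past the slop
    return tokens
```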
[429] QoQ-Med: Building Multimodal Clinical Foundation Models with Domain-Aware GRPO Training
Wei Dai, Peilin Chen, Chanakya Ekbote, Paul Pu Liang
Main category: cs.LG
TL;DR: QoQ-Med-7B/32B is the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports using Domain-aware Relative Policy Optimization (DRPO) to handle skewed clinical data distributions.
Details
Motivation: Existing multimodal language models are largely vision-centric and fail to generalize across clinical specialties, while clinical decision-making requires reasoning over heterogeneous data types including images, signals, and text.
Method: Trained on 2.61M instruction tuning pairs across 9 clinical domains using Domain-aware Relative Policy Optimization (DRPO), a novel RL objective that hierarchically scales normalized rewards based on domain rarity and modality difficulty.
Result: DRPO training boosts diagnostic performance by 43% in macro-F1 across visual domains compared to other methods. QoQ-Med achieves 10x higher IoU for salient region highlighting than open models and reaches OpenAI o4-mini performance.
Conclusion: QoQ-Med represents a significant advancement in clinical multimodal reasoning, with full model weights, training pipeline, and reasoning traces released to foster reproducibility and downstream research.
Abstract: Clinical decision-making routinely demands reasoning over heterogeneous data, yet existing multimodal language models (MLLMs) remain largely vision-centric and fail to generalize across clinical specialties. To bridge this gap, we introduce QoQ-Med-7B/32B, the first open generalist clinical foundation model that jointly reasons across medical images, time-series signals, and text reports. QoQ-Med is trained with Domain-aware Relative Policy Optimization (DRPO), a novel reinforcement-learning objective that hierarchically scales normalized rewards according to domain rarity and modality difficulty, mitigating performance imbalance caused by skewed clinical data distributions. Trained on 2.61 million instruction tuning pairs spanning 9 clinical domains, we show that DRPO training boosts diagnostic performance by 43% in macro-F1 on average across all visual domains as compared to other critic-free training methods like GRPO. Furthermore, with QoQ-Med trained on intensive segmentation data, it is able to highlight salient regions related to the diagnosis, with an IoU 10x higher than open models while reaching the performance of OpenAI o4-mini. To foster reproducibility and downstream research, we release (i) the full model weights, (ii) the modular training pipeline, and (iii) all intermediate reasoning traces at https://github.com/DDVD233/QoQ_Med.
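A hedged sketch of the reward-shaping idea: normalize rewards within a group, then scale them by domain rarity and modality difficulty so rare clinical domains are not drowned out. The normalization and the multiplicative scaling form are assumptions about DRPO, not its published objective.

```python
import numpy as np

def domain_aware_rewards(rewards, domains, modalities, domain_freq, modality_difficulty):
    """rewards: per-sample rewards; domains/modalities: per-sample keys into the
    frequency and difficulty tables (illustrative scaling)."""
    r = np.asarray(rewards, float)
    r = (r - r.mean()) / (r.std() + 1e-6)                  # group-normalized reward
    rarity = np.array([1.0 / domain_freq[d] for d in domains])
    difficulty = np.array([modality_difficulty[m] for m in modalities])
    scale = (rarity / rarity.mean()) * (difficulty / difficulty.mean())
    return r * scale
```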
[430] Training-Free Constrained Generation With Stable Diffusion Models
Stefano Zampini, Jacob K. Christopher, Luca Oneto, Davide Anguita, Ferdinando Fioretto
Main category: cs.LG
TL;DR: This paper integrates stable diffusion models with constrained optimization to generate outputs that strictly satisfy physical and functional requirements.
Details
Motivation: Existing techniques for incorporating physics-based constraints into generative models are either limited to latent diffusion frameworks or lack strict constraint enforcement capability.
Method: Proposes a novel integration of stable diffusion models with constrained optimization frameworks.
Result: Demonstrated effectiveness through material design experiments requiring precise morphometric properties, inverse design tasks for specific stress-strain responses, and copyright-constrained content generation.
Conclusion: The approach enables generation of outputs satisfying stringent physical and functional requirements, with all code released publicly.
Abstract: Stable diffusion models represent the state-of-the-art in data synthesis across diverse domains and hold transformative potential for applications in science and engineering, e.g., by facilitating the discovery of novel solutions and simulating systems that are computationally intractable to model explicitly. While there is increasing effort to incorporate physics-based constraints into generative models, existing techniques are either limited in their applicability to latent diffusion frameworks or lack the capability to strictly enforce domain-specific constraints. To address this limitation this paper proposes a novel integration of stable diffusion models with constrained optimization frameworks, enabling the generation of outputs satisfying stringent physical and functional requirements. The effectiveness of this approach is demonstrated through material design experiments requiring adherence to precise morphometric properties, challenging inverse design tasks involving the generation of materials inducing specific stress-strain responses, and copyright-constrained content generation tasks. All code has been released at https://github.com/RAISELab-atUVA/Constrained-Stable-Diffusion.
[431] Horizon Reduction Makes RL Scalable
Seohong Park, Kevin Frans, Deepinder Mann, Benjamin Eysenbach, Aviral Kumar, Sergey Levine
Main category: cs.LG
TL;DR: Offline RL algorithms show poor scalability despite large datasets, with long horizons identified as the main barrier. Horizon reduction techniques significantly improve scalability, and the proposed SHARSA method achieves the best performance.
Details
Motivation: To investigate if current offline RL algorithms can scale to solve complex problems given sufficient data and compute, and to understand the factors limiting their scalability.
Method: Evaluated existing offline RL algorithms on diverse, challenging tasks with datasets up to 1000x larger than typical. Identified horizon as the key scaling barrier and proposed SHARSA, a minimal method that explicitly reduces the horizon.
Result: Most offline RL algorithms saturate below maximum performance despite data scaling. Long horizons were empirically verified as the fundamental barrier. Horizon reduction techniques substantially improved scalability, with SHARSA achieving the best asymptotic performance.
Conclusion: Explicit horizon reduction is crucial for unlocking offline RL scalability. SHARSA demonstrates that addressing the horizon barrier enables effective scaling on challenging tasks.
Abstract: In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: https://github.com/seohongpark/horizon-reduction
[432] Optimizing Asynchronous Federated Learning: A Delicate Trade-Off Between Model-Parameter Staleness and Update Frequency
Abdelkrim Alahyane, Céline Comte, Matthieu Jonckheere, Éric Moulines
Main category: cs.LG
TL;DR: This paper analyzes asynchronous federated learning to address the straggler effect in synchronous FL. It introduces optimization methods that balance model staleness and system throughput, achieving 10-30% accuracy improvements.
Details
Motivation: Synchronous FL scales poorly due to straggler effect. Asynchronous FL algorithms like FedAsync exist but their design choices' impact isn't well understood, especially considering heterogeneous client speeds and datasets.
Method: Uses stochastic modeling to analyze asynchronous FL design choices. Proves discrete Little's law to derive relative delay metric for staleness. Introduces alternative metric accounting for both staleness and throughput.
Result: Developed optimization methods that enhance accuracy by 10% to 30% compared to existing approaches. Shows fundamental trade-off between minimizing gradient estimation errors and maximizing system throughput.
Conclusion: Optimizing asynchronous FL requires balancing staleness avoidance with throughput maximization. The proposed metrics and optimization methods significantly improve FL performance by properly addressing this trade-off.
Abstract: Synchronous federated learning (FL) scales poorly with the number of clients due to the straggler effect. Algorithms like FedAsync and GeneralizedFedAsync address this limitation by enabling asynchronous communication between clients and the central server. In this work, we rely on stochastic modeling and analysis to better understand the impact of design choices in asynchronous FL algorithms, such as the concurrency level and routing probabilities, and we leverage this knowledge to optimize loss. Compared to most existing studies, we account for the joint impact of heterogeneous and variable service speeds and heterogeneous datasets at the clients. We characterize in particular a fundamental trade-off for optimizing asynchronous FL: minimizing gradient estimation errors by avoiding model parameter staleness, while also speeding up the system by increasing the throughput of model updates. Our two main contributions can be summarized as follows. First, we prove a discrete variant of Little’s law to derive a closed-form expression for relative delay, a metric that quantifies staleness. This allows us to efficiently minimize the average loss per model update, which has been the gold standard in literature to date, using the upper-bound of Leconte et al. as a proxy. Second, we observe that naively optimizing this metric drastically slows down the system by overemphasizing staleness at the expense of throughput. This motivates us to introduce an alternative metric that also accounts for speed, for which we derive a tractable upper-bound that can be minimized numerically. Extensive numerical results show these optimizations enhance accuracy by 10% to 30%.
[433] Improved Exploration in GFlownets via Enhanced Epistemic Neural Networks
Sajan Muhammad, Salem Lahlou
Main category: cs.LG
TL;DR: Integrates epistemic neural networks with GFlowNets for better uncertainty quantification and exploration, improving trajectory selection in sequential decision problems.
Details
Motivation: Efficiently identifying optimal trajectories for training in GFlowNets requires prioritizing exploration in poorly learned regions of the state space, which calls for uncertainty-driven exploration.
Method: Proposes ENN-GFN-Enhanced algorithm that integrates epistemic neural networks with conventional GFlowNet architecture to enable better joint predictions and uncertainty quantification.
Result: The method demonstrates efficacy and efficiency in grid environments and structured sequence generation across various settings, outperforming baseline GFlowNet methods.
Conclusion: Combining epistemic neural networks with GFlowNets enables more efficient uncertainty-driven exploration and better identification of optimal trajectories in sequential decision problems.
Abstract: Efficiently identifying the right trajectories for training remains an open problem in GFlowNets. To address this, it is essential to prioritize exploration in regions of the state space where the reward distribution has not been sufficiently learned. This calls for uncertainty-driven exploration, in other words, the agent should be aware of what it does not know. This attribute can be measured by joint predictions, which are particularly important for combinatorial and sequential decision problems. In this research, we integrate epistemic neural networks (ENN) with the conventional architecture of GFlowNets to enable more efficient joint predictions and better uncertainty quantification, thereby improving exploration and the identification of optimal trajectories. Our proposed algorithm, ENN-GFN-Enhanced, is compared to the baseline method in GFlownets and evaluated in grid environments and structured sequence generation in various settings, demonstrating both its efficacy and efficiency.
[434] Learning to Add, Multiply, and Execute Algorithmic Instructions Exactly with Neural Networks
Artur Back de Luca, George Giapitzakis, Kimon Fountoulakis
Main category: cs.LG
TL;DR: Neural networks can learn to execute binary algorithmic instructions exactly using NTK framework with logarithmic training data and structured data isolation.
Details
Motivation: Neural networks fail to generalize perfectly on discrete operations like arithmetic, so this work explores whether they can learn to execute binary-encoded algorithmic instructions exactly.
Method: Using Neural Tangent Kernel (NTK) framework to study two-layer fully connected networks in infinite-width limit, with structured training data to isolate bit-level rules and control NTK correlations.
Result: A sufficiently large ensemble of models can be trained to execute exactly four fundamental tasks: binary permutations, binary addition, binary multiplication, and SBN instructions with high probability using only logarithmically many training data.
Conclusion: Since SBN is Turing-complete, this framework extends to computable functions, showing neural networks can learn exact algorithmic execution through proper training structure and NTK regime control.
Abstract: Neural networks are known for their ability to approximate smooth functions, yet they fail to generalize perfectly to unseen inputs when trained on discrete operations. Such operations lie at the heart of algorithmic tasks such as arithmetic, which is often used as a test bed for algorithmic execution in neural networks. In this work, we ask: can neural networks learn to execute binary-encoded algorithmic instructions exactly? We use the Neural Tangent Kernel (NTK) framework to study the training dynamics of two-layer fully connected networks in the infinite-width limit and show how a sufficiently large ensemble of such models can be trained to execute exactly, with high probability, four fundamental tasks: binary permutations, binary addition, binary multiplication, and Subtract and Branch if Negative (SBN) instructions. Since SBN is Turing-complete, our framework extends to computable functions. We show how this can be efficiently achieved using only logarithmically many training data. Our approach relies on two techniques: structuring the training data to isolate bit-level rules, and controlling correlations in the NTK regime to align model predictions with the target algorithmic executions.
[435] Understanding Reasoning in Thinking Language Models via Steering Vectors
Constantin Venhoff, Iván Arcuschin, Philip Torr, Arthur Conmy, Neel Nanda
Main category: cs.LG
TL;DR: A steering approach for thinking LLMs that identifies and controls specific reasoning behaviors like uncertainty expression, hypothesis validation, and backtracking through linear directions in activation space.
Details
Motivation: While thinking language models with extensive reasoning chains achieve improved performance, controlling their reasoning processes remains challenging and requires practical tools for interpretable steering.
Method: Systematic experiments on 500 tasks across 10 categories to identify reasoning behaviors, then extracting steering vectors from linear directions in activation space to modulate specific reasoning aspects.
Result: Demonstrated consistent control over reasoning behaviors across three DeepSeek-R1-Distill models, showing that reasoning behaviors can be mediated and controlled using steering vectors.
Conclusion: Provides practical and interpretable tools for steering reasoning processes in thinking models, enabling controlled manipulation of specific reasoning behaviors across different model architectures.
Abstract: Recent advances in large language models (LLMs) have led to the development of thinking language models that generate extensive internal reasoning chains before producing responses. While these models achieve improved performance, controlling their reasoning processes remains challenging. This work presents a steering approach for thinking LLMs by analyzing and manipulating specific reasoning behaviors in DeepSeek-R1-Distill models. Through a systematic experiment on 500 tasks across 10 diverse categories, we identify several reasoning behaviors exhibited by thinking models, including expressing uncertainty, generating examples for hypothesis validation, and backtracking in reasoning chains. We demonstrate that these behaviors are mediated by linear directions in the model’s activation space and can be controlled using steering vectors. By extracting and applying these vectors, we provide a method to modulate specific aspects of the model’s reasoning process, such as its tendency to backtrack or express uncertainty. Our approach offers practical tools for steering reasoning processes in thinking models in a controlled and interpretable manner. We validate our steering method using three DeepSeek-R1-Distill models, demonstrating consistent control across different model architectures.
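In practice, a steering vector of this kind is often computed as a mean activation difference and added back during the forward pass; the sketch below assumes the hooked module returns a plain tensor, which is not true of every transformer block.

```python
import torch

def steering_vector(acts_with_behavior, acts_without_behavior):
    """Mean difference of activations collected at one layer on traces that do /
    do not show the behavior (e.g., backtracking)."""
    return acts_with_behavior.mean(dim=0) - acts_without_behavior.mean(dim=0)

def apply_steering(layer, vector, strength=1.0):
    """Add the steering vector to the layer's output via a forward hook."""
    def hook(module, inputs, output):
        return output + strength * vector      # assumes the module returns a tensor
    return layer.register_forward_hook(hook)   # keep the handle to remove it later
```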
[436] Long-term Causal Inference via Modeling Sequential Latent Confounding
Weilin Chen, Ruichu Cai, Yuguang Yan, Zhifeng Hao, José Miguel Hernández-Lobato
Main category: cs.LG
TL;DR: The paper introduces a novel assumption that extends the CAECB assumption to handle temporal short-term outcomes, enabling identification of long-term causal effects when multiple sequential short-term outcomes are available.
Details
Motivation: To address limitations of existing CAECB-based methods that only work with one short-term outcome of the same scale as the long-term outcome, by accommodating multiple temporal short-term outcomes.
Method: Proposes a functional relationship assumption between sequential confounding biases across temporal short-term outcomes, develops an estimator based on the identification result, and analyzes its asymptotic properties.
Result: Theoretical identification of long-term causal effects is established, and extensive experiments validate the theoretical results and demonstrate method effectiveness.
Conclusion: The proposed method successfully extends causal inference capabilities to scenarios with temporal short-term outcomes, overcoming limitations of previous approaches.
Abstract: Long-term causal inference is an important but challenging problem across various scientific domains. To solve the latent confounding problem in long-term observational studies, existing methods leverage short-term experimental data. Ghassami et al. propose an approach based on the Conditional Additive Equi-Confounding Bias (CAECB) assumption, which asserts that the confounding bias in the short-term outcome is equal to that in the long-term outcome, so that the long-term confounding bias and the causal effects can be identified. While effective in certain cases, this assumption is limited to scenarios where there is only one short-term outcome with the same scale as the long-term outcome. In this paper, we introduce a novel assumption that extends the CAECB assumption to accommodate temporal short-term outcomes. Our proposed assumption states a functional relationship between sequential confounding biases across temporal short-term outcomes, under which we theoretically establish the identification of long-term causal effects. Based on the identification result, we develop an estimator and conduct a theoretical analysis of its asymptotic properties. Extensive experiments validate our theoretical results and demonstrate the effectiveness of the proposed method.
[437] Pay Attention to Small Weights
Chao Zhou, Tom Jacobs, Advait Gadhikar, Rebekka Burkholz
Main category: cs.LG
TL;DR: NANOADAM is a finetuning method that dynamically updates only small-magnitude weights to reduce computational cost while maintaining performance.
Details
Motivation: Finetuning large pretrained models is resource-intensive, and analysis shows large gradients correlate with small-magnitude weights during finetuning.Method: Dynamically update only small-magnitude weights during finetuning, which is gradient-free and preserves large-magnitude weights from pretraining.
Result: NANOADAM allows larger learning rates, reduces catastrophic forgetting, and achieves better generalization on NLP and vision tasks.
Conclusion: Selective updating of small-magnitude weights provides an efficient finetuning strategy with practical advantages over full parameter updates.
Abstract: Finetuning large pretrained neural networks is known to be resource-intensive, both in terms of memory and computational cost. To mitigate this, a common approach is to restrict training to a subset of the model parameters. By analyzing the relationship between gradients and weights during finetuning, we observe a notable pattern: large gradients are often associated with small-magnitude weights. This correlation is more pronounced in finetuning settings than in training from scratch. Motivated by this observation, we propose NANOADAM, which dynamically updates only the small-magnitude weights during finetuning and offers several practical advantages: first, this criterion is gradient-free – the parameter subset can be determined without gradient computation; second, it preserves large-magnitude weights, which are likely to encode critical features learned during pretraining, thereby reducing the risk of catastrophic forgetting; thirdly, it permits the use of larger learning rates and consistently leads to better generalization performance in experiments. We demonstrate this for both NLP and vision tasks.
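The selection criterion is simple enough to sketch. Below is a minimal, assumption-laden illustration (not the authors' implementation): before finetuning, mark the fraction of entries in each tensor with the smallest magnitude, then zero the gradients of all other entries so only the small weights move. It assumes no weight decay, so untouched weights genuinely stay fixed; the paper's method refreshes the selection dynamically, which is omitted here for brevity.

```python
# Minimal sketch: update only small-magnitude weights during finetuning.
# The 20% selection fraction is an illustrative choice; selection is gradient-free.
import torch

def build_small_weight_masks(model, fraction=0.2):
    """For each parameter, mark the `fraction` of entries with smallest |w|."""
    masks = {}
    for name, p in model.named_parameters():
        if p.requires_grad:
            k = max(1, int(fraction * p.numel()))
            threshold = p.detach().abs().flatten().kthvalue(k).values
            masks[name] = (p.detach().abs() <= threshold).float()
    return masks

def masked_step(model, optimizer, masks):
    """Zero the gradients of large-magnitude weights, then take an optimizer step."""
    for name, p in model.named_parameters():
        if p.grad is not None and name in masks:
            p.grad.mul_(masks[name])   # large weights keep zero gradient, so they stay put
    optimizer.step()
    optimizer.zero_grad()

# Usage sketch:
# masks = build_small_weight_masks(model)        # once (or periodically) before finetuning
# loss.backward(); masked_step(model, optimizer, masks)
```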
[438] Learning Spatially Adaptive $\ell_1$-Norms Weights for Convolutional Synthesis Regularization
Andreas Kofler, Luca Calatroni, Christoph Kolbitsch, Kostas Papafitsoros
Main category: cs.LG
TL;DR: Unrolled algorithm approach for learning spatially adaptive parameter maps in convolutional synthesis-based L1 regularization, applied to low-field MRI reconstruction with comparable performance to established methods while maintaining interpretability.
Details
Motivation: To develop an interpretable deep learning approach for image reconstruction that maintains the benefits of model-based methods while providing insight into algorithm mechanisms through spatially adaptive parameter maps.Method: Unrolls FISTA algorithm to estimate deeply parametrized spatially varying parameters applied to sparse feature maps using pre-trained convolutional filters in a synthesis-based L1 regularization framework.
Result: Produces visually and quantitatively comparable results to spatially adaptive/non-adaptive Total Variation methods and established model-based deep learning approaches for low-field MRI reconstruction.
Conclusion: The approach achieves competitive performance while remaining highly interpretable, with inferred parameter maps providing valuable insight into algorithm mechanisms and potential for filter selection.
Abstract: We propose an unrolled algorithm approach for learning spatially adaptive parameter maps in the framework of convolutional synthesis-based $\ell_1$ regularization. More precisely, we consider a family of pre-trained convolutional filters and estimate deeply parametrized spatially varying parameters applied to the sparse feature maps by means of unrolling a FISTA algorithm to solve the underlying sparse estimation problem. The proposed approach is evaluated for image reconstruction of low-field MRI and compared to spatially adaptive and non-adaptive analysis-type procedures relying on Total Variation regularization and to a well-established model-based deep learning approach. We show that the proposed approach produces visually and quantitatively comparable results with the latter approaches and at the same time remains highly interpretable. In particular, the inferred parameter maps quantify the local contribution of each filter in the reconstruction, which provides valuable insight into the algorithm mechanism and could potentially be used to discard unsuited filters.
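The core computational pattern, an unrolled FISTA loop whose soft-threshold is a spatially varying map, can be sketched compactly. The version below is a bare-bones illustration under several assumptions: fixed pre-trained filters, a denoising-style data term, a fixed step size, and a raw learnable threshold tensor in place of the paper's deep parametrization.

```python
# Illustrative sketch of unrolled FISTA for convolutional synthesis sparse coding
# with a learnable, spatially varying threshold map per filter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnrolledFISTA(nn.Module):
    def __init__(self, filters, image_shape, n_iters=10, step=0.5):
        super().__init__()
        self.register_buffer("filters", filters)        # (C, 1, k, k) pre-trained, fixed
        C = filters.shape[0]
        H, W = image_shape
        # Spatially varying, learnable threshold map per filter (non-negative via softplus).
        self.raw_theta = nn.Parameter(torch.zeros(1, C, H, W))
        self.n_iters, self.step = n_iters, step          # step should respect the Lipschitz constant

    def synthesize(self, z):
        # x = sum_c (filter_c * z_c): convolutional synthesis of the image from feature maps.
        return F.conv_transpose2d(z, self.filters, padding=self.filters.shape[-1] // 2)

    def forward(self, y):
        C = self.filters.shape[0]
        z = torch.zeros(y.shape[0], C, *y.shape[-2:], device=y.device)
        z_prev, t = z.clone(), 1.0
        theta = F.softplus(self.raw_theta)
        for _ in range(self.n_iters):
            grad = F.conv2d(self.synthesize(z) - y, self.filters,
                            padding=self.filters.shape[-1] // 2)
            u = z - self.step * grad
            z_new = torch.sign(u) * torch.clamp(u.abs() - self.step * theta, min=0.0)
            t_new = (1 + (1 + 4 * t ** 2) ** 0.5) / 2
            z, z_prev, t = z_new + ((t - 1) / t_new) * (z_new - z_prev), z_new, t_new
        return self.synthesize(z)
```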
[439] PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning
Tatsuki Kawakami, Kazuki Egashira, Atsuyuki Miyai, Go Irie, Kiyoharu Aizawa
Main category: cs.LG
TL;DR: PULSE protocol introduces a realistic evaluation framework for unlearning in large multimodal models, focusing on pre-trained knowledge unlearning and long-term sustainability evaluation.
Details
Motivation: Existing unlearning benchmarks for LMMs only consider single unlearning operations on fine-tuned knowledge, lacking evaluation for pre-trained knowledge and sequential unlearning scenarios.Method: Proposed PULSE protocol with two perspectives: (i) Pre-trained knowledge Unlearning to analyze effects across different knowledge acquisition phases, and (ii) Long-term Sustainability Evaluation for sequential unlearning requests.
Result: Current unlearning techniques successfully remove fine-tuned knowledge but struggle with pre-trained knowledge. Methods effective in batch unlearning degrade significantly when data is split and unlearned sequentially.
Conclusion: The study highlights limitations of existing unlearning methods and emphasizes the need for more robust techniques that can handle pre-trained knowledge and sequential unlearning scenarios in LMMs.
Abstract: In recent years, unlearning techniques, which are methods for inducing a model to “forget” previously learned information, have attracted attention as a way to address privacy and copyright concerns in large language models (LLMs) and large multimodal models (LMMs). While several unlearning benchmarks have been established for LLMs, a practical evaluation framework for unlearning in LMMs has been less explored. Specifically, existing unlearning benchmarks for LMMs consider only scenarios in which the model is required to unlearn fine-tuned knowledge through a single unlearning operation. In this study, we introduce the PULSE protocol for realistic unlearning scenarios for LMMs by introducing two critical perspectives: (i) Pre-trained knowledge Unlearning for analyzing the effect across different knowledge acquisition phases and (ii) Long-term Sustainability Evaluation to address sequential requests. We then evaluate existing unlearning methods along these dimensions. Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training. Moreover, methods that effectively unlearn a batch of target data in a single operation exhibit substantial performance degradation when the same data are split and unlearned sequentially.
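The long-term sustainability part of the protocol reduces to a simple loop that can be sketched directly. In the schematic below, `unlearn` and `evaluate` are hypothetical placeholders supplied by the user; the loop splits the forget set into sequential requests and tracks forget and retain utility after each one, which is where the abstract reports degradation for batch-effective methods.

```python
# Schematic sequential-unlearning evaluation loop in the spirit of PULSE.
def sequential_unlearning_eval(model, forget_set, retain_set, unlearn, evaluate, n_requests=5):
    chunk = max(1, len(forget_set) // n_requests)
    history = []
    for i in range(n_requests):
        batch = forget_set[i * chunk:(i + 1) * chunk]
        if not batch:
            break
        model = unlearn(model, batch)                                     # one unlearning request
        history.append({
            "request": i + 1,
            "forget_acc": evaluate(model, forget_set[:(i + 1) * chunk]),  # should drop
            "retain_acc": evaluate(model, retain_set),                    # should stay high
        })
    return history
```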
[440] Kolmogorov-Arnold Attention: Is Learnable Attention Better For Vision Transformers?
Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta
Main category: cs.LG
TL;DR: This paper introduces Kolmogorov-Arnold Attention (KArAt), the first learnable attention mechanism for Vision Transformers that can operate on various mathematical bases like Fourier, Wavelets, Splines, and Rational Functions.
Details
Motivation: To explore whether Kolmogorov-Arnold Networks (KANs) could learn token interactions in Vision Transformers, moving beyond simply replacing MLPs to creating learnable attention mechanisms.Method: Designed KArAt with learnable activation functions in attention, addressed memory explosion with low-rank approximation, and implemented Fourier-KArAt variants tested on various ViT architectures including ConViT and Swin-Transformer.
Result: Fourier-KArAt variants sometimes outperform or show comparable performance to traditional softmax attention on CIFAR-10, CIFAR-100, and ImageNet-1K, with improved attention scores and better token interactions, though generalizability doesn’t scale with larger ViTs.
Conclusion: KArAt demonstrates that attention can be learned, encouraging further exploration of learnable attention mechanisms with more advanced architectures, though current computing interfaces limit performance of parameter-heavy KArAts.
Abstract: Kolmogorov-Arnold networks (KANs) are a remarkable innovation that consists of learnable activation functions, with the potential to capture more complex relationships from data. Presently, KANs are deployed by replacing multilayer perceptrons (MLPs) in deep networks, including advanced architectures such as vision Transformers (ViTs). This work asks whether KAN could learn token interactions. In this paper, we design the first learnable attention called Kolmogorov-Arnold Attention (KArAt) for ViTs that can operate on any basis, ranging from Fourier, Wavelets, Splines, to Rational Functions. However, learnable activations in the attention cause a memory explosion. To remedy this, we propose a modular version of KArAt that uses a low-rank approximation. By adopting the Fourier basis, Fourier-KArAt and its variants, in some cases, outperform their traditional softmax counterparts, or show comparable performance on CIFAR-10, CIFAR-100, and ImageNet-1K. We also deploy Fourier KArAt to ConViT and Swin-Transformer, and use it in detection and segmentation with ViT-Det. We dissect the performance of these architectures by analyzing their loss landscapes, weight distributions, optimizer paths, attention visualizations, and transferability to other datasets. KArAt’s learnable activation yields a better attention score across all ViTs, indicating improved token-to-token interactions and contributing to enhanced inference. Still, its generalizability does not scale with larger ViTs. However, many factors, including the present computing interface, affect the relative performance of parameter- and memory-heavy KArAts. We note that the goal of this paper is not to produce efficient attention or challenge the traditional activations; by designing KArAt, we are the first to show that attention can be learned and encourage researchers to explore KArAt in conjunction with more advanced architectures.
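To give a feel for what a "learnable attention" unit looks like, here is a hedged sketch in the spirit of Fourier-KArAt: the fixed softmax over attention logits is replaced by a per-head function expanded in a Fourier basis with learnable coefficients. The basis size, the ReLU, and the final row normalization are illustrative assumptions, not the paper's exact design (and this toy version omits the low-rank approximation used to control memory).

```python
# Hedged sketch of Fourier-basis learnable attention for a ViT block.
import torch
import torch.nn as nn

class FourierLearnableAttention(nn.Module):
    def __init__(self, dim, n_heads=4, n_freqs=8):
        super().__init__()
        assert dim % n_heads == 0
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.n_heads, self.head_dim = n_heads, dim // n_heads
        # Learnable Fourier coefficients (a_k, b_k) per head.
        self.coef = nn.Parameter(torch.randn(n_heads, n_freqs, 2) * 0.02)
        self.register_buffer("freqs", torch.arange(1, n_freqs + 1).float())

    def phi(self, s):
        # s: (B, H, T, T) attention logits -> sum_k a_k sin(k s) + b_k cos(k s).
        ang = s.unsqueeze(-1) * self.freqs                       # (B, H, T, T, K)
        a = self.coef[None, :, None, None, :, 0]                 # (1, H, 1, 1, K)
        b = self.coef[None, :, None, None, :, 1]
        return (a * torch.sin(ang) + b * torch.cos(ang)).sum(-1)

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2) for t in (q, k, v))
        logits = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        scores = self.phi(logits).relu()                         # learnable activation on logits
        attn = scores / scores.sum(-1, keepdim=True).clamp_min(1e-6)  # row-normalize (assumption)
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.proj(out)
```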
[441] PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs
Xinzhe Zheng, Hao Du, Fanding Xu, Jinzhe Li, Zhiyuan Liu, Wenkang Wang, Tao Chen, Wanli Ouyang, Stan Z. Li, Yan Lu, Nanqing Dong, Yang Zhang
Main category: cs.LG
TL;DR: PRING is the first comprehensive benchmark for protein-protein interaction prediction that evaluates models from a graph-level perspective, addressing limitations of existing pairwise evaluation methods by assessing network reconstruction capabilities.
Details
Motivation: Existing PPI prediction benchmarks focus on isolated pairwise evaluations but overlook models' capability to reconstruct biologically meaningful PPI networks, which is crucial for real-world biology research.Method: PRING curates a high-quality multi-species PPI network dataset (21,484 proteins, 186,818 interactions) and establishes two evaluation paradigms: topology-oriented tasks (intra/cross-species network construction) and function-oriented tasks (protein complex pathway prediction, GO module analysis, essential protein justification).
Result: Extensive experiments on four model categories (sequence similarity-based, naive sequence-based, protein language model-based, and structure-based) show current PPI models have limitations in recovering both structural and functional properties of PPI networks.
Conclusion: PRING provides a reliable platform to guide development of more effective PPI prediction models, highlighting the gap between current methods and real-world biological applications.
Abstract: Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model’s capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this golden-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model’s capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories, consisting of sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches, demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at https://github.com/SophieSarceau/PRING.
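The shift from pairwise to graph-level evaluation can be illustrated with a small sketch: score every protein pair, threshold the scores into a predicted network, and compare its topology against the reference network. The 0.5 threshold and the particular statistics below are illustrative stand-ins, not PRING's official task suite or metrics.

```python
# Illustrative graph-level evaluation of a pairwise PPI predictor.
import itertools
import networkx as nx
import numpy as np

def reconstruct_network(proteins, score_fn, threshold=0.5):
    """Build a predicted PPI graph by scoring every protein pair."""
    g = nx.Graph()
    g.add_nodes_from(proteins)
    for a, b in itertools.combinations(proteins, 2):
        if score_fn(a, b) >= threshold:
            g.add_edge(a, b)
    return g

def topology_report(pred_g, true_g):
    """Simple graph-level statistics comparing predicted and reference networks."""
    true_edges = set(map(frozenset, true_g.edges()))
    pred_edges = set(map(frozenset, pred_g.edges()))
    tp = len(true_edges & pred_edges)
    nodes = sorted(true_g.nodes())
    d_pred = np.array([pred_g.degree(n) if n in pred_g else 0 for n in nodes], dtype=float)
    d_true = np.array([true_g.degree(n) for n in nodes], dtype=float)
    degree_corr = 0.0
    if d_pred.std() > 0 and d_true.std() > 0:
        degree_corr = float(np.corrcoef(d_pred, d_true)[0, 1])
    return {
        "edge_precision": tp / max(len(pred_edges), 1),
        "edge_recall": tp / max(len(true_edges), 1),
        "degree_corr": degree_corr,
    }
```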
[442] Customizing Spider Silk: Generative Models with Mechanical Property Conditioning for Protein Engineering
Neeru Dubey, Elin Karlsson, Miguel Angel Redondo, Johan Reimegård, Anna Rising, Hedvig Kjellström
Main category: cs.LG
TL;DR: A computational framework using distilled GPT-based models to design spider silk protein repeat sequences with customizable mechanical properties.
Details
Motivation: Spider silk has remarkable mechanical properties governed by protein repeat sequences, but correlating sequences with mechanical characteristics is challenging due to complex sequence-structure-function relationships and limited annotated data.Method: Developed a lightweight GPT-based generative model by distilling ProtGPT2, then multilevel fine-tuning using curated Spider Silkome datasets - first with 6,000 MaSp repeats, then with 572 repeats linked to experimental mechanical properties.
Result: The model generates biologically plausible MaSp repeat regions tailored to specific mechanical properties and predicts properties for given sequences. Validation showed accurate physicochemical attributes, motif distributions, secondary structures, and predictive accuracy confirmed by BLAST correlation studies.
Conclusion: This framework advances rational design of spider silk-inspired biomaterials, providing a versatile tool for engineering protein sequences with tailored mechanical attributes.
Abstract: The remarkable mechanical properties of spider silk, including its tensile strength and extensibility, are primarily governed by the repetitive regions of the proteins that constitute the fiber, the major ampullate spidroins (MaSps). However, establishing correlations between mechanical characteristics and repeat sequences is challenging due to the intricate sequence-structure-function relationships of MaSps and the limited availability of annotated datasets. In this study, we present a novel computational framework for designing MaSp repeat sequences with customizable mechanical properties. To achieve this, we developed a lightweight GPT-based generative model by distilling the pre-trained ProtGPT2 protein language model. The distilled model was subjected to multilevel fine-tuning using curated subsets of the Spider Silkome dataset. Specifically, we adapt the model for MaSp repeat generation using 6,000 MaSp repeat sequences and further refine it with 572 repeats associated with experimentally determined fiber-level mechanical properties. Our model generates biologically plausible MaSp repeat regions tailored to specific mechanical properties while also predicting those properties for given sequences. Validation includes sequence-level analysis, assessing physicochemical attributes and expected distribution of key motifs as well as secondary structure compositions. A correlation study using BLAST on the Spider Silkome dataset and a test set of MaSp repeats with known mechanical properties further confirmed the predictive accuracy of the model. This framework advances the rational design of spider silk-inspired biomaterials, offering a versatile tool for engineering protein sequences with tailored mechanical attributes.
[443] TunnElQNN: A Hybrid Quantum-classical Neural Network for Efficient Learning
A. H. Abbas
Main category: cs.LG
TL;DR: TunnElQNN is a hybrid quantum-classical neural network with alternating classical and quantum layers, using Tunnelling Diode Activation Function (TDAF) that outperforms ReLU-based hybrid models on multi-class classification tasks.
Details
Motivation: To leverage complementary strengths of quantum and classical models in machine learning by integrating physics-inspired activation functions with quantum components to enhance expressiveness and robustness.Method: Developed TunnElQNN - a non-sequential architecture with alternating classical and quantum layers, using Tunnelling Diode Activation Function (TDAF) inspired by quantum tunnelling I-V characteristics. Evaluated on synthetic interleaving half-circle dataset for multi-class classification with varying class overlap.
Result: TunnElQNN consistently outperforms the ReLUQNN baseline. Analysis of decision boundaries shows improved performance under different levels of class overlap compared to fully classical TDAF neural networks.
Conclusion: Integrating physics-inspired activation functions with quantum components enhances the expressiveness and robustness of hybrid quantum-classical machine learning architectures.
Abstract: Hybrid quantum-classical neural networks (HQCNNs) represent a promising frontier in machine learning, leveraging the complementary strengths of both models. In this work, we propose the development of TunnElQNN, a non-sequential architecture composed of alternating classical and quantum layers. Within the classical component, we employ the Tunnelling Diode Activation Function (TDAF), inspired by the I-V characteristics of quantum tunnelling. We evaluate the performance of this hybrid model on a synthetic dataset of interleaving half-circle for multi-class classification tasks with varying degrees of class overlap. The model is compared against a baseline hybrid architecture that uses the conventional ReLU activation function (ReLUQNN). Our results show that the TunnElQNN model consistently outperforms the ReLUQNN counterpart. Furthermore, we analyse the decision boundaries generated by TunnElQNN under different levels of class overlap and compare them to those produced by a neural network implementing TDAF within a fully classical architecture. These findings highlight the potential of integrating physics-inspired activation functions with quantum components to enhance the expressiveness and robustness of hybrid quantum-classical machine learning architectures.
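For intuition, the classical component's activation can be sketched as follows. The expression below is a common phenomenological tunnel-diode I-V model (a peaked tunnelling term plus an exponential tail), used here only as an illustration of what a TDAF-like, non-monotonic activation might look like; it is not necessarily the exact TDAF definition or parameterization from the paper, and inputs are assumed normalized to a small range.

```python
# Hedged sketch of a tunnelling-diode-style activation for the classical layers.
import torch
import torch.nn as nn

class TunnelDiodeActivation(nn.Module):
    def __init__(self, i_p=1.0, v_p=0.1, i_v=0.1, v_v=0.5, k=5.0):
        super().__init__()
        self.i_p, self.v_p, self.i_v, self.v_v, self.k = i_p, v_p, i_v, v_v, k

    def forward(self, v):
        tunnel = self.i_p * (v / self.v_p) * torch.exp(1.0 - v / self.v_p)   # peaked tunnelling term
        diode = self.i_v * torch.exp(self.k * (v - self.v_v))                # exponential tail
        return tunnel + diode

# Drop-in use inside the classical part of a hybrid network (illustrative):
# clf = nn.Sequential(nn.Linear(2, 16), TunnelDiodeActivation(), nn.Linear(16, 3))
```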
[444] MINGLE: Mixture of Null-Space Gated Low-Rank Experts for Test-Time Continual Model Merging
Zihuan Qiu, Yi Xu, Chiyuan He, Fanman Meng, Linfeng Xu, Qingbo Wu, Hongliang Li
Main category: cs.LG
TL;DR: MINGLE is a test-time continual model merging framework that uses mixture-of-experts architecture with null-space constrained gating and adaptive relaxation to reduce forgetting and handle distribution shifts.
Details
Motivation: Address parameter interference and catastrophic forgetting in continual model merging, while improving adaptability to evolving test distributions without access to original training data.Method: Uses mixture-of-experts with low-rank experts, null-space constrained gating to restrict updates to orthogonal subspaces, and adaptive relaxation strategy to balance stability and adaptability during test-time.
Result: Achieves robust generalization, significantly reduces forgetting, and surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders.
Conclusion: MINGLE provides an effective solution for test-time continual model merging that successfully addresses parameter interference and distribution shift challenges while maintaining past knowledge.
Abstract: Continual model merging integrates independently fine-tuned models sequentially without access to the original training data, offering a scalable and efficient solution for continual learning. However, existing methods face two critical challenges: parameter interference among tasks, which leads to catastrophic forgetting, and limited adaptability to evolving test distributions. To address these issues, we introduce the task of Test-Time Continual Model Merging (TTCMM), which leverages a small set of unlabeled test samples during inference to alleviate parameter conflicts and handle distribution shifts. We propose MINGLE, a novel framework for TTCMM. MINGLE employs a mixture-of-experts architecture with parameter-efficient, low-rank experts, which enhances adaptability to evolving test distributions while dynamically merging models to mitigate conflicts. To further reduce forgetting, we propose Null-Space Constrained Gating, which restricts gating updates to subspaces orthogonal to prior task representations, thereby suppressing activations on old tasks and preserving past knowledge. We further introduce an Adaptive Relaxation Strategy that adjusts constraint strength dynamically based on interference signals observed during test-time adaptation, striking a balance between stability and adaptability. Extensive experiments on standard continual merging benchmarks demonstrate that MINGLE achieves robust generalization, significantly reduces forgetting, and consistently surpasses previous state-of-the-art methods by 7-9% on average across diverse task orders. Our code is available at: https://github.com/zihuanqiu/MINGLE
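The null-space constrained gating idea has a compact linear-algebra core that can be sketched independently of the full framework: estimate a basis for the subspace spanned by prior-task representations, then project the gate's gradient onto the orthogonal complement, so updates leave the gate's response on old-task features unchanged. The SVD energy threshold and the usage pattern below are illustrative assumptions, not the released MINGLE code.

```python
# Minimal sketch of a null-space constrained gradient update for a gating layer.
import torch

def prior_task_basis(feature_bank, energy=0.99):
    """SVD basis U (d x r) capturing most of the variance of prior-task features (n x d)."""
    U, S, _ = torch.linalg.svd(feature_bank.T, full_matrices=False)
    keep = (S.cumsum(0) / S.sum()) <= energy
    r = max(int(keep.sum().item()), 1)
    return U[:, :r]

def nullspace_project_grad(weight, U):
    """In-place: remove the gradient component lying in span(U) of the input space."""
    if weight.grad is None:
        return
    # weight: (out, d); for any x in span(U), the projected update leaves W @ x unchanged.
    weight.grad -= weight.grad @ U @ U.T

# Usage sketch:
# U = prior_task_basis(old_task_features)
# loss.backward()
# nullspace_project_grad(gate.weight, U)
# optimizer.step()
```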
[445] Explaining Time Series Classifiers with PHAR: Rule Extraction and Fusion from Post-hoc Attributions
Maciej Mozolewski, Szymon Bobek, Grzegorz J. Nalepa
Main category: cs.LG
TL;DR: PHAR is a framework that converts numeric feature attributions from explainers like LIME and SHAP into human-readable rules for time series classification, improving interpretability by localizing decision-relevant segments.
Details
Motivation: Time series classification models are hard to interpret due to raw time series complexity and high input dimensionality, creating a need for better explanation methods.Method: PHAR transforms numeric feature attributions into structured rules with human-readable intervals, uses rule fusion with weighted selection and lasso-based refinement, and includes visualization techniques.
Result: PHAR performs comparably to native rule-based methods like Anchor, scales better to long time series, achieves broader instance coverage, and resolves conflicting explanations into coherent insights.
Conclusion: PHAR improves interpretability, decision transparency, and practical applicability for time series classification by providing concise, human-readable rules aligned with model predictions.
Abstract: Explaining machine learning (ML) models for time series (TS) classification remains challenging due to the difficulty of interpreting raw time series and the high dimensionality of the input space. We introduce PHAR (Post-hoc Attribution Rules), a unified framework that transforms numeric feature attributions from post-hoc, instance-wise explainers (e.g., LIME, SHAP) into structured, human-readable rules. These rules define human-readable intervals that indicate where and when decision-relevant segments occur and can enhance model transparency by localizing threshold-based conditions on the raw series. PHAR performs comparably to native rule-based methods, such as Anchor, while scaling more efficiently to long TS sequences and achieving broader instance coverage. A dedicated rule fusion step consolidates rule sets using strategies like weighted selection and lasso-based refinement, balancing key quality metrics: coverage, confidence, and simplicity. This fusion ensures each instance receives a concise and unambiguous rule, improving both explanation fidelity and consistency. We further introduce visualization techniques to illustrate specificity-generalization trade-offs in the derived rules. PHAR resolves conflicting and overlapping explanations - a common effect of the Rashomon phenomenon - into coherent, domain-adaptable insights. Comprehensive experiments on the UCR/UEA Time Series Classification Archive demonstrate that PHAR may improve interpretability, decision transparency, and practical applicability for TS classification tasks by providing concise, human-readable rules aligned with model predictions.
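The attribution-to-rule step can be illustrated with a toy sketch: contiguous time steps with high attribution become an interval, and the rule conditions on the raw series inside that interval. The quantile cutoff and the single-condition rule template below are illustrative simplifications; PHAR's actual rule construction, fusion, and lasso-based refinement are more involved.

```python
# Toy sketch: turn a per-timestep attribution vector into one human-readable rule.
import numpy as np

def attribution_to_rule(series, attributions, predicted_class, top_frac=0.2):
    series, attributions = np.asarray(series), np.asarray(attributions)
    cutoff = np.quantile(np.abs(attributions), 1 - top_frac)
    salient = np.abs(attributions) >= cutoff
    # Take the longest contiguous salient segment as the rule's interval.
    best, start = (0, 0), None
    for t, s in enumerate(np.append(salient, False)):
        if s and start is None:
            start = t
        elif not s and start is not None:
            if t - start > best[1] - best[0]:
                best = (start, t)
            start = None
    lo, hi = best
    theta = float(series[lo:hi].mean())
    return f"IF mean(x[{lo}:{hi}]) >= {theta:.3f} THEN predict class {predicted_class}"

# Example: rule = attribution_to_rule(x, shap_values, model.predict([x])[0])
```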
[446] CATransformers: Carbon Aware Transformers Through Joint Model-Hardware Optimization
Irene Wang, Newsha Ardalani, Mostafa Elhoushi, Daniel Jiang, Samuel Hsia, Ekin Sumbul, Divya Mahajan, Carole-Jean Wu, Bilge Acun
Main category: cs.LG
TL;DR: CATransformers is a carbon-aware co-optimization framework that reduces total carbon emissions (operational + embodied) by up to 30% for Transformer models while maintaining accuracy and latency.
Details
Motivation: Growing adoption of machine learning increases lifecycle carbon footprint from operational carbon (training/inference) and embodied carbon (hardware manufacturing), requiring sustainable AI solutions.Method: Carbon-aware co-optimization framework that integrates operational and embodied carbon into early-stage design space exploration for Transformer models and hardware accelerators.
Result: Framework reduces total carbon emissions by up to 30% across various Transformer models while maintaining accuracy and latency; extensible to multi-modal models.
Conclusion: Holistic optimization methods prioritizing carbon efficiency are needed without compromising model capability and execution time performance.
Abstract: Machine learning solutions are rapidly adopted to enable a variety of key use cases, from conversational AI assistants to scientific discovery. This growing adoption is expected to increase the associated lifecycle carbon footprint, including both operational carbon from training and inference and embodied carbon from AI hardware manufacturing. We introduce CATransformers, the first carbon-aware co-optimization framework for Transformer-based models and hardware accelerators. By integrating both operational and embodied carbon into early-stage design space exploration, CATransformers enables sustainability-driven model architecture and hardware accelerator co-design that reveals fundamentally different trade-offs than latency- or energy-centric approaches. Evaluated across a range of Transformer models, CATransformers consistently demonstrates the potential to reduce total carbon emissions, by up to 30%, while maintaining accuracy and latency. We further highlight its extensibility through a focused case study on multi-modal models. Our results emphasize the need for holistic optimization methods that prioritize carbon efficiency without compromising model capability and execution time performance. The source code of CATransformers is available at https://github.com/facebookresearch/CATransformers.
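The key change relative to latency- or energy-centric search is that total carbon, operational plus embodied, enters the design-space objective directly. The sketch below makes that concrete with toy numbers: every figure, estimator formula, and configuration is a placeholder, and the real framework uses actual hardware, accuracy, and carbon models rather than a brute-force grid.

```python
# Toy sketch of carbon-aware model/hardware co-design search.
import itertools

MODELS = [{"name": "tiny", "acc": 0.71, "flops": 1e9},
          {"name": "base", "acc": 0.76, "flops": 4e9}]
HARDWARE = [{"name": "small-asic", "embodied_kg": 8.0, "joules_per_flop": 2e-10, "tput": 2e11},
            {"name": "large-asic", "embodied_kg": 20.0, "joules_per_flop": 1e-10, "tput": 8e11}]
GRID_KG_PER_KWH = 0.4     # toy grid carbon intensity

def total_carbon_kg(m, hw, queries_per_year, years=3):
    # operational carbon over the device lifetime + embodied (manufacturing) carbon
    energy_kwh = m["flops"] * hw["joules_per_flop"] * queries_per_year * years / 3.6e6
    return energy_kwh * GRID_KG_PER_KWH + hw["embodied_kg"]

def carbon_aware_search(acc_floor, lat_ceiling_s, queries_per_year=1e8):
    best, best_c = None, float("inf")
    for m, hw in itertools.product(MODELS, HARDWARE):
        latency = m["flops"] / hw["tput"]
        if m["acc"] < acc_floor or latency > lat_ceiling_s:
            continue
        c = total_carbon_kg(m, hw, queries_per_year)
        if c < best_c:
            best, best_c = (m["name"], hw["name"]), c
    return best, best_c

print(carbon_aware_search(acc_floor=0.7, lat_ceiling_s=0.05))
```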
[447] AmorLIP: Efficient Language-Image Pretraining via Amortization
Haotian Sun, Yitong Li, Yuchen Zhuang, Niao He, Hanjun Dai, Bo Dai
Main category: cs.LG
TL;DR: AmorLIP is an efficient CLIP pretraining framework that uses lightweight neural networks to amortize expensive contrastive learning computations, improving training efficiency and performance without requiring large batch sizes.
Details
Motivation: Standard CLIP methods require extremely large batch sizes and high computational resources (hundreds to thousands of GPUs) for robust representation learning. Existing solutions often compromise performance, prolong training, or face scalability issues with large datasets.Method: AmorLIP amortizes expensive contrastive learning computations through lightweight neural networks, leveraging insights from spectral factorization of energy-based models. It introduces novel amortization objectives and practical techniques for training stability.
Result: Extensive experiments across 38 downstream tasks show AmorLIP achieves superior zero-shot classification and retrieval capabilities, consistently outperforming standard CLIP baselines with relative improvements up to 12.24%.
Conclusion: AmorLIP provides an efficient and effective alternative to standard CLIP pretraining, substantially improving training efficiency while maintaining or enhancing downstream performance across multiple tasks.
Abstract: Contrastive Language-Image Pretraining (CLIP) has demonstrated strong zero-shot performance across diverse downstream text-image tasks. Existing CLIP methods typically optimize a contrastive objective using negative samples drawn from each minibatch. To achieve robust representation learning, these methods require extremely large batch sizes and escalate computational demands to hundreds or even thousands of GPUs. Prior approaches to mitigate this issue often compromise downstream performance, prolong training duration, or face scalability challenges with very large datasets. To overcome these limitations, we propose AmorLIP, an efficient CLIP pretraining framework that amortizes expensive computations involved in contrastive learning through lightweight neural networks, which substantially improves training efficiency and performance. Leveraging insights from a spectral factorization of energy-based models, we introduce novel amortization objectives along with practical techniques to improve training stability. Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AmorLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%.
[448] Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning
Yang Zhou, Sunzhu Li, Shunyu Liu, Wenkai Fang, Kongcheng Zhang, Jiale Zhao, Jingwen Yang, Yihe Zhou, Jianwei Lv, Tongya Zheng, Hengtong Lu, Wei Chen, Yan Xie, Mingli Song
Main category: cs.LG
TL;DR: RuscaRL is a reinforcement learning framework that uses checklist-style rubrics to break the exploration bottleneck in LLM reasoning by providing explicit guidance during rollout generation and verifiable rewards during training.
Details
Motivation: To address the fundamental dilemma where RL improvement requires high-quality samples but exploration is limited by LLM capabilities, creating a cycle where unexplored patterns cannot be learned.Method: Uses rubric-scaffolded reinforcement learning with checklist-style rubrics as explicit scaffolding for exploration during rollout generation and as verifiable rewards during model training, with gradual decay of guidance.
Result: Significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1, and achieves 61.1 with Qwen3-30B-A3B-Instruct, outperforming OpenAI-o3.
Conclusion: RuscaRL effectively expands reasoning boundaries and breaks the exploration bottleneck in LLM reasoning through instructional scaffolding.
Abstract: Recent advances in Large Language Models (LLMs) have underscored the potential of Reinforcement Learning (RL) to facilitate the emergence of reasoning capabilities. Despite the encouraging results, a fundamental dilemma persists as RL improvement relies on learning from high-quality samples, yet the exploration for such samples remains bounded by the inherent limitations of LLMs. This, in effect, creates an undesirable cycle in which what cannot be explored cannot be learned. In this work, we propose Rubric-Scaffolded Reinforcement Learning (RuscaRL), a novel instructional scaffolding framework designed to break the exploration bottleneck for general LLM reasoning. Specifically, RuscaRL introduces checklist-style rubrics as (1) explicit scaffolding for exploration during rollout generation, where different rubrics are provided as external guidance within task instructions to steer diverse high-quality responses. This guidance is gradually decayed over time, encouraging the model to internalize the underlying reasoning patterns; (2) verifiable rewards for exploitation during model training, where we can obtain robust LLM-as-a-Judge scores using rubrics as references, enabling effective RL on general reasoning tasks. Extensive experiments demonstrate the superiority of the proposed RuscaRL across various benchmarks, effectively expanding reasoning boundaries under the Best-of-N evaluation. Notably, RuscaRL significantly boosts Qwen2.5-7B-Instruct from 23.6 to 50.3 on HealthBench-500, surpassing GPT-4.1. Furthermore, our fine-tuned variant on Qwen3-30B-A3B-Instruct achieves 61.1 on HealthBench-500, outperforming leading LLMs including OpenAI-o3. Our code is available at https://github.com/IANNXANG/RuscaRL.
[449] Fair Supervised Learning Through Constraints on Smooth Nonconvex Unfairness-Measure Surrogates
Zahra Khatti, Daniel P. Robinson, Frank E. Curtis
Main category: cs.LG
TL;DR: A new fair supervised learning strategy using smooth nonconvex surrogates for unfairness measures and hard constraints instead of regularization, enabling tractable optimization with multiple fairness constraints.
Details
Motivation: Existing fair ML approaches use convex surrogates that may fail to ensure fairness, and regularization methods that lead to difficult optimization problems and expensive parameter tuning.Method: Proposes smooth nonconvex surrogates based on optimization smoothing methods to approximate Heaviside functions in unfairness measures, and uses hard constraints instead of regularization to enforce fairness tolerances.
Result: The method provides tight fairness approximations, allows multiple conflicting unfairness measures simultaneously, and leads to tractable optimization problems with minimal tuning requirements.
Conclusion: The proposed strategy offers practical advantages over existing approaches by ensuring fairness through tight approximations and hard constraints while maintaining computational tractability.
Abstract: A new strategy for fair supervised machine learning is proposed. The main advantages of the proposed strategy as compared to others in the literature are as follows. (a) We introduce a new smooth nonconvex surrogate to approximate the Heaviside functions involved in discontinuous unfairness measures. The surrogate is based on smoothing methods from the optimization literature, and is new for the fair supervised learning literature. The surrogate is a tight approximation which ensures the trained prediction models are fair, as opposed to other (e.g., convex) surrogates that can fail to lead to a fair prediction model in practice. (b) Rather than rely on regularizers (that lead to optimization problems that are difficult to solve) and corresponding regularization parameters (that can be expensive to tune), we propose a strategy that employs hard constraints so that specific tolerances for unfairness can be enforced without the complications associated with the use of regularization. (c) Our proposed strategy readily allows for constraints on multiple (potentially conflicting) unfairness measures at the same time. Multiple measures can be considered with a regularization approach, but at the cost of having even more difficult optimization problems to solve and further expense for tuning. By contrast, through hard constraints, our strategy leads to optimization models that can be solved tractably with minimal tuning.
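The smoothing idea in (a) can be illustrated directly: a smooth, nonconvex surrogate replaces the Heaviside indicator inside an unfairness measure such as demographic parity. The scaled-tanh smoothing and the penalty shown in the comment are illustrative assumptions; the paper formulates hard constraints ("unfairness <= tolerance") and solves them with a dedicated constrained-optimization method rather than a penalty.

```python
# Illustrative smooth surrogate for the Heaviside step inside an unfairness measure.
import torch

def smooth_step(z, tau=0.05):
    """Smooth surrogate for 1[z >= 0]; approaches the Heaviside step as tau -> 0."""
    return 0.5 * (1.0 + torch.tanh(z / tau))

def demographic_parity_surrogate(scores, group):
    """|P(yhat=1 | g=1) - P(yhat=1 | g=0)| with the indicator replaced by smooth_step."""
    pos_rate_1 = smooth_step(scores[group == 1]).mean()
    pos_rate_0 = smooth_step(scores[group == 0]).mean()
    return (pos_rate_1 - pos_rate_0).abs()

# Inside a training step (a penalty is shown only to keep the sketch short; the paper
# enforces the tolerance as a hard constraint in the optimization problem):
# loss = task_loss + rho * torch.clamp(demographic_parity_surrogate(scores, g) - tol, min=0.0)
```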
[450] Uni-Instruct: One-step Diffusion Model through Unified Diffusion Divergence Instruction
Yifei Wang, Weimin Bai, Colin Zhang, Debing Zhang, Weijian Luo, He Sun
Main category: cs.LG
TL;DR: Uni-Instruct unifies over 10 existing one-step diffusion distillation methods into a theory-driven framework based on diffusion expansion theory of f-divergence family, achieving state-of-the-art performance on image generation benchmarks.
Details
Motivation: To provide a unified theoretical framework for understanding and improving one-step diffusion distillation approaches, overcoming the intractability issues in existing methods.Method: Proposes diffusion expansion theory of f-divergence family and introduces key theories to overcome intractability, resulting in an equivalent tractable loss for training one-step diffusion models by minimizing expanded f-divergence.
Result: Achieves record-breaking FID scores: 1.46 (unconditional) and 1.38 (conditional) on CIFAR10, and 1.02 on ImageNet-64x64, outperforming 79-step teacher diffusion. Also shows decent performance in text-to-3D generation, slightly outperforming SDS and VSD.
Conclusion: Uni-Instruct provides both solid theoretical unification and empirical improvements for one-step diffusion distillation, potentially benefiting future studies on diffusion model knowledge transfer.
Abstract: In this paper, we unify more than 10 existing one-step diffusion distillation approaches, such as Diff-Instruct, DMD, SIM, SiD, $f$-distill, etc., inside a theory-driven framework which we name the Uni-Instruct. Uni-Instruct is motivated by our proposed diffusion expansion theory of the $f$-divergence family. Then we introduce key theories that overcome the intractability issue of the original expanded $f$-divergence, resulting in an equivalent yet tractable loss that effectively trains one-step diffusion models by minimizing the expanded $f$-divergence family. The novel unification introduced by Uni-Instruct not only offers new theoretical contributions that help understand existing approaches from a high-level perspective but also leads to state-of-the-art one-step diffusion generation performances. On the CIFAR10 generation benchmark, Uni-Instruct achieves record-breaking Frechet Inception Distance (FID) values of 1.46 for unconditional generation and 1.38 for conditional generation. On the ImageNet-$64\times 64$ generation benchmark, Uni-Instruct achieves a new SoTA one-step generation FID of 1.02, which outperforms its 79-step teacher diffusion with a significant improvement margin of 1.33 (1.02 vs 2.35). We also apply Uni-Instruct on broader tasks like text-to-3D generation. For text-to-3D generation, Uni-Instruct gives decent results, which slightly outperforms previous methods, such as SDS and VSD, in terms of both generation quality and diversity. Both the solid theoretical and empirical contributions of Uni-Instruct will potentially help future studies on one-step diffusion distillation and knowledge transferring of diffusion models.
[451] Conformal Prediction for Time-series Forecasting with Change Points
Sophia Sun, Rose Yu
Main category: cs.LG
TL;DR: A novel conformal prediction algorithm called CPTC that handles time series with change points by integrating state prediction with online conformal prediction.
Details
Motivation: Current conformal prediction methods struggle with time series data containing change points - sudden shifts in the underlying data-generating process.Method: CPTC integrates a model to predict the underlying state with online conformal prediction to model uncertainties in non-stationary time series.
Result: Proved CPTC’s validity and improved adaptivity under minimum assumptions, and demonstrated practical effectiveness on 6 synthetic and real-world datasets with improved validity and adaptivity compared to state-of-the-art baselines.
Conclusion: CPTC successfully addresses the gap in handling time series with change points through integrated state prediction and online conformal prediction.
Abstract: Conformal prediction has been explored as a general and efficient way to provide uncertainty quantification for time series. However, current methods struggle to handle time series data with change points - sudden shifts in the underlying data-generating process. In this paper, we propose a novel Conformal Prediction for Time-series with Change points (CPTC) algorithm, addressing this gap by integrating a model to predict the underlying state with online conformal prediction to model uncertainties in non-stationary time series. We prove CPTC’s validity and improved adaptivity in the time series setting under minimum assumptions, and demonstrate CPTC’s practical effectiveness on 6 synthetic and real-world datasets, showing improved validity and adaptivity compared to state-of-the-art baselines.
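The combination of a state predictor with online conformal calibration can be sketched as follows: keep a separate score history and adaptive miscoverage level per predicted regime, so intervals re-calibrate quickly after a change point. The adaptive-conformal-style level update and the per-state quantile tracking below are illustrative; CPTC's actual algorithm and its validity guarantees differ in detail.

```python
# Hedged sketch of state-conditional online conformal prediction.
import numpy as np
from collections import defaultdict

class StateConditionalConformal:
    def __init__(self, alpha=0.1, gamma=0.02):
        self.alpha, self.gamma = alpha, gamma
        self.scores = defaultdict(list)            # per-state nonconformity scores
        self.alpha_t = defaultdict(lambda: alpha)  # per-state adaptive level

    def interval(self, state, y_pred):
        s = self.scores[state]
        if not s:
            return (-np.inf, np.inf)               # no calibration data for this state yet
        q = np.quantile(s, 1 - self.alpha_t[state])
        return (y_pred - q, y_pred + q)

    def update(self, state, y_pred, y_true):
        lo, hi = self.interval(state, y_pred)
        miss = float(not (lo <= y_true <= hi))
        # Adaptive-conformal-style correction: widen after misses, tighten after hits.
        self.alpha_t[state] = float(np.clip(
            self.alpha_t[state] + self.gamma * (self.alpha - miss), 1e-3, 0.999))
        self.scores[state].append(abs(y_true - y_pred))

# Usage: state = state_model.predict(x_t); lo, hi = cp.interval(state, f(x_t)); cp.update(state, f(x_t), y_t)
```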
[452] Discrete Neural Flow Samplers with Locally Equivariant Transformer
Zijing Ou, Ruixiang Zhang, Yingzhen Li
Main category: cs.LG
TL;DR: DNFS is a trainable framework for discrete sampling that learns rate matrices for continuous-time Markov chains, using control variates for variance reduction and locally equivariant Transformers for efficient parameterization.
Details
Motivation: Markov chain Monte Carlo methods for discrete sampling often suffer from slow mixing and poor convergence, motivating the need for more efficient and trainable sampling frameworks.Method: Learn rate matrices of continuous-time Markov chains to satisfy Kolmogorov equation, use control variates for variance reduction in partition function estimation, and employ locally equivariant Transformers for efficient parameterization.
Result: DNFS demonstrates efficacy in sampling from unnormalised distributions, training discrete energy-based models, and solving combinatorial optimisation problems across various applications.
Conclusion: DNFS provides an effective and efficient framework for discrete sampling that outperforms traditional MCMC methods through learnable rate matrices and optimized parameterization.
Abstract: Sampling from unnormalised discrete distributions is a fundamental problem across various domains. While Markov chain Monte Carlo offers a principled approach, it often suffers from slow mixing and poor convergence. In this paper, we propose Discrete Neural Flow Samplers (DNFS), a trainable and efficient framework for discrete sampling. DNFS learns the rate matrix of a continuous-time Markov chain such that the resulting dynamics satisfy the Kolmogorov equation. As this objective involves the intractable partition function, we then employ control variates to reduce the variance of its Monte Carlo estimation, leading to a coordinate descent learning algorithm. To further facilitate computational efficiency, we propose the locally equivariant Transformer, a novel parameterisation of the rate matrix that significantly improves training efficiency while preserving powerful network expressiveness. Empirically, we demonstrate the efficacy of DNFS in a wide range of applications, including sampling from unnormalised distributions, training discrete energy-based models, and solving combinatorial optimisation problems.
[453] Unfolding Generative Flows with Koopman Operators: Fast and Interpretable Sampling
Erkan Turan, Aristotelis Siozopoulos, Louis Martinez, Julien Gaubil, Emery Pierson, Maks Ovsjanikov
Main category: cs.LG
TL;DR: The paper proposes using Koopman theory to linearize Continuous Normalizing Flows (CNFs), enabling one-step sampling and providing interpretability through spectral analysis.
Details
Motivation: CNFs suffer from slow sampling due to solving nonlinear ODEs with hundreds of function evaluations. Existing acceleration methods still result in nonlinear black box dynamics, limiting efficiency and interpretability.Method: Lift Conditional Flow Matching (CFM) into a higher-dimensional Koopman space to represent evolution with a single linear operator, enabling closed-form sampling via matrix exponential and spectral analysis.
Result: Achieves competitive sample quality with dramatic speedups while enabling spectral analysis of generative flows through Koopman eigenvalues and modes.
Conclusion: Koopman linearization provides a fundamentally different approach that enables both efficient one-step sampling and novel interpretability for generative flows.
Abstract: Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by slow sampling: producing a single sample requires solving a nonlinear ODE with hundreds of function evaluations. Recent approaches such as Rectified Flow and OT-CFM accelerate sampling by straightening trajectories, yet the learned dynamics remain nonlinear black boxes, limiting both efficiency and interpretability. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory. By lifting Conditional Flow Matching (CFM) into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. This yields two key benefits. First, sampling becomes one-step and parallelizable, computed in closed form via the matrix exponential. Second, the Koopman operator provides a spectral blueprint of generation, enabling novel interpretability through its eigenvalues and modes. We derive a practical, simulation-free training objective that enforces infinitesimal consistency with the teacher’s dynamics and show that this alignment preserves fidelity along the full generative path, distinguishing our method from boundary-only distillation. Empirically, our approach achieves competitive sample quality with dramatic speedups, while uniquely enabling spectral analysis of generative flows.
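The one-step sampling path is easy to sketch once the flow is linear in a lifted space: encode base noise into the Koopman space, apply the matrix exponential of the learned operator, and decode. The encoder, decoder, and operator below are untrained placeholders, and the architecture is purely illustrative; the point is the closed-form evolution that replaces an ODE solve, plus the eigenvalues of the operator as an interpretable spectral summary.

```python
# Sketch of one-step sampling via a Koopman-linearized generative flow.
import torch
import torch.nn as nn

class KoopmanSampler(nn.Module):
    def __init__(self, data_dim=2, koop_dim=64):
        super().__init__()
        self.data_dim = data_dim
        self.lift = nn.Sequential(nn.Linear(data_dim, 128), nn.SiLU(), nn.Linear(128, koop_dim))
        self.proj = nn.Sequential(nn.Linear(koop_dim, 128), nn.SiLU(), nn.Linear(128, data_dim))
        self.K = nn.Parameter(torch.randn(koop_dim, koop_dim) * 0.01)   # learned Koopman generator

    @torch.no_grad()
    def sample(self, n, t=1.0):
        x0 = torch.randn(n, self.data_dim)            # base noise
        z0 = self.lift(x0)                            # lift into Koopman space
        propagator = torch.matrix_exp(self.K * t)     # closed-form linear time evolution
        return self.proj(z0 @ propagator.T)           # decode back to data space

    def spectrum(self):
        # Koopman eigenvalues: the "spectral blueprint" of the generative flow.
        return torch.linalg.eigvals(self.K)

# sampler = KoopmanSampler(); x = sampler.sample(1024); eigs = sampler.spectrum()
```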
[454] Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking
Chen-Hao Chao, Wei-Fang Sun, Hanwen Liang, Chun-Yi Lee, Rahul G. Krishnan
Main category: cs.LG
TL;DR: Prime introduces partial masking states in masked diffusion models to reduce redundant computation and improve efficiency by allowing tokens to take intermediate states between masked and unmasked, enabling fine-grained denoising.
Details
Motivation: Token sequences often remain unchanged between consecutive sampling steps in masked diffusion models, leading to redundant computation as the model repeatedly processes identical inputs.Method: Proposes Partial masking scheme (Prime) that allows tokens to take intermediate states interpolated between masked and unmasked states, enabling predictions based on partially observed token information. Derives variational training objective and introduces architectural design for intermediate-state inputs.
Result: Achieves perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and hybrid variants (17.58). On image data, attains FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.
Conclusion: Prime demonstrates superior performance across diverse generative modeling tasks without relying on autoregressive formulation, achieving state-of-the-art results on both text and image data.
Abstract: Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation. On image data, it attains competitive FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.
[455] Q-Palette: Fractional-Bit Quantizers Toward Optimal Bit Allocation for Efficient LLM Deployment
Deokjae Lee, Hyun Oh Song
Main category: cs.LG
TL;DR: Q-Palette is a weight-only post-training quantization method for LLMs that uses fractional-bit quantizers and a mixed-scheme framework to optimize quantization performance under resource constraints.
Details
Motivation: Weight-only PTQ is crucial for reducing LLM memory footprint and latency in edge scenarios, but irregular weight distributions with heavy-tailed outliers complicate quantization. Rotation-based methods transform weights into near-Gaussian distributions, but optimal bit allocation requires fractional-bit quantizers.Method: Introduces Q-Palette - a collection of fractional-bit quantizers (trellis-coded, vector, and scalar quantizers) with optimized CUDA kernels. Proposes a mixed-scheme quantization framework that jointly optimizes quantizer choices and layer fusion decisions given resource constraints.
Result: The method achieves near-optimal quantization performance by approaching the Gaussian distortion-rate bound through fine-grained fractional-bit quantizers.
Conclusion: Q-Palette provides a versatile quantization solution that bridges theoretical optimal bit allocation with practical implementation, enabling efficient LLM inference on memory-constrained devices.
Abstract: We study weight-only post-training quantization (PTQ), which quantizes the weights of a large language model (LLM) without retraining, using little or no calibration data. Weight-only PTQ is crucial for reducing the memory footprint and latency of LLM inference, especially in memory-bound, small-batch inference scenarios, such as personalized inference on edge devices. Despite its importance, irregular weight distributions with heavy-tailed outliers in LLMs complicate quantization, recently motivating rotation-based methods that transform weights into near-Gaussian distributions, which are more regular with fewer outliers, thereby reducing quantization error. In this work, we first derive the information-theoretically optimal bit allocation for Gaussianized weights under given bit budgets, revealing that fine-grained fractional-bit quantizers approaching the Gaussian distortion-rate bound are essential to achieve near-optimal quantization performance. To bridge this theoretical insight and practical implementation, we introduce Q-Palette, a versatile collection of fractional-bit quantizers that range from trellis-coded quantizers offering near-optimal distortion to simpler vector and scalar quantizers optimized for faster inference, all efficiently implemented with optimized CUDA kernels across various bitwidths. Furthermore, leveraging Q-Palette as a foundational component, we propose a novel mixed-scheme quantization framework, jointly optimizing quantizer choices and layer fusion decisions given resource constraints. The code is available at https://github.com/snu-mllab/Q-Palette.
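To convey why fractional bitwidths matter for bit allocation, here is a toy sketch (not Q-Palette itself): layers whose Gaussianized weights have larger variance receive more, possibly fractional, bits under a total budget, using the Gaussian-style distortion proxy D(b) ~ sigma^2 * 2^(-2b). The fractional step size, the bit range, and the greedy marginal-gain scheme are illustrative assumptions; the paper's framework additionally chooses among quantizer families and layer-fusion options.

```python
# Toy greedy allocation of a fractional-bit budget across layers.
import heapq

def distortion(sigma2, bits):
    return sigma2 * 2.0 ** (-2.0 * bits)

def allocate_bits(layer_sigma2, layer_sizes, avg_bits=3.0, step=0.25, min_bits=1.5, max_bits=6.0):
    total_budget = avg_bits * sum(layer_sizes)
    bits = [min_bits] * len(layer_sigma2)
    spent = sum(b * n for b, n in zip(bits, layer_sizes))
    # Max-heap keyed on distortion reduction per extra bit spent.
    heap = []
    for i, (s2, n) in enumerate(zip(layer_sigma2, layer_sizes)):
        gain = (distortion(s2, bits[i]) - distortion(s2, bits[i] + step)) * n
        heapq.heappush(heap, (-gain / (step * n), i))
    while heap and spent + step * min(layer_sizes) <= total_budget:
        _, i = heapq.heappop(heap)
        if bits[i] + step > max_bits or spent + step * layer_sizes[i] > total_budget:
            continue   # this layer no longer fits; drop it from consideration
        bits[i] += step
        spent += step * layer_sizes[i]
        gain = (distortion(layer_sigma2[i], bits[i])
                - distortion(layer_sigma2[i], bits[i] + step)) * layer_sizes[i]
        heapq.heappush(heap, (-gain / (step * layer_sizes[i]), i))
    return bits

# Example: allocate_bits(layer_sigma2=[0.5, 0.1, 0.9], layer_sizes=[4096, 4096, 16384])
```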
[456] Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics
Tobias Würth, Niklas Freymuth, Gerhard Neumann, Luise Kärger
Main category: cs.LG
TL;DR: ROBIN is a novel graph-based learned simulator that uses rolling diffusion-batched inference and hierarchical graph neural networks to overcome limitations of existing simulators in capturing global phenomena and reducing error accumulation in solid mechanics.
Details
Motivation: Existing graph-based learned simulators struggle with capturing global phenomena like bending and long-range correlations in solid mechanics, and suffer from error accumulation over long rollouts due to local message passing and direct next-step prediction.Method: ROBIN integrates two innovations: (1) Rolling Diffusion-Batched Inference (ROBI) - a parallelized inference scheme that amortizes diffusion-based refinement costs across physical time steps, and (2) Hierarchical Graph Neural Network built on algebraic multigrid coarsening for multiscale message passing across different mesh resolutions.
Result: ROBIN achieves state-of-the-art accuracy on challenging 2D and 3D solid mechanics benchmarks with geometric, material, and contact nonlinearities, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.
Conclusion: ROBIN successfully addresses the limitations of current graph-based learned simulators by enabling efficient capture of both local dynamics and global structural effects, making it highly effective for complex solid mechanics simulations.
Abstract: Graph-based learned simulators have emerged as a promising approach for simulating physical systems on unstructured meshes, offering speed and generalization across diverse geometries. However, they often struggle with capturing global phenomena, such as bending or long-range correlations usually occurring in solid mechanics, and suffer from error accumulation over long rollouts due to their reliance on local message passing and direct next-step prediction. We address these limitations by introducing the Rolling Diffusion-Batched Inference Network (ROBIN), a novel learned simulator that integrates two key innovations: (i) Rolling Diffusion-Batched Inference (ROBI), a parallelized inference scheme that amortizes the cost of diffusion-based refinement across physical time steps by overlapping denoising steps across a temporal window. (ii) A Hierarchical Graph Neural Network built on algebraic multigrid coarsening, enabling multiscale message passing across different mesh resolutions. This architecture, implemented via Algebraic-hierarchical Message Passing Networks, captures both fine-scale local dynamics and global structural effects critical for phenomena like beam bending or multi-body contact. We validate ROBIN on challenging 2D and 3D solid mechanics benchmarks involving geometric, material, and contact nonlinearities. ROBIN achieves state-of-the-art accuracy on all tasks, substantially outperforming existing next-step learned simulators while reducing inference time by up to an order of magnitude compared to standard diffusion simulators.
[457] Semi-off-Policy Reinforcement Learning for Vision-Language Slow-Thinking Reasoning
Junhao Shen, Haiteng Zhao, Yuzhe Gu, Songyang Gao, Kuikun Liu, Haian Huang, Jianfei Gao, Dahua Lin, Wenwei Zhang, Kai Chen
Main category: cs.LG
TL;DR: SOPHIA is a semi-off-policy RL method that enhances LVLMs’ slow-thinking reasoning by combining on-policy visual understanding with off-policy reasoning from language models, using outcome-based rewards and visual reward propagation.
Details
Motivation: LVLMs trained with vision-language alignment struggle with complex multimodal tasks due to limited rollout space for on-policy RL, and direct off-policy RL from external models causes visual hallucinations from mismatched visual perception abilities.Method: Builds semi-off-policy behavior model combining trainable LVLM’s visual understanding with language model’s reasoning, assigns outcome-based rewards to reasoning, propagates visual rewards backward, and learns from reasoning trajectories via off-policy RL.
Result: Improves InternVL3.0-38B by 8.50% on average, achieves SOTA among open-source LVLMs on multiple benchmarks, outperforms GPT-4.1 on MathVision (49.08%) and OlympiadBench (49.95%), and beats supervised fine-tuning and direct on-policy RL.
Conclusion: SOPHIA effectively enhances LVLMs’ slow-thinking reasoning, provides better policy initialization for further training, and demonstrates superior performance over existing methods on challenging multimodal reasoning tasks.
Abstract: Enhancing large vision-language models (LVLMs) with visual slow-thinking reasoning is crucial for solving complex multimodal tasks. However, since LVLMs are mainly trained with vision-language alignment, it is difficult to adopt on-policy reinforcement learning (RL) to develop the slow thinking ability because the rollout space is restricted by its initial abilities. Off-policy RL offers a way to go beyond the current policy, but directly distilling trajectories from external models may cause visual hallucinations due to mismatched visual perception abilities across models. To address these issues, this paper proposes SOPHIA, a simple and scalable Semi-Off-Policy RL for vision-language slow-tHInking reAsoning. SOPHIA builds a semi-off-policy behavior model by combining on-policy visual understanding from a trainable LVLM with off-policy slow-thinking reasoning from a language model, assigns outcome-based rewards to reasoning, and propagates visual rewards backward. Then the LVLM learns slow-thinking reasoning ability from the obtained reasoning trajectories using propagated rewards via off-policy RL algorithms. Extensive experiments with InternVL2.5 and InternVL3.0 with 8B and 38B sizes show the effectiveness of SOPHIA. Notably, SOPHIA improves InternVL3.0-38B by 8.50% on average, reaching state-of-the-art performance among open-source LVLMs on multiple multimodal reasoning benchmarks, and even outperforms some closed-source models (e.g., GPT-4.1) on the challenging MathVision and OlympiadBench, achieving 49.08% and 49.95% pass@1 accuracy, respectively. Analysis shows SOPHIA outperforms supervised fine-tuning and direct on-policy RL methods, offering a better policy initialization for further on-policy training.
[458] ACT: Agentic Classification Tree
Vincent Grari, Tim Arni, Thibault Laugel, Sylvain Lamprier, James Zou, Marcin Detyniecki
Main category: cs.LG
TL;DR: ACT extends decision trees to handle unstructured text data by using natural-language questions as splits, refined through impurity evaluation and LLM feedback via TextGrad, achieving competitive performance while maintaining transparency.
Details
Motivation: AI systems in high-stakes settings need transparent, interpretable decisions. Decision trees provide clear rules but only work with tabular data, while LLMs handle unstructured data but lack trustworthy reasoning processes.
Method: ACT formulates each decision tree split as a natural-language question, refined through impurity-based evaluation and LLM feedback using TextGrad to optimize the questions.
Result: Experiments on text benchmarks show ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.
Conclusion: ACT successfully bridges the gap between interpretable decision trees and unstructured data processing, providing both performance and transparency for high-stakes AI applications.
Abstract: When used in high-stakes settings, AI systems are expected to produce decisions that are transparent, interpretable, and auditable, a requirement increasingly expected by regulations. Decision trees such as CART provide clear and verifiable rules, but they are restricted to structured tabular data and cannot operate directly on unstructured inputs such as text. In practice, large language models (LLMs) are widely used for such data, yet prompting strategies such as chain-of-thought or prompt optimization still rely on free-form reasoning, limiting their ability to ensure trustworthy behaviors. We present the Agentic Classification Tree (ACT), which extends decision-tree methodology to unstructured inputs by formulating each split as a natural-language question, refined through impurity-based evaluation and LLM feedback via TextGrad. Experiments on text benchmarks show that ACT matches or surpasses prompting-based baselines while producing transparent and interpretable decision paths.
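For intuition, the sketch below shows how a natural-language question can play the role of a decision-tree split chosen by weighted Gini impurity, as ACT describes. The `ask_llm` stub stands in for an LLM call, and the TextGrad refinement loop is omitted; all names here are illustrative.
```python
# Minimal sketch of an impurity-driven split over natural-language questions,
# in the spirit of ACT. `ask_llm` is a placeholder for an LLM yes/no answer.
from collections import Counter

def gini(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def ask_llm(question, text):
    # Placeholder: ACT would query an LLM; here we use a trivial keyword check.
    return question.lower().split()[-1].rstrip("?") in text.lower()

def best_split(texts, labels, candidate_questions):
    """Pick the candidate question with the lowest weighted child impurity."""
    best_q, best_impurity = None, float("inf")
    for q in candidate_questions:
        yes = [y for t, y in zip(texts, labels) if ask_llm(q, t)]
        no = [y for t, y in zip(texts, labels) if not ask_llm(q, t)]
        impurity = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if impurity < best_impurity:
            best_q, best_impurity = q, impurity
    return best_q, best_impurity

if __name__ == "__main__":
    texts = ["loan denied due to income", "approved mortgage",
             "credit card approved", "rejected application"]
    labels = [0, 1, 1, 0]
    print(best_split(texts, labels,
                     ["Does the text mention approved?", "Does the text mention loan?"]))
```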
[459] SEC-bench: Automated Benchmarking of LLM Agents on Real-World Software Security Tasks
Hwiwon Lee, Ziqi Zhang, Hanxiao Lu, Lingming Zhang
Main category: cs.LG
TL;DR: SEC-bench is the first automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks, revealing significant performance gaps in proof-of-concept generation (18.0% success) and vulnerability patching (34.0% success).
Details
Motivation: Existing benchmarks rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice, making rigorous security-focused evaluation of LLM agents imperative for establishing trust in their safe deployment.
Method: SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. The framework creates high-quality software vulnerability datasets with reproducible artifacts at low cost ($0.87 per instance).
Result: Comprehensive evaluation of state-of-the-art LLM code agents shows significant performance gaps, achieving at most 18.0% success in proof-of-concept generation and 34.0% in vulnerability patching on the complete dataset.
Conclusion: The results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering, demonstrating the need for improved capabilities in real-world security tasks.
Abstract: Rigorous security-focused evaluation of large language model (LLM) agents is imperative for establishing trust in their safe deployment throughout the software development lifecycle. However, existing benchmarks largely rely on synthetic challenges or simplified vulnerability datasets that fail to capture the complexity and ambiguity encountered by security engineers in practice. We introduce SEC-bench, the first fully automated benchmarking framework for evaluating LLM agents on authentic security engineering tasks. SEC-bench employs a novel multi-agent scaffold that automatically constructs code repositories with harnesses, reproduces vulnerabilities in isolated environments, and generates gold patches for reliable evaluation. Our framework automatically creates high-quality software vulnerability datasets with reproducible artifacts at a cost of only $0.87 per instance. Using SEC-bench, we implement two critical software security tasks to rigorously evaluate LLM agents’ capabilities: proof-of-concept (PoC) generation and vulnerability patching. A comprehensive evaluation of state-of-the-art LLM code agents reveals significant performance gaps, achieving at most 18.0% success in PoC generation and 34.0% in vulnerability patching on our complete dataset. These results highlight the crucial steps needed toward developing LLM agents that are more practical, intelligent, and autonomous for security engineering.
[460] BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression
Cristian Meo, Varun Sarathchandran, Avijit Majhi, Shao Hung, Carlo Saccardi, Ruben Imhoff, Roberto Deidda, Remko Uijlenhoet, Justin Dauwels
Main category: cs.LG
TL;DR: BlockGPT is a new generative autoregressive transformer for precipitation nowcasting that uses batched tokenization to predict full 2D fields per time step, achieving superior accuracy and much faster inference than existing methods.
Details
Motivation: Current precipitation nowcasting methods have limitations - token-based autoregressive models suffer from flawed inductive biases and slow inference, while diffusion models are computationally intensive. There's a need for models that are both accurate and computationally efficient for real-time weather applications.
Method: BlockGPT uses a model-agnostic paradigm with batched tokenization that factorizes space-time: self-attention within each frame and causal attention across frames. It predicts complete 2D precipitation fields at each time step rather than individual tokens.
Result: BlockGPT achieves superior accuracy and event localization on KNMI (Netherlands) and SEVIR (U.S.) precipitation datasets, with inference speeds up to 31x faster than comparable baselines like NowcastingGPT and DiffCast+Phydnet.
Conclusion: BlockGPT provides an effective solution for precipitation nowcasting that balances accuracy with computational efficiency, making it suitable for real-time weather forecasting applications.
Abstract: Predicting precipitation maps is a highly complex spatiotemporal modeling task, critical for mitigating the impacts of extreme weather events. Short-term precipitation forecasting, or nowcasting, requires models that are not only accurate but also computationally efficient for real-time applications. Current methods, such as token-based autoregressive models, often suffer from flawed inductive biases and slow inference, while diffusion models can be computationally intensive. To address these limitations, we introduce BlockGPT, a generative autoregressive transformer using batched tokenization (Block) method that predicts full two-dimensional fields (frames) at each time step. Conceived as a model-agnostic paradigm for video prediction, BlockGPT factorizes space-time by using self-attention within each frame and causal attention across frames; in this work, we instantiate it for precipitation nowcasting. We evaluate BlockGPT on two precipitation datasets, viz. KNMI (Netherlands) and SEVIR (U.S.), comparing it to state-of-the-art baselines including token-based (NowcastingGPT) and diffusion-based (DiffCast+Phydnet) models. The results show that BlockGPT achieves superior accuracy, event localization as measured by categorical metrics, and inference speeds up to 31x faster than comparable baselines.
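A rough sketch (assuming PyTorch) of the space-time factorization described above: full self-attention among tokens within each frame, followed by causal attention across frames at each spatial position. Shapes, module names, and the single-block structure are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                    # x: (batch, time, tokens, dim)
        b, t, n, d = x.shape
        # 1) self-attention within each frame
        xs = x.reshape(b * t, n, d)
        xs, _ = self.spatial(xs, xs, xs)
        x = xs.reshape(b, t, n, d)
        # 2) causal attention across frames, independently per spatial token
        xt = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        xt, _ = self.temporal(xt, xt, xt, attn_mask=causal)
        return xt.reshape(b, n, t, d).permute(0, 2, 1, 3)

if __name__ == "__main__":
    block = FactorizedSpaceTimeBlock(dim=32)
    out = block(torch.randn(2, 6, 16, 32))   # 6 frames of 16 tokens each
    print(out.shape)                         # torch.Size([2, 6, 16, 32])
```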
[461] Heavy-Ball Momentum Method in Continuous Time and Discretization Error Analysis
Bochen Lyu, Xiaojing Zhang, Fangyi Zheng, He Wang, Zheng Wang, Zhanxing Zhu
Main category: cs.LG
TL;DR: This paper develops a continuous time approximation with explicit discretization error control for the Heavy-Ball momentum method, providing theoretical tools to bridge the gap between discrete optimization methods and continuous differential equations.
Details
Motivation: To comprehensively bridge the gap between discrete Heavy-Ball momentum dynamics and continuous time approximations by accounting for discretization error, which has not been fully addressed despite momentum's crucial role in gradient-based optimization.
Method: Design a first-order piece-wise continuous differential equation with added counter terms to explicitly account for discretization error, allowing control of error to arbitrary order of step size.
Result: Developed a continuous time model for HB momentum method with controlled discretization error, enabling analysis of implicit regularization of directional smoothness and implicit bias for diagonal linear networks.
Conclusion: The proposed continuous approximation with explicit discretization error control provides valuable theoretical tools for studying optimization methods, with applications in deep learning as demonstrated through implicit bias analysis and numerical validation.
Abstract: This paper establishes a continuous time approximation, a piece-wise continuous differential equation, for the discrete Heavy-Ball (HB) momentum method with explicit discretization error. Investigating continuous differential equations has been a promising approach for studying the discrete optimization methods. Despite the crucial role of momentum in gradient-based optimization methods, the gap between the original discrete dynamics and the continuous time approximations due to the discretization error has not been comprehensively bridged yet. In this work, we study the HB momentum method in continuous time while putting more focus on the discretization error to provide additional theoretical tools to this area. In particular, we design a first-order piece-wise continuous differential equation, where we add a number of counter terms to account for the discretization error explicitly. As a result, we provide a continuous time model for the HB momentum method that allows the control of discretization error to arbitrary order of the step size. As an application, we leverage it to find a new implicit regularization of the directional smoothness and investigate the implicit bias of HB for diagonal linear networks, indicating how our results can be used in deep learning. Our theoretical findings are further supported by numerical experiments.
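As a point of reference (a standard recap, not the paper's exact counter-term construction), the discrete Heavy-Ball recursion and its classical continuous-time limit read
$$x_{k+1} = x_k - \eta \, \nabla f(x_k) + \beta \,(x_k - x_{k-1}), \qquad \ddot{X}(t) + a\,\dot{X}(t) + b\,\nabla f(X(t)) = 0,$$
with coefficients $a, b$ determined by $\beta$ and the step size $\eta$. The paper replaces this idealized limit with a piece-wise continuous equation carrying explicit counter terms, so the mismatch between the iterates $x_k$ and the continuous trajectory at time $k\eta$ can be controlled to any desired order in $\eta$.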
[462] Learning What Matters: Steering Diffusion via Spectrally Anisotropic Forward Noise
Luca Scimeca, Thomas Jiralerspong, Berton Earnshaw, Jason Hartford, Yoshua Bengio
Main category: cs.LG
TL;DR: This paper introduces spectrally anisotropic Gaussian diffusion (SAGD), which replaces isotropic noise in diffusion models with structured frequency-diagonal covariance to shape inductive biases and better accommodate target data distributions.
Details
Motivation: To build explicit inductive biases into diffusion models that better match the target data distribution, rather than relying on implicit biases from isotropic noise.
Method: Proposes an anisotropic noise operator with frequency-diagonal covariance that unifies band-pass masks and power-law weightings, allowing selective emphasis or suppression of frequency bands while maintaining Gaussian forward process.
Result: The method outperforms standard diffusion models across several vision datasets and enables selective omission of known corruptions confined to specific frequency bands.
Conclusion: Anisotropic forward noise provides a principled way to tailor inductive bias in diffusion probabilistic models, offering better performance and corruption handling capabilities.
Abstract: Diffusion Probabilistic Models (DPMs) have achieved strong generative performance, yet their inductive biases remain largely implicit. In this work, we aim to build inductive biases into the training and sampling of diffusion models to better accommodate the target distribution of the data to model. We introduce an anisotropic noise operator that shapes these biases by replacing the isotropic forward covariance with a structured, frequency-diagonal covariance. This operator unifies band-pass masks and power-law weightings, allowing us to emphasize or suppress designated frequency bands, while keeping the forward process Gaussian. We refer to this as spectrally anisotropic Gaussian diffusion (SAGD). In this work, we derive the score relation for anisotropic covariances and show that, under full support, the learned score converges to the true data score as $t \to 0$, while anisotropy reshapes the probability-flow path from noise to data. Empirically, we show the induced anisotropy outperforms standard diffusion across several vision datasets, and enables selective omission: learning while ignoring known corruptions confined to specific bands. Together, these results demonstrate that carefully designed anisotropic forward noise provides a simple, yet principled, handle to tailor inductive bias in DPMs.
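The sketch below illustrates one way to draw spectrally anisotropic Gaussian noise: white noise is reweighted in the Fourier domain by a frequency-dependent standard deviation (here a simple power law), i.e. a diagonal covariance in frequency space. This mirrors the idea of the forward process described above but is not the authors' code; `alpha` and the specific weighting are assumptions.
```python
import numpy as np

def anisotropic_noise(shape, alpha=1.0, eps=1e-3, seed=0):
    """Sample 2D noise whose spectral std follows |f|^(-alpha) (power-law weighting)."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(shape)
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    radius = np.sqrt(fx**2 + fy**2) + eps          # avoid division by zero at DC
    weight = radius ** (-alpha)                    # emphasize low frequencies
    spectrum = np.fft.fft2(white) * weight
    noise = np.fft.ifft2(spectrum).real
    return noise / noise.std()                     # normalize overall scale

if __name__ == "__main__":
    z = anisotropic_noise((64, 64), alpha=1.5)
    print(z.shape, round(float(z.std()), 3))
```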
[463] gLSTM: Mitigating Over-Squashing by Increasing Storage Capacity
Hugh Blayney, Álvaro Arroyo, Xiaowen Dong, Michael M. Bronstein
Main category: cs.LG
TL;DR: The paper re-examines GNN over-squashing through storage/retrieval capacity lens, introduces a new synthetic task to demonstrate capacity saturation, and develops a novel GNN architecture inspired by sequence modeling techniques that shows strong performance on both synthetic and real-world benchmarks.
Details
Motivation: GNNs suffer from over-squashing where information from large receptive fields collapses into fixed-size vectors, creating an information bottleneck. The authors aim to understand this phenomenon through the lens of model storage and retrieval capacity.
Method: The authors study limitations of existing over-squashing measurement tasks, introduce a new synthetic capacity task, and adapt ideas from sequence modeling (associative memories, fast weight programmers, xLSTM) to develop a novel GNN architecture with improved capacity.
Result: The proposed architecture demonstrates strong performance on both the synthetic capacity task and a range of real-world graph benchmarks, showing improved handling of information bottlenecks.
Conclusion: By reframing over-squashing as a capacity limitation and incorporating sequence modeling techniques, the authors develop a more effective GNN architecture that addresses information bottleneck issues in graph neural networks.
Abstract: Graph Neural Networks (GNNs) leverage the graph structure to transmit information between nodes, typically through the message-passing mechanism. While these models have found a wide variety of applications, they are known to suffer from over-squashing, where information from a large receptive field of node representations is collapsed into a single fixed sized vector, resulting in an information bottleneck. In this paper, we re-examine the over-squashing phenomenon through the lens of model storage and retrieval capacity, which we define as the amount of information that can be stored in a node’s representation for later use. We study some of the limitations of existing tasks used to measure over-squashing and introduce a new synthetic task to demonstrate that an information bottleneck can saturate this capacity. Furthermore, we adapt ideas from the sequence modeling literature on associative memories, fast weight programmers, and the xLSTM model to develop a novel GNN architecture with improved capacity. We demonstrate strong performance of this architecture both on our capacity synthetic task, as well as a range of real-world graph benchmarks.
[464] Generating Directed Graphs with Dual Attention and Asymmetric Encoding
Alba Carballo-Castro, Manuel Madeira, Yiming Qin, Dorina Thanou, Pascal Frossard
Main category: cs.LG
TL;DR: Directo is the first generative model for directed graphs using discrete flow matching, addressing challenges in modeling edge directionality through positional encodings, dual-attention mechanisms, and a robust generative framework.
Details
Motivation: Directed graphs are essential in many applications but remain underexplored in generation tasks due to the complexity of modeling edge directionality and lack of standardized benchmarks.
Method: Combines principled positional encodings for asymmetric relations, dual-attention mechanism capturing both incoming and outgoing dependencies, and a discrete flow matching framework.
Result: The method performs strongly across diverse settings and competes with specialized models for particular graph classes like directed acyclic graphs.
Conclusion: Directo establishes a solid foundation for future research in directed graph generation, demonstrating effectiveness and generality across various applications.
Abstract: Directed graphs naturally model systems with asymmetric, ordered relationships, essential to applications in biology, transportation, social networks, and visual understanding. Generating such graphs enables tasks such as simulation, data augmentation and novel instance discovery; however, directed graph generation remains underexplored. We identify two key factors limiting progress in this direction: first, modeling edge directionality introduces a substantially larger dependency space, making the underlying distribution harder to learn; second, the absence of standardized benchmarks hinders rigorous evaluation. Addressing the former requires more expressive models that are sensitive to directional topologies. We propose Directo, the first generative model for directed graphs built upon the discrete flow matching framework. Our approach combines: (i) principled positional encodings tailored to asymmetric pairwise relations, (ii) a dual-attention mechanism capturing both incoming and outgoing dependencies, and (iii) a robust, discrete generative framework. To support evaluation, we introduce a benchmark suite covering synthetic and real-world datasets. It shows that our method performs strongly across diverse settings and even competes with specialized models for particular classes, such as directed acyclic graphs. Our results highlight the effectiveness and generality of our approach, establishing a solid foundation for future research in directed graph generation.
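A toy numpy sketch of a "dual" aggregation for directed graphs: each node attends separately over its incoming and outgoing neighbors and concatenates the two summaries. This is purely illustrative of modeling both edge directions, not the Directo architecture; all names are assumptions.
```python
import numpy as np

def masked_softmax(scores, mask):
    w = np.exp(scores - scores.max(axis=-1, keepdims=True)) * mask
    denom = w.sum(axis=-1, keepdims=True)
    return np.divide(w, denom, out=np.zeros_like(w), where=denom > 0)

def dual_attention(X, A):
    """X: (n, d) node features; A: (n, n) adjacency with A[i, j] = 1 for an edge i -> j."""
    scores = X @ X.T / np.sqrt(X.shape[1])
    out_msg = masked_softmax(scores, A.astype(bool)) @ X       # attend over j with i -> j
    in_msg = masked_softmax(scores, A.T.astype(bool)) @ X      # attend over j with j -> i
    return np.concatenate([out_msg, in_msg], axis=-1)

if __name__ == "__main__":
    A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])            # directed 3-cycle
    H = dual_attention(np.random.default_rng(0).normal(size=(3, 4)), A)
    print(H.shape)                                             # (3, 8)
```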
[465] LeapFactual: Reliable Visual Counterfactual Explanation Using Conditional Flow Matching
Zhuo Cao, Xuan Zhao, Lena Krieger, Hanno Scharr, Ira Assent
Main category: cs.LG
TL;DR: LeapFactual is a novel counterfactual explanation algorithm using conditional flow matching that generates reliable counterfactuals even when true and learned decision boundaries diverge, overcoming limitations of existing methods like gradient vanishing and discontinuous latent spaces.
Details
Motivation: Current counterfactual generation methods have critical limitations including gradient vanishing, discontinuous latent spaces, and overreliance on alignment between learned and true decision boundaries, which hinders their reliability in high-stakes domains like healthcare and scientific research.
Method: Proposes LeapFactual based on conditional flow matching - a model-agnostic approach that doesn’t require differentiable loss functions and can handle human-in-the-loop systems, expanding counterfactual explanations to domains requiring human annotators.
Result: Extensive experiments on benchmark and real-world datasets show LeapFactual generates accurate, in-distribution counterfactual explanations with actionable insights. Reliable counterfactual samples can be used as new training data to enhance models.
Conclusion: LeapFactual is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability, addressing critical limitations of existing counterfactual explanation methods.
Abstract: The growing integration of machine learning (ML) and artificial intelligence (AI) models into high-stakes domains such as healthcare and scientific research calls for models that are not only accurate but also interpretable. Among the existing explainable methods, counterfactual explanations offer interpretability by identifying minimal changes to inputs that would alter a model’s prediction, thus providing deeper insights. However, current counterfactual generation methods suffer from critical limitations, including gradient vanishing, discontinuous latent spaces, and an overreliance on the alignment between learned and true decision boundaries. To overcome these limitations, we propose LeapFactual, a novel counterfactual explanation algorithm based on conditional flow matching. LeapFactual generates reliable and informative counterfactuals, even when true and learned decision boundaries diverge. Following a model-agnostic approach, LeapFactual is not limited to models with differentiable loss functions. It can even handle human-in-the-loop systems, expanding the scope of counterfactual explanations to domains that require the participation of human annotators, such as citizen science. We provide extensive experiments on benchmark and real-world datasets showing that LeapFactual generates accurate and in-distribution counterfactual explanations that offer actionable insights. We observe, for instance, that our reliable counterfactual samples with labels aligning to ground truth can be beneficially used as new training data to enhance the model. The proposed method is broadly applicable and enhances both scientific knowledge discovery and non-expert interpretability.
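A minimal sketch (assuming PyTorch) of a conditional flow matching training step, the generative backbone LeapFactual builds on: a velocity field conditioned on the target class is regressed onto the straight-line displacement between noise and data. The network `VelocityNet` and its conditioning are placeholders, not the paper's architecture.
```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, dim, num_classes, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(num_classes, hidden)
        self.net = nn.Sequential(
            nn.Linear(dim + 1 + hidden, hidden), nn.SiLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x_t, t, y):
        return self.net(torch.cat([x_t, t, self.embed(y)], dim=-1))

def cfm_loss(v_theta, x1, y):
    """x1: data batch (B, D); y: target (e.g. counterfactual) condition (B,)."""
    x0 = torch.randn_like(x1)                      # source noise
    t = torch.rand(x1.size(0), 1)                  # time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                    # linear interpolation path
    target = x1 - x0                               # constant target velocity
    return ((v_theta(x_t, t, y) - target) ** 2).mean()

if __name__ == "__main__":
    model = VelocityNet(dim=8, num_classes=3)
    loss = cfm_loss(model, torch.randn(16, 8), torch.randint(0, 3, (16,)))
    loss.backward()
    print(float(loss))
```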
[466] Online Conformal Prediction with Efficiency Guarantees
Vaidehi Srinivas
Main category: cs.LG
TL;DR: The paper studies online conformal prediction for interval forecasting, aiming to optimize efficiency (interval length) while maintaining target coverage rates, with different results for exchangeable vs arbitrary input sequences.
Details
Motivation: To develop online algorithms that construct efficient confidence intervals while maintaining target coverage rates, addressing the gap between exchangeable and arbitrary input sequences in conformal prediction.
Method: Developed deterministic algorithms for online interval prediction that achieve coverage-efficiency tradeoffs, with different approaches for exchangeable sequences (achieving near-optimal efficiency) and arbitrary sequences (showing fundamental limitations and providing matching algorithms).
Result: For exchangeable sequences: algorithms achieve (1-α)-o(1) coverage with length bounded by the best fixed interval. For arbitrary sequences: fundamental tradeoff between approximation ratio and coverage mistakes, with matching algorithm achieving Pareto-optimal settings.
Conclusion: There’s a fundamental gap between exchangeable and arbitrary settings in online conformal prediction, with no single algorithm being Pareto-optimal for both cases, requiring different approaches for different sequence types.
Abstract: We study the problem of conformal prediction in a novel online framework that directly optimizes efficiency. In our problem, we are given a target miscoverage rate $\alpha > 0$, and a time horizon $T$. On each day $t \le T$ an algorithm must output an interval $I_t \subseteq [0, 1]$, then a point $y_t \in [0, 1]$ is revealed. The goal of the algorithm is to achieve coverage, that is, $y_t \in I_t$ on (close to) a $(1 - \alpha)$-fraction of days, while maintaining efficiency, that is, minimizing the average volume (length) of the intervals played. This problem is an online analogue to the problem of constructing efficient confidence intervals. We study this problem over arbitrary and exchangeable (random order) input sequences. For exchangeable sequences, we show that it is possible to construct intervals that achieve coverage $(1 - \alpha) - o(1)$, while having length upper bounded by the best fixed interval that achieves coverage in hindsight. For arbitrary sequences however, we show that any algorithm that achieves a $\mu$-approximation in average length compared to the best fixed interval achieving coverage in hindsight, must make a multiplicative factor more mistakes than $\alpha T$, where the multiplicative factor depends on $\mu$ and the aspect ratio of the problem. Our main algorithmic result is a matching algorithm that can recover all Pareto-optimal settings of $\mu$ and number of mistakes. Furthermore, our algorithm is deterministic and therefore robust to an adaptive adversary. This gap between the exchangeable and arbitrary settings is in contrast to the classical online learning problem. In fact, we show that no single algorithm can simultaneously be Pareto-optimal for arbitrary sequences and optimal for exchangeable sequences. On the algorithmic side, we give an algorithm that achieves the near-optimal tradeoff between the two cases.
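To make the coverage/length trade-off concrete, here is a toy illustration of the online interval-forecasting setting: after each point arrives, the interval half-width is nudged down on coverage and up on a miss, so the empirical miss rate tracks $\alpha$. This is a generic adaptive-radius heuristic shown only for intuition, not the algorithm analyzed in the paper.
```python
import random

def online_intervals(ys, alpha=0.1, lr=0.05):
    center, radius = 0.5, 0.5
    misses, total_length = 0, 0.0
    for y in ys:
        lo, hi = max(0.0, center - radius), min(1.0, center + radius)
        total_length += hi - lo
        miss = not (lo <= y <= hi)
        misses += miss
        # grow the interval after a miss, shrink it slightly after coverage
        radius = max(0.0, radius + lr * (miss - alpha))
        center = 0.9 * center + 0.1 * y          # slow tracking of the data
    n = len(ys)
    return misses / n, total_length / n

if __name__ == "__main__":
    random.seed(0)
    data = [random.betavariate(2, 5) for _ in range(2000)]
    miss_rate, avg_len = online_intervals(data, alpha=0.1)
    print(f"miss rate ~ {miss_rate:.3f}, average length ~ {avg_len:.3f}")
```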
[467] What Expressivity Theory Misses: Message Passing Complexity for GNNs
Niklas Kemper, Tom Wollschläger, Stephan Günnemann
Main category: cs.LG
TL;DR: The paper proposes Message Passing Complexity (MPC) as a new framework to replace expressivity theory for analyzing GNNs, arguing that expressivity theory is misguided for practical applications.
Details
Motivation: Current focus on expressivity theory is misguided because higher expressivity is unnecessary for most real-world tasks, and expressivity theory's binary characterization fails to reflect practical GNN capabilities.
Method: Propose Message Passing Complexity (MPC) - a continuous measure that quantifies the difficulty for a GNN architecture to solve tasks through message passing, capturing practical limitations like over-squashing.
Result: MPC’s theoretical predictions correlate with empirical performance on fundamental GNN tasks, successfully explaining architectural successes and failures.
Conclusion: MPC advances beyond expressivity theory to provide a more powerful and nuanced framework for understanding and improving GNN architectures, effectively narrowing the gap between theory and practice.
Abstract: Expressivity theory, characterizing which graphs a GNN can distinguish, has become the predominant framework for analyzing GNNs, with new models striving for higher expressivity. However, we argue that this focus is misguided: First, higher expressivity is not necessary for most real-world tasks as these tasks rarely require expressivity beyond the basic WL test. Second, expressivity theory’s binary characterization and idealized assumptions fail to reflect GNNs’ practical capabilities. To overcome these limitations, we propose Message Passing Complexity (MPC): a continuous measure that quantifies the difficulty for a GNN architecture to solve a given task through message passing. MPC captures practical limitations like over-squashing while preserving the theoretical impossibility results from expressivity theory, effectively narrowing the gap between theory and practice. Through extensive validation on fundamental GNN tasks, we show that MPC’s theoretical predictions correlate with empirical performance, successfully explaining architectural successes and failures. Thereby, MPC advances beyond expressivity theory to provide a more powerful and nuanced framework for understanding and improving GNN architectures.
[468] Conformal Prediction for Signal Temporal Logic Inference
Danyang Li, Yixuan Wang, Matthew Cleaveland, Mingyu Cai, Roberto Tron
Main category: cs.LG
TL;DR: An end-to-end differentiable conformal prediction framework for Signal Temporal Logic inference that provides statistical guarantees while improving both reliability and interpretability of learned formulas.
Details
Motivation: Existing STL inference methods lack formal confidence guarantees for inferred rules, and conformal prediction is typically applied as a post-training wrapper without improving model learning.
Method: Introduces a robustness-based nonconformity score, embeds a smooth CP layer directly into training, and uses a new loss function that simultaneously optimizes inference accuracy and CP prediction sets with a single term.
Result: Experiments show the approach reduces prediction uncertainty (high coverage with smaller prediction sets) and improves accuracy (fewer misclassifications) over state-of-the-art baselines.
Conclusion: The proposed framework successfully integrates conformal prediction into STL inference, providing statistical guarantees while enhancing both reliability and interpretability of the learned temporal logic formulas.
Abstract: Signal Temporal Logic (STL) inference seeks to extract human-interpretable rules from time-series data, but existing methods lack formal confidence guarantees for the inferred rules. Conformal prediction (CP) is a technique that can provide statistical correctness guarantees, but is typically applied as a post-training wrapper without improving model learning. Instead, we introduce an end-to-end differentiable CP framework for STL inference that enhances both reliability and interpretability of the resulting formulas. We introduce a robustness-based nonconformity score, embed a smooth CP layer directly into training, and employ a new loss function that simultaneously optimizes inference accuracy and CP prediction sets with a single term. Following training, an exact CP procedure delivers statistical guarantees for the learned STL formulas. Experiments on benchmark time-series tasks show that our approach reduces uncertainty in predictions (i.e., it achieves high coverage while reducing prediction set size), and improves accuracy (i.e., the number of misclassifications when using a fixed threshold) over state-of-the-art baselines.
[469] MaNGO - Adaptable Graph Network Simulators via Meta-Learning
Philipp Dahlinger, Tai Hoang, Denis Blessing, Niklas Freymuth, Gerhard Neumann
Main category: cs.LG
TL;DR: MaNGO is a meta-learning approach that enables fast adaptation to new physical parameters in graph network simulators without retraining, using conditional neural processes and neural operators to handle varying material properties.
Details
Motivation: Traditional mesh-based simulations are computationally expensive, while data-driven GNSs require retraining for parameter variations. There's a need for efficient simulation methods that can adapt to different physical parameters without extensive retraining.
Method: Proposes Meta Neural Graph Operator (MaNGO) that learns shared latent structure through meta-learning, encodes graph trajectories using conditional neural processes, and combines CNPs with neural operators to mitigate error accumulation.
Result: MaNGO achieves superior performance over existing GNS methods on dynamics prediction tasks with varying material properties, with accuracy on unseen properties close to oracle model performance.
Conclusion: Meta-learning enables efficient adaptation to new physical parameters in physics simulations, eliminating the need for retraining while maintaining high accuracy.
Abstract: Accurately simulating physics is crucial across scientific domains, with applications spanning from robotics to materials science. While traditional mesh-based simulations are precise, they are often computationally expensive and require knowledge of physical parameters, such as material properties. In contrast, data-driven approaches like Graph Network Simulators (GNSs) offer faster inference but suffer from two key limitations: Firstly, they must be retrained from scratch for even minor variations in physical parameters, and secondly they require labor-intensive data collection for each new parameter setting. This is inefficient, as simulations with varying parameters often share a common underlying latent structure. In this work, we address these challenges by learning this shared structure through meta-learning, enabling fast adaptation to new physical parameters without retraining. To this end, we propose a novel architecture that generates a latent representation by encoding graph trajectories using conditional neural processes (CNPs). To mitigate error accumulation over time, we combine CNPs with a novel neural operator architecture. We validate our approach, Meta Neural Graph Operator (MaNGO), on several dynamics prediction tasks with varying material properties, demonstrating superior performance over existing GNS methods. Notably, MaNGO achieves accuracy on unseen material properties close to that of an oracle model.
[470] Spatio-Temporal Graph Convolutional Networks for EV Charging Demand Forecasting Using Real-World Multi-Modal Data Integration
Jose Tupayachi, Mustafa C. Camur, Kevin Heaslip, Xueping Li
Main category: cs.LG
TL;DR: TW-GCN framework combines Graph Convolutional Networks with temporal models to predict EV charging demand using traffic, weather, and infrastructure data, achieving best performance with 3-hour forecasts and 1DCNN temporal model.
Details
Motivation: Address challenges from uneven spatial distribution and irregular utilization of EV charging infrastructure that affect power grid stability and investment planning.
Method: TW-GCN spatio-temporal forecasting framework combining Graph Convolutional Networks with temporal architectures, using real-world traffic flows, weather conditions, and proprietary EV infrastructure data.
Result: Mid-horizon (3-hour) forecasts achieve best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows predictive accuracy disparities across different Tennessee regions.
Conclusion: TW-GCN framework advances integration of data-driven intelligence into EV infrastructure planning, supporting sustainable mobility transitions and resilient grid management.
Abstract: Transportation remains a major contributor to greenhouse gas emissions, highlighting the urgency of transitioning toward sustainable alternatives such as electric vehicles (EVs). Yet, uneven spatial distribution and irregular utilization of charging infrastructure create challenges for both power grid stability and investment planning. This study introduces TW-GCN, a spatio-temporal forecasting framework that combines Graph Convolutional Networks with temporal architectures to predict EV charging demand in Tennessee, United States (U.S.). We utilize real-world traffic flows, weather conditions, and proprietary data provided by one of the largest EV infrastructure company in the U.S. to capture both spatial dependencies and temporal dynamics. Extensive experiments across varying lag horizons, clustering strategies, and sequence lengths reveal that mid-horizon (3-hour) forecasts achieve the best balance between responsiveness and stability, with 1DCNN consistently outperforming other temporal models. Regional analysis shows disparities in predictive accuracy across East, Middle, and West Tennessee, reflecting how station density, population, and local demand variability shape model performance. The proposed TW-GCN framework advances the integration of data-driven intelligence into EV infrastructure planning, supporting both sustainable mobility transitions and resilient grid management.
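A rough sketch (assuming PyTorch) of the general spatio-temporal recipe described above: a graph convolution mixes information across stations at each time step, then a 1D convolution over the time axis captures temporal dynamics. Layer sizes, the normalized adjacency, and the readout are placeholders, not the paper's configuration.
```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, in_dim, hidden, horizon):
        super().__init__()
        self.gcn_weight = nn.Linear(in_dim, hidden)
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.head = nn.Linear(hidden, horizon)

    def forward(self, x, a_hat):
        # x: (batch, time, nodes, in_dim); a_hat: (nodes, nodes) normalized adjacency
        h = torch.einsum("ij,btjf->btif", a_hat, x)   # spatial propagation
        h = torch.relu(self.gcn_weight(h))            # (batch, time, nodes, hidden)
        b, t, n, d = h.shape
        h = h.permute(0, 2, 3, 1).reshape(b * n, d, t)
        h = torch.relu(self.temporal(h))              # 1D CNN over time per node
        h = h[:, :, -1].reshape(b, n, d)              # summary at the last time step
        return self.head(h)                           # (batch, nodes, horizon)

if __name__ == "__main__":
    model = SpatioTemporalBlock(in_dim=4, hidden=16, horizon=3)
    a_hat = torch.eye(10)                             # stand-in adjacency
    print(model(torch.randn(2, 12, 10, 4), a_hat).shape)  # torch.Size([2, 10, 3])
```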
[471] Incentivizing Time-Aware Fairness in Data Sharing
Jiangwei Chen, Kieu Thao Nguyen Pham, Rachael Hwee Ling Sim, Arun Verma, Zhaoxuan Wu, Chuan-Sheng Foo, Bryan Kian Hsiang Low
Main category: cs.LG
TL;DR: Proposes a time-aware data sharing framework with novel incentives that reward parties who join earlier with higher value, addressing the limitation of simultaneous participation in collaborative ML.
Details
Motivation: Existing collaborative ML frameworks assume all parties join simultaneously, but real-world scenarios involve parties joining at different times due to data cleaning, legal barriers, or unawareness. Early joiners incur higher risk and encourage others, so they deserve higher rewards.
Method: Developed a fair and time-aware data sharing framework with novel time-aware incentives. Created methods for determining reward values that satisfy these incentives and generate model rewards to realize the calculated values.
Result: Empirical demonstration on synthetic and real-world datasets shows the properties of the proposed methods work effectively in practice.
Conclusion: The proposed time-aware framework successfully addresses the limitation of simultaneous participation by providing appropriate incentives for early joiners, making collaborative data sharing more realistic and fair in real-world scenarios.
Abstract: In collaborative data sharing and machine learning, multiple parties aggregate their data resources to train a machine learning model with better model performance. However, as the parties incur data collection costs, they are only willing to do so when guaranteed incentives, such as fairness and individual rationality. Existing frameworks assume that all parties join the collaboration simultaneously, which does not hold in many real-world scenarios. Due to the long processing time for data cleaning, difficulty in overcoming legal barriers, or unawareness, the parties may join the collaboration at different times. In this work, we propose the following perspective: As a party who joins earlier incurs higher risk and encourages the contribution from other wait-and-see parties, that party should receive a reward of higher value for sharing data earlier. To this end, we propose a fair and time-aware data sharing framework, including novel time-aware incentives. We develop new methods for deciding reward values to satisfy these incentives. We further illustrate how to generate model rewards that realize the reward values and empirically demonstrate the properties of our methods on synthetic and real-world datasets.
[472] Preconditioned Norms: A Unified Framework for Steepest Descent, Quasi-Newton and Adaptive Methods
Andrey Veprikov, Arman Bolatov, Samuel Horváth, Aleksandr Beznosikov, Martin Takáč, Slavomir Hanzely
Main category: cs.LG
TL;DR: A unified optimization framework that generalizes steepest descent, quasi-Newton, and adaptive methods through preconditioned matrix norms, revealing that popular optimizers like SGD, Adam, Muon, and others are special cases of the same principle.
Details
Motivation: Existing optimization methods face a fundamental trade-off between adapting to problem geometry and leveraging curvature information. Steepest descent adapts to geometry but is first-order, while quasi-Newton and adaptive methods use curvature but are restricted to Frobenius geometry.
Method: Proposes a unified framework using preconditioned matrix norms, providing systematic treatment of affine and scale invariance in matrix-parameterized settings. Introduces two new methods: MuAdam and MuAdam-SANIA, combining Muon’s spectral geometry with Adam-style preconditioning.
Result: The new optimizers MuAdam and MuAdam-SANIA are competitive with and sometimes outperform existing state-of-the-art methods in experiments.
Conclusion: The framework unifies diverse optimization approaches and enables development of new methods that effectively combine geometric adaptation with curvature utilization.
Abstract: Optimization lies at the core of modern deep learning, yet existing methods often face a fundamental trade-off between adapting to problem geometry and leveraging curvature utilization. Steepest descent algorithms adapt to different geometries through norm choices but remain strictly first-order, whereas quasi-Newton and adaptive optimizers incorporate curvature information but are restricted to Frobenius geometry, limiting their applicability across diverse architectures. In this work, we propose a unified framework generalizing steepest descent, quasi-Newton methods, and adaptive methods through the novel notion of preconditioned matrix norms. This abstraction reveals that widely used optimizers such as SGD and Adam, as well as more advanced approaches like Muon and KL-Shampoo, and recent hybrids including SOAP and SPlus, all emerge as special cases of the same principle. Within this framework, we provide the first systematic treatment of affine and scale invariance in the matrix-parameterized setting, establishing necessary and sufficient conditions under generalized norms. Building on this foundation, we introduce two new methods, $\texttt{MuAdam}$ and $\texttt{MuAdam-SANIA}$, which combine the spectral geometry of Muon with Adam-style preconditioning. Our experiments demonstrate that these optimizers are competitive with, and in some cases outperform, existing state-of-the-art methods. Our code is available at https://github.com/brain-lab-research/LIB/tree/quasi_descent
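For intuition (a standard statement of the preconditioned-norm view; the paper's matrix-norm generalization and invariance analysis go beyond this sketch), steepest descent measured in the norm $\|z\|_P = \sqrt{z^\top P z}$ solves
$$x_{k+1} = \arg\min_{x} \; \langle \nabla f(x_k),\, x - x_k \rangle + \tfrac{1}{2\eta}\, \|x - x_k\|_P^2 \;=\; x_k - \eta\, P^{-1} \nabla f(x_k).$$
Choosing $P = I$ recovers (stochastic) gradient descent, a diagonal $P$ built from running squared-gradient statistics recovers Adam-style preconditioning (momentum aside), and a curvature estimate $P \approx \nabla^2 f(x_k)$ yields quasi-Newton updates, which is the sense in which these optimizers appear as special cases of one principle.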
[473] Graph Few-Shot Learning via Adaptive Spectrum Experts and Cross-Set Distribution Calibration
Yonghao Liu, Yajun Wang, Chunli Guo, Wei Pang, Ximing Li, Fausto Giunchiglia, Xiaoyue Feng, Renchu Guan
Main category: cs.LG
TL;DR: GRACE is a graph few-shot learning framework that addresses limitations of fixed graph filters and distribution mismatches between support and query sets by integrating adaptive spectrum experts and cross-set distribution calibration.
Details
Motivation: Current graph few-shot learning methods use predefined graph filters that don't account for local topological heterogeneity, and assume support and query sets share the same distribution, which leads to suboptimal generalization with limited labeled data.
Method: Proposes GRACE framework with two key components: adaptive spectrum experts to handle local structural variations, and cross-set distribution calibration techniques to address distribution mismatches between support and query sets.
Result: GRACE consistently outperforms state-of-the-art baselines across various experimental settings, demonstrating improved generalization capabilities.
Conclusion: The proposed GRACE framework effectively addresses key limitations in graph few-shot learning by adapting to local structural variations and calibrating cross-set distributions, leading to superior performance compared to existing methods.
Abstract: Graph few-shot learning has attracted increasing attention due to its ability to rapidly adapt models to new tasks with only limited labeled nodes. Despite the remarkable progress made by existing graph few-shot learning methods, several key limitations remain. First, most current approaches rely on predefined and unified graph filters (e.g., low-pass or high-pass filters) to globally enhance or suppress node frequency signals. Such fixed spectral operations fail to account for the heterogeneity of local topological structures inherent in real-world graphs. Moreover, these methods often assume that the support and query sets are drawn from the same distribution. However, under few-shot conditions, the limited labeled data in the support set may not sufficiently capture the complex distribution of the query set, leading to suboptimal generalization. To address these challenges, we propose GRACE, a novel Graph few-shot leaRning framework that integrates Adaptive spectrum experts with Cross-sEt distribution calibration techniques. Theoretically, the proposed approach enhances model generalization by adapting to both local structural variations and cross-set distribution calibration. Empirically, GRACE consistently outperforms state-of-the-art baselines across a wide range of experimental settings. Our code can be found here.
[474] Rebalancing with Calibrated Sub-classes (RCS): A Statistical Fusion-based Framework for Robust Imbalanced Classification across Modalities
Priyobrata Mondal, Faizanuddin Ansari, Swagatam Das
Main category: cs.LG
TL;DR: RCS is a novel distribution calibration framework that uses weighted Gaussian mixtures from majority and intermediate classes to estimate minority class parameters, preventing feature disentanglement and mitigating overgeneralization in imbalanced classification.
Details
Motivation: Class imbalance poses a critical challenge for robust classification by biasing models toward majority classes, and distribution calibration offers a promising solution to estimate more accurate class distributions.
Method: RCS fuses statistical information from majority and intermediate class distributions via weighted Gaussian mixtures, uses an encoder-decoder network to preserve structural relationships, and generates synthetic samples from calibrated distributions.
Result: Extensive experiments on diverse image, text, and tabular datasets demonstrate that RCS consistently outperforms several baseline and state-of-the-art methods.
Conclusion: RCS effectively addresses real-world imbalanced classification challenges through its fusion-based calibration approach that incorporates neighborhood distribution information rather than relying solely on majority-class statistics.
Abstract: Class imbalance, where certain classes have insufficient data, poses a critical challenge for robust classification, often biasing models toward majority classes. Distribution calibration offers a promising avenue to address this by estimating more accurate class distributions. In this work, we propose Rebalancing with Calibrated Sub-classes (RCS) - a novel distribution calibration framework for robust imbalanced classification. RCS aims to fuse statistical information from the majority and intermediate class distributions via a weighted mixture of Gaussian components to estimate minority class parameters more accurately. An encoder-decoder network is trained to preserve structural relationships in imbalanced datasets and prevent feature disentanglement. Post-training, encoder-extracted feature vectors are leveraged to generate synthetic samples guided by the calibrated distributions. This fusion-based calibration effectively mitigates overgeneralization by incorporating neighborhood distribution information rather than relying solely on majority-class statistics. Extensive experiments on diverse image, text, and tabular datasets demonstrate that RCS consistently outperforms several baseline and state-of-the-art methods, highlighting its effectiveness and broad applicability in addressing real-world imbalanced classification challenges.
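The sketch below illustrates the general distribution-calibration idea in the spirit of RCS: minority-class Gaussian parameters are estimated as a similarity-weighted mixture of statistics from better-represented classes, and synthetic samples are drawn from the calibrated Gaussian. The weighting kernel, the mixing coefficients, and all names are assumptions, not the paper's exact formulation (which also uses an encoder-decoder for features).
```python
import numpy as np

def calibrate_minority(minority_feats, donor_means, donor_covs, temperature=1.0):
    """minority_feats: (m, d); donor_means: list of (d,); donor_covs: list of (d, d)."""
    mu_min = minority_feats.mean(axis=0)
    dists = np.array([np.linalg.norm(mu_min - mu) for mu in donor_means])
    w = np.exp(-dists / temperature)
    w = w / w.sum()                                  # similarity-based mixture weights
    mu = 0.5 * (w @ np.stack(donor_means)) + 0.5 * mu_min
    cov = sum(wi * ci for wi, ci in zip(w, donor_covs))
    return mu, cov

def sample_synthetic(mu, cov, n, seed=0):
    return np.random.default_rng(seed).multivariate_normal(mu, cov, size=n)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    donors = [rng.normal(size=4) for _ in range(3)]
    covs = [np.eye(4) * s for s in (0.5, 1.0, 1.5)]
    mu, cov = calibrate_minority(rng.normal(size=(5, 4)), donors, covs)
    print(sample_synthetic(mu, cov, n=10).shape)     # (10, 4)
```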
[475] Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning
Shikuang Deng, Jiayuan Zhang, Yuhang Wu, Ting Chen, Shi Gu
Main category: cs.LG
TL;DR: SPHeRe is a novel unsupervised learning method that integrates orthogonality and structural information preservation to address limitations of traditional Hebbian learning, achieving SOTA performance on image classification benchmarks and showing effectiveness in continual learning and transfer learning scenarios.
Details
Motivation: Traditional Hebbian learning suffers from unconstrained connection updates and lack of feedback mediation, limiting its scaling to complex networks and tasks. SPHeRe aims to overcome these limitations while maintaining biological plausibility.
Method: SPHeRe integrates orthogonality constraints and structural information preservation through a local auxiliary nonlinear block. It uses a structural-preservation loss that backpropagates through an auxiliary lightweight projection as feedback mediation, with orthogonality constraints bounding update magnitudes.
Result: SPHeRe achieves state-of-the-art performance among unsupervised synaptic plasticity approaches on CIFAR-10, CIFAR-100, and Tiny-ImageNet. It also shows strong effectiveness in continual learning, transfer learning, and image reconstruction tasks, demonstrating robust and generalizable feature extraction.
Conclusion: This work demonstrates the competitiveness and potential of Hebbian unsupervised learning rules in modern deep learning frameworks, showing the possibility of efficient biologically inspired algorithms without strict dependence on backpropagation.
Abstract: Hebbian learning is a biological principle that intuitively describes how neurons adapt their connections through repeated stimuli. However, when applied to machine learning, it suffers serious issues due to the unconstrained updates of the connections and the lack of accounting for feedback mediation. Such shortcomings limit its effective scaling to complex network architectures and tasks. To this end, here we introduce the Structural Projection Hebbian Representation (SPHeRe), a novel unsupervised learning method that integrates orthogonality and structural information preservation through a local auxiliary nonlinear block. The loss for structural information preservation backpropagates to the input through an auxiliary lightweight projection that conceptually serves as feedback mediation while the orthogonality constraints account for the boundedness of updating magnitude. Extensive experimental results show that SPHeRe achieves SOTA performance among unsupervised synaptic plasticity approaches on standard image classification benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet. Furthermore, the method exhibits strong effectiveness in continual learning and transfer learning scenarios, and image reconstruction tasks show the robustness and generalizability of the extracted features. This work demonstrates the competitiveness and potential of Hebbian unsupervised learning rules within modern deep learning frameworks, demonstrating the possibility of efficient and biologically inspired learning algorithms without the strong dependence on strict backpropagation. Our code is available at https://github.com/brain-intelligence-lab/SPHeRe.
[476] Dual-Weighted Reinforcement Learning for Generative Preference Modeling
Shengyu Feng, Yun He, Shuang Ma, Beibin Li, Yuanhao Xiong, Songlin Li, Karishma Mandyam, Julian Katz-Samuels, Shengjie Bi, Licheng Yu, Hejia Zhang, Karthik Abinav Sankararaman, Han Fang, Riham Mansour, Yiming Yang, Manaal Faruqui
Main category: cs.LG
TL;DR: DWRL is a new RL framework that integrates chain-of-thought reasoning with preference modeling using dual-weighted objectives to handle non-verifiable tasks with human preference pairs.
Details
Motivation: Extending RL from verifiable tasks to non-verifiable preference-based tasks remains challenging and underexplored, requiring new approaches that can handle human preference pairs effectively.
Method: Dual-Weighted Reinforcement Learning (DWRL) integrates CoT reasoning with Bradley-Terry model via dual-weighted RL objective with instance-wise misalignment weight and group-wise conditional preference score to train generative preference models.
Result: DWRL consistently outperforms both GPM baselines and scalar models across multiple benchmarks and model scales (Llama3, Qwen2.5), producing coherent and interpretable thoughts.
Conclusion: DWRL serves as a general framework for reasoning-enhanced preference learning that extends beyond verifiable tasks to handle human preference modeling effectively.
Abstract: Reinforcement learning (RL) has recently proven effective at scaling chain-of-thought (CoT) reasoning in large language models on tasks with verifiable answers. However, extending RL to more general non-verifiable tasks, typically in the format of human preference pairs, remains both challenging and underexplored. In this work, we propose Dual-Weighted Reinforcement Learning (DWRL), a new framework for preference modeling that integrates CoT reasoning with the Bradley-Terry (BT) model via a dual-weighted RL objective that preserves preference-modeling inductive bias. DWRL approximates the maximum-likelihood objective of the BT model with two complementary weights: an instance-wise misalignment weight, which emphasizes under-trained pairs misaligned with human preference, and a group-wise (self-normalized) conditional preference score, which promotes promising thoughts. In this paper, we apply DWRL to preference modeling by training generative preference models (GPMs) to first generate a thought and then predict the human preference score. Across multiple benchmarks and model scales (Llama3 and Qwen2.5), DWRL consistently outperforms both GPM baselines and scalar models, while producing coherent, interpretable thoughts. In summary, our results position DWRL as a general framework for reasoning-enhanced preference learning beyond verifiable tasks.
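A hedged sketch (assuming PyTorch) of a dual-weighted Bradley-Terry objective in the spirit of DWRL: the per-pair BT log-likelihood is scaled by an instance-wise weight that emphasizes pairs the current model ranks incorrectly and by a group-wise self-normalized score over sampled thoughts. The exact weight definitions are the paper's; these are simplified stand-ins.
```python
import torch
import torch.nn.functional as F

def dual_weighted_bt_loss(r_chosen, r_rejected, group_scores):
    """r_chosen, r_rejected: (B,) scalar preference scores; group_scores: (B, K)
    scores of K sampled thoughts per pair, used for the group-wise weight."""
    margin = r_chosen - r_rejected
    bt_logprob = F.logsigmoid(margin)                    # Bradley-Terry log-likelihood
    misalign_w = (1.0 - torch.sigmoid(margin)).detach()  # large when the pair is misordered
    group_w = torch.softmax(group_scores, dim=-1).max(dim=-1).values.detach()
    return -(misalign_w * group_w * bt_logprob).mean()

if __name__ == "__main__":
    loss = dual_weighted_bt_loss(torch.randn(8, requires_grad=True),
                                 torch.randn(8), torch.randn(8, 4))
    loss.backward()
    print(float(loss))
```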
[477] Feature-driven reinforcement learning for photovoltaic in continuous intraday trading
Arega Getaneh Abate, Xiufeng Liu, Ruyu Liu, Xiaobing Zhang
Main category: cs.LG
TL;DR: A feature-driven reinforcement learning approach for PV intraday trading that uses PPO to learn interpretable bidding policies, outperforming benchmarks across diverse scenarios with transparent decision rules.
Details
Motivation: PV operators face uncertainty in generation and electricity prices, needing real-time position adjustments in continuous intraday markets to improve revenues and reduce imbalance costs.
Method: Cast as Markov Decision Process with reward balancing trading profit and imbalance penalties, solved with Proximal Policy Optimization using predominantly linear, interpretable policy trained on historical market data.
Result: Strategy consistently outperforms benchmark baselines across diverse scenarios in out-of-sample evaluation, showing rapid convergence, real-time inference, and transparent decision rules.
Conclusion: Feature-driven RL offers a practical, data-efficient, and operationally deployable pathway for active intraday participation by PV producers, with learned weights highlighting market microstructure and historical features.
Abstract: Photovoltaic (PV) operators face substantial uncertainty in generation and short-term electricity prices. Continuous intraday markets enable producers to adjust their positions in real time, potentially improving revenues and reducing imbalance costs. We propose a feature-driven reinforcement learning (RL) approach for PV intraday trading that integrates data-driven features into the state and learns bidding policies in a sequential decision framework. The problem is cast as a Markov Decision Process with a reward that balances trading profit and imbalance penalties and is solved with Proximal Policy Optimization (PPO) using a predominantly linear, interpretable policy. Trained on historical market data and evaluated out-of-sample, the strategy consistently outperforms benchmark baselines across diverse scenarios. Extensive validation shows rapid convergence, real-time inference, and transparent decision rules. Learned weights highlight the central role of market microstructure and historical features. Taken together, these results indicate that feature-driven RL offers a practical, data-efficient, and operationally deployable pathway for active intraday participation by PV producers.
[478] 3D-GSRD: 3D Molecular Graph Auto-Encoder with Selective Re-mask Decoding
Chang Wu, Zhiyuan Liu, Wen Shu, Liang Wang, Yanchen Luo, Wenqiang Lei, Yatao Bian, Junfeng Fang, Xiang Wang
Main category: cs.LG
TL;DR: 3D-GSRD is a novel masked graph modeling approach for 3D molecular representation learning that uses selective re-mask decoding to prevent 2D structure leakage while maintaining sufficient 2D context for reconstruction.
Details
Motivation: Extending masked graph modeling from 2D to 3D molecular representation learning is challenging due to conflicting requirements: avoiding 2D structure leakage to the decoder while providing enough 2D context for reconstructing re-masked atoms.
Method: Proposes 3D-GSRD with Selective Re-mask Decoding (SRD) that re-masks only 3D-relevant information while preserving 2D graph structures, combined with a 3D Relational-Transformer encoder and structure-independent decoder.
Result: Achieves state-of-the-art performance on 7 out of 8 targets in the MD17 molecular property prediction benchmark, demonstrating strong downstream performance.
Conclusion: 3D-GSRD effectively addresses the challenges of 3D masked graph modeling through selective re-mask decoding and enhances encoder’s role in molecular representation learning.
Abstract: Masked graph modeling (MGM) is a promising approach for molecular representation learning (MRL). However, extending the success of re-mask decoding from 2D to 3D MGM is non-trivial, primarily due to two conflicting challenges: avoiding 2D structure leakage to the decoder, while still providing sufficient 2D context for reconstructing re-masked atoms. To address these challenges, we propose 3D-GSRD: a 3D Molecular Graph Auto-Encoder with Selective Re-mask Decoding. The core innovation of 3D-GSRD lies in its Selective Re-mask Decoding (SRD), which re-masks only 3D-relevant information from encoder representations while preserving the 2D graph structures. This SRD is synergistically integrated with a 3D Relational-Transformer (3D-ReTrans) encoder alongside a structure-independent decoder. We analyze that SRD, combined with the structure-independent decoder, enhances the encoder’s role in MRL. Extensive experiments show that 3D-GSRD achieves strong downstream performance, setting a new state-of-the-art on 7 out of 8 targets in the widely used MD17 molecular property prediction benchmark. The code is released at https://github.com/WuChang0124/3D-GSRD.
[479] SolverLLM: Leveraging Test-Time Scaling for Optimization Problem via LLM-Guided Search
Dong Li, Xujiang Zhao, Linlin Yu, Yanchi Liu, Wei Cheng, Zhengzhang Chen, Zhong Chen, Feng Chen, Chen Zhao, Haifeng Chen
Main category: cs.LG
TL;DR: SolverLLM is a training-free framework that uses test-time scaling and Monte Carlo Tree Search to generate mathematical formulations and solver code for diverse optimization problems, achieving strong generalization without additional training.
Details
Motivation: Existing methods for using LLMs in optimization either rely on prompt engineering (poor generalization) or require costly supervised training. There's a need for a more generalizable and training-free approach.
Method: Uses test-time scaling to generate mathematical formulations and translate them into solver-ready code. Employs a novel Monte Carlo Tree Search strategy with three modifications: dynamic expansion for adaptive formulation generation, prompt backpropagation for outcome-driven feedback, and uncertainty backpropagation to incorporate reward reliability.
Result: Experiments on six standard benchmark datasets show SolverLLM outperforms both prompt-based and learning-based baselines.
Conclusion: SolverLLM achieves strong generalization across diverse optimization problems without requiring additional training, demonstrating the effectiveness of the proposed MCTS-based approach.
Abstract: Large Language Models (LLMs) offer promising capabilities for tackling complex reasoning tasks, including optimization problems. However, existing methods either rely on prompt engineering, which leads to poor generalization across problem types, or require costly supervised training. We introduce SolverLLM, a training-free framework that leverages test-time scaling to solve diverse optimization problems. Rather than solving directly, SolverLLM generates mathematical formulations and translates them into solver-ready code, guided by a novel Monte Carlo Tree Search (MCTS) strategy. To enhance the search process, we modify classical MCTS with (1) dynamic expansion for adaptive formulation generation, (2) prompt backpropagation to guide exploration via outcome-driven feedback, and (3) uncertainty backpropagation to incorporate reward reliability into decision-making. Experiments on six standard benchmark datasets demonstrate that SolverLLM outperforms both prompt-based and learning-based baselines, achieving strong generalization without additional training.
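As a rough illustration of how reward reliability can enter the search, here is a minimal MCTS bookkeeping sketch in which the UCB exploitation term is down-weighted by an accumulated reliability estimate. The Node fields, the selection rule, and the exact weighting are assumptions for illustration, not SolverLLM's formulation.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prompt: str                      # partial formulation / prompt state
    visits: int = 0
    value_sum: float = 0.0
    reliability_sum: float = 0.0     # accumulated reward-reliability estimates
    children: list = field(default_factory=list)

def uct_score(child, parent_visits, c=1.4):
    """Classic UCT with the mean value scaled by how reliable its rewards were."""
    if child.visits == 0:
        return float("inf")
    mean_value = child.value_sum / child.visits
    mean_reliability = child.reliability_sum / child.visits   # assumed in [0, 1]
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return mean_reliability * mean_value + explore

def select_child(parent, c=1.4):
    return max(parent.children, key=lambda ch: uct_score(ch, parent.visits, c))

def backpropagate(path, reward, reliability):
    """Propagate both the outcome-driven reward and its reliability up the path."""
    for node in path:
        node.visits += 1
        node.value_sum += reward
        node.reliability_sum += reliability
```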
[480] Disentanglement Beyond Static vs. Dynamic: A Benchmark and Evaluation Framework for Multi-Factor Sequential Representations
Tal Barami, Nimrod Berman, Ilan Naiman, Amos H. Hason, Rotem Ezra, Omri Azencot
Main category: cs.LG
TL;DR: Introduces the first standardized benchmark for multi-factor sequential disentanglement with six datasets, automated tools, and a Koopman-inspired model that achieves state-of-the-art results.
Details
Motivation: Prior work focused on simpler two-factor static/dynamic settings, overlooking the inherently multi-factor nature of real-world sequential data across vision, audio, and time series.
Method: Proposes standardized benchmark with six datasets, modular tools for integration/development/evaluation, post-hoc Latent Exploration Stage for automatic alignment, and Koopman-inspired model. Uses Vision-Language Models for automated annotation and evaluation.
Result: Koopman-inspired model achieves state-of-the-art results. Vision-Language Models successfully automate dataset annotation and serve as zero-shot disentanglement evaluators, eliminating need for manual labels.
Conclusion: Provides robust and scalable foundation for advancing multi-factor sequential disentanglement through comprehensive benchmark, automated tools, and state-of-the-art methods.
Abstract: Learning disentangled representations in sequential data is a key goal in deep learning, with broad applications in vision, audio, and time series. While real-world data involves multiple interacting semantic factors over time, prior work has mostly focused on simpler two-factor static and dynamic settings, primarily because such settings make data collection easier, thereby overlooking the inherently multi-factor nature of real-world data. We introduce the first standardized benchmark for evaluating multi-factor sequential disentanglement across six diverse datasets spanning video, audio, and time series. Our benchmark includes modular tools for dataset integration, model development, and evaluation metrics tailored to multi-factor analysis. We additionally propose a post-hoc Latent Exploration Stage to automatically align latent dimensions with semantic factors, and introduce a Koopman-inspired model that achieves state-of-the-art results. Moreover, we show that Vision-Language Models can automate dataset annotation and serve as zero-shot disentanglement evaluators, removing the need for manual labels and human intervention. Together, these contributions provide a robust and scalable foundation for advancing multi-factor sequential disentanglement.
[481] Benchmarking Probabilistic Time Series Forecasting Models on Neural Activity
Ziyu Lu, Anna J. Li, Alexander E. Ladd, Pascha Matveev, Aditya Deole, Eric Shea-Brown, J. Nathan Kutz, Nicholas A. Steinmetz
Main category: cs.LG
TL;DR: Systematic evaluation of probabilistic deep learning models for neural activity forecasting shows they outperform classical statistical methods, with the best model providing informative predictions up to 1.5 seconds ahead.
Details
Motivation: To bridge the gap between advances in deep learning for time series forecasting and their limited application to neural activity forecasting, which is central to understanding neural systems and enabling closed-loop control.
Method: Evaluated eight probabilistic deep learning models (including two foundation models) against four classical statistical models and two baseline methods on spontaneous neural activity recorded from mouse cortex via widefield imaging.
Result: Several deep learning models consistently outperformed classical approaches across prediction horizons, with the best model producing informative forecasts up to 1.5 seconds into the future.
Conclusion: The findings point toward future control applications and open new avenues for probing the intrinsic temporal structure of neural activity.
Abstract: Neural activity forecasting is central to understanding neural systems and enabling closed-loop control. While deep learning has recently advanced the state-of-the-art in the time series forecasting literature, its application to neural activity forecasting remains limited. To bridge this gap, we systematically evaluated eight probabilistic deep learning models, including two foundation models, that have demonstrated strong performance on general forecasting benchmarks. We compared them against four classical statistical models and two baseline methods on spontaneous neural activity recorded from mouse cortex via widefield imaging. Across prediction horizons, several deep learning models consistently outperformed classical approaches, with the best model producing informative forecasts up to 1.5 seconds into the future. Our findings point toward future control applications and open new avenues for probing the intrinsic temporal structure of neural activity.
cs.MA
[482] Local Guidance for Configuration-Based Multi-Agent Pathfinding
Tomoki Arita, Keisuke Okumura
Main category: cs.MA
TL;DR: Local guidance in multi-agent pathfinding improves solution quality without excessive computational cost, establishing new performance frontiers when applied to the LaCAM solver.
Details
Motivation: To explore an alternative to global guidance by providing local guidance around each agent to improve coordination efficiency and reduce waiting times in multi-agent pathfinding.
Method: Providing local spatiotemporal guidance cues to planners in the vicinity of each agent, with recomputation as agents move, applied to the LaCAM configuration-based solver.
Result: Significant improvement in solution quality without exceeding a moderate time budget, establishing new performance frontiers for MAPF.
Conclusion: Local guidance with informative spatiotemporal cues can effectively enhance multi-agent pathfinding performance while maintaining computational feasibility.
Abstract: Guidance is an emerging concept that improves the empirical performance of real-time, sub-optimal multi-agent pathfinding (MAPF) methods. It offers additional information to MAPF algorithms to mitigate congestion on a global scale by considering the collective behavior of all agents across the entire workspace. This global perspective helps reduce agents’ waiting times, thereby improving overall coordination efficiency. In contrast, this study explores an alternative approach: providing local guidance in the vicinity of each agent. While such localized methods involve recomputation as agents move and may appear computationally demanding, we empirically demonstrate that supplying informative spatiotemporal cues to the planner can significantly improve solution quality without exceeding a moderate time budget. When applied to LaCAM, a leading configuration-based solver, this form of guidance establishes a new performance frontier for MAPF.
[483] SORA-ATMAS: Adaptive Trust Management and Multi-LLM Aligned Governance for Future Smart Cities
Usama Antuley, Shahbaz Siddiqui, Sufian Hameed, Waqas Arif, Subhan Shah, Syed Attique Shah
Main category: cs.MA
TL;DR: SORA-ATMAS is a governance framework that enables policy-aligned coordination of multiple AI agents in smart cities, reducing errors by 35% while maintaining real-time performance and regulatory compliance.
Details
Motivation: Address governance, risk, and compliance challenges in deploying agentic AI across heterogeneous smart city ecosystems, including accountability, data privacy, and regulatory alignment issues.
Method: Implemented SORA-ATMAS framework with three domain agents (Weather, Traffic, Safety) using multiple LLMs (GPT, Grok, DeepSeek), featuring governance policies, fallback mechanisms, and cross-domain rules for safe interoperability.
Result: Achieved 35% MAE reduction across agents, stable weather monitoring, effective high-risk traffic handling (0.85), adaptive trust regulation (0.65), throughput of 13.8-17.2 requests/second, execution times <72ms, and governance delays <100ms.
Conclusion: SORA-ATMAS provides a regulation-aligned, context-aware, and verifiable governance framework that consolidates distributed agent outputs into accountable real-time decisions for resilient smart-city management.
Abstract: The rapid evolution of smart cities has increased the reliance on intelligent interconnected services to optimize infrastructure, resources, and citizen well-being. Agentic AI has emerged as a key enabler by supporting autonomous decision-making and adaptive coordination, allowing urban systems to respond in real time to dynamic conditions. Its benefits are evident in areas such as transportation, where the integration of traffic data, weather forecasts, and safety sensors enables dynamic rerouting and a faster response to hazards. However, its deployment across heterogeneous smart city ecosystems raises critical governance, risk, and compliance (GRC) challenges, including accountability, data privacy, and regulatory alignment within decentralized infrastructures. Evaluation of SORA-ATMAS with three domain agents (Weather, Traffic, and Safety) demonstrated that its governance policies, including a fallback mechanism for high-risk scenarios, effectively steer multiple LLMs (GPT, Grok, DeepSeek) towards domain-optimized, policy-aligned outputs, producing an average MAE reduction of 35% across agents. Results showed stable weather monitoring, effective handling of high-risk traffic plateaus (0.85), and adaptive trust regulation in Safety/Fire scenarios (0.65). Runtime profiling of a 3-agent deployment confirmed scalability, with throughput between 13.8 and 17.2 requests per second, execution times below 72 ms, and governance delays under 100 ms; analytical projections suggest maintained performance at larger scales. Cross-domain rules ensured safe interoperability, with traffic rerouting permitted only under validated weather conditions. These findings validate SORA-ATMAS as a regulation-aligned, context-aware, and verifiable governance framework that consolidates distributed agent outputs into accountable, real-time decisions, offering a resilient foundation for smart-city management.
[484] ColorAgent: Building A Robust, Personalized, and Interactive OS Agent
Ning Li, Qiqiang Lin, Zheng Wu, Xiaoyun Mo, Weiming Zhang, Yin Zhao, Xiangmou Qu, Jiamu Zhou, Jun Wang, Congmin Zheng, Yuanyi Song, Hongjiang Chen, Heyuan Huang, Jihong Wang, Jiaxin Yin, Jingwei Yu, Junwei Liao, Qiuying Peng, Xingyu Lou, Jun Wang, Weiwen Liu, Zhuosheng Zhang, Weinan Zhang
Main category: cs.MA
TL;DR: ColorAgent is an OS agent that enables long-horizon, robust interactions with the environment and personalized user engagement through reinforcement learning and multi-agent framework, achieving state-of-the-art results on Android benchmarks.
Details
Motivation: With advancements in hardware, software, and LLMs, human-OS interaction is evolving from command-line to AI agent interactions. The goal is to build an OS agent that can execute user instructions faithfully and follow user desires.
Method: Enhanced model capabilities through step-wise reinforcement learning and self-evolving training. Developed a tailored multi-agent framework for generality, consistency, and robustness. Explored personalized user intent recognition and proactive engagement.
Result: Achieved success rates of 77.2% on AndroidWorld and 50.7% on AndroidLab benchmarks, establishing new state-of-the-art performance.
Conclusion: Current benchmarks are insufficient for comprehensive OS agent evaluation. Future work should focus on evaluation paradigms, agent collaboration, and security.
Abstract: With the advancements in hardware, software, and large language model technologies, the interaction between humans and operating systems has evolved from the command-line interface to the rapidly emerging AI agent interactions. Building an operating system (OS) agent capable of executing user instructions and faithfully following user desires is becoming a reality. In this technical report, we present ColorAgent, an OS agent designed to engage in long-horizon, robust interactions with the environment while also enabling personalized and proactive user interaction. To enable long-horizon interactions with the environment, we enhance the model’s capabilities through step-wise reinforcement learning and self-evolving training, while also developing a tailored multi-agent framework that ensures generality, consistency, and robustness. In terms of user interaction, we explore personalized user intent recognition and proactive engagement, positioning the OS agent not merely as an automation tool but as a warm, collaborative partner. We evaluate ColorAgent on the AndroidWorld and AndroidLab benchmarks, achieving success rates of 77.2% and 50.7%, respectively, establishing a new state of the art. Nonetheless, we note that current benchmarks are insufficient for a comprehensive evaluation of OS agents and propose further exploring directions in future work, particularly in the areas of evaluation paradigms, agent collaboration, and security. Our code is available at https://github.com/MadeAgents/mobile-use.
[485] Modeling realistic human behavior using generative agents in a multimodal transport system: Software architecture and Application to Toulouse
Trung-Dung Vu, Benoit Gaudou, Kamaldeep Singh Oberoi
Main category: cs.MA
TL;DR: An architecture combining LLMs with agent-based simulation to model realistic human mobility behavior in multimodal transport systems, demonstrated in Toulouse with context-aware transport decisions and habit formation.
Details
Motivation: To address the challenge of modeling realistic human behavior for understanding mode choices and proposing personalized mobility solutions in complex multimodal transport systems.
Method: Integrates GAMA simulation platform with LLM-based generative agents, GTFS data for public transport, and OpenTripPlanner for multimodal routing in an agent-based simulation framework.
Result: Over a simulated month, agents made context-aware transport decisions and formed habits over time, demonstrating realistic mobility behavior patterns.
Conclusion: Combining LLMs with agent-based simulation is promising for advancing intelligent transportation systems and personalized multimodal mobility solutions, with future work needed on scaling, real-time data integration, and memory model refinement.
Abstract: Modeling realistic human behaviour to understand people’s mode choices in order to propose personalised mobility solutions remains challenging. This paper presents an architecture for modeling realistic human mobility behavior in complex multimodal transport systems, demonstrated through a case study in Toulouse, France. We apply Large Language Models (LLMs) within an agent-based simulation to capture decision-making in a real urban setting. The framework integrates the GAMA simulation platform with an LLM-based generative agent, along with General Transit Feed Specification (GTFS) data for public transport, and OpenTripPlanner for multimodal routing. GAMA platform models the interactive transport environment, providing visualization and dynamic agent interactions while eliminating the need to construct the simulation environment from scratch. This design enables a stronger focus on developing generative agents and evaluating their performance in transport decision-making processes. Over a simulated month, results show that agents not only make context-aware transport decisions but also form habits over time. We conclude that combining LLMs with agent-based simulation offers a promising direction for advancing intelligent transportation systems and personalised multimodal mobility solutions. We also discuss some limitations of this approach and outline future work on scaling to larger regions, integrating real-time data, and refining memory models.
[486] Polynomial-time Configuration Generator for Connected Unlabeled Multi-Agent Pathfinding
Takahiro Suzuki, Keisuke Okumura
Main category: cs.MA
TL;DR: CUMAPF is a variant of MAPF where agents must maintain connectivity at all times, which is NP-hard. The paper proposes PULL, a complete polynomial-time algorithm that preserves connectivity while advancing toward target configurations.
Details
Motivation: Standard MAPF is insufficient for swarm robotics applications like self-reconfiguration and marching that require continuous connectivity between agents. CUMAPF addresses this fundamental requirement.
Method: Proposed PULL algorithm uses a rule-based one-step function to compute subsequent configurations that preserve connectivity while advancing toward targets. It's lightweight and runs in O(n²) time per step in 2D grids.
Result: PULL finds competitive solution qualities against trivial solutions for hundreds of agents in randomly generated instances. An eventually optimal solver integrating PULL into existing MAPF algorithms provides a tool for small-scale instances.
Conclusion: PULL provides a practical, polynomial-time solution for CUMAPF that maintains connectivity while achieving good performance, with an optimal solver available for smaller instances.
Abstract: We consider Connected Unlabeled Multi-Agent Pathfinding (CUMAPF), a variant of MAPF where the agents must maintain connectivity at all times. This problem is fundamental to swarm robotics applications like self-reconfiguration and marching, where standard MAPF is insufficient as it does not guarantee the required connectivity between agents. While unlabeled MAPF is tractable in optimization, CUMAPF is NP-hard even on highly restricted graph classes. To tackle this challenge, we propose PULL, a complete and polynomial-time algorithm with a simple design. It is based on a rule-based one-step function that computes a subsequent configuration that preserves connectivity and advances towards the target configuration. PULL is lightweight, and runs in $O(n^2)$ time per step in 2D grid, where $n$ is the number of agents. Our experiments further demonstrate its practical performance: PULL finds competitive solution qualities against trivial solutions for hundreds of agents, in randomly generated instances. Furthermore, we develop an eventually optimal solver that integrates PULL into an existing search-based MAPF algorithm, providing a valuable tool for small-scale instances.
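The core safety requirement in CUMAPF is that every intermediate configuration stays connected. A minimal check of that invariant on a 2D grid (4-connectivity via BFS) is sketched below; PULL's actual rule-based one-step function is not reproduced here, this only shows the property such a step must enforce.

```python
from collections import deque

def is_connected(config):
    """Check 4-connectivity of an agent configuration on a 2D grid.

    config: set of (x, y) cells occupied by agents. A CUMAPF one-step rule
    (like PULL's) must only propose configurations for which this holds.
    """
    if not config:
        return True
    start = next(iter(config))
    seen = {start}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in config and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen) == len(config)

print(is_connected({(0, 0), (0, 1), (1, 1)}))   # True
print(is_connected({(0, 0), (2, 2)}))           # False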
[487] Vahana.jl – A framework (not only) for large-scale agent-based models
Steffen Fürst, Tim Conrad, Carlo Jaeger, Sarah Wolf
Main category: cs.MA
TL;DR: Vahana.jl is a high-performance computing framework for agent-based models that addresses computational limitations through distributed computing and Julia’s interactive environment.
Details
Motivation: Traditional ABM platforms struggle with computational demands as simulations scale, failing to fully utilize modern computing resources and hindering large-scale model development.
Method: Built on synchronous graph dynamical systems formalism, implemented in Julia with support for distribution across multiple compute nodes and leveraging Julia's interactive REPL environment.
Result: Enables simulations beyond single-machine capabilities, especially well-suited for network-focused models, and facilitates rapid model development.
Conclusion: Vahana.jl provides a scalable solution for computationally intensive ABMs, particularly those with social network components, by combining distributed computing with an interactive development environment.
Abstract: Agent-based models (ABMs) offer a powerful framework for understanding complex systems. However, their computational demands often become a significant barrier as the number of agents and complexity of the simulation increase. Traditional ABM platforms often struggle to fully exploit modern computing resources, hindering the development of large-scale simulations. This paper presents Vahana.jl, a high performance computing open source framework that aims to address these limitations. Building on the formalism of synchronous graph dynamical systems, Vahana.jl is especially well suited for models with a focus on (social) networks. The framework seamlessly supports distribution across multiple compute nodes, enabling simulations that would otherwise be beyond the capabilities of a single machine. Implemented in Julia, Vahana.jl leverages the interactive Read-Eval-Print Loop (REPL) environment, facilitating rapid model development and experimentation.
[488] PARCO: Parallel AutoRegressive Models for Multi-Agent Combinatorial Optimization
Federico Berto, Chuanbo Hua, Laurin Luttmann, Jiwoo Son, Junyoung Park, Kyuree Ahn, Changhyun Kwon, Lin Xie, Jinkyoo Park
Main category: cs.MA
TL;DR: PARCO is a reinforcement learning framework for multi-agent combinatorial optimization that enables parallel solution construction through transformer-based communication, multiple pointer mechanism, and priority-based conflict handling.
Details
Motivation: Existing learning-based methods for multi-agent combinatorial optimization suffer from suboptimal coordination, poor generalization, and high computational latency, making them impractical for real-world applications.
Method: PARCO integrates three key components: transformer-based communication layers for agent collaboration, multiple pointer mechanism for parallel decision-making, and priority-based conflict handlers to resolve decision conflicts using learned priorities.
Result: PARCO outperforms state-of-the-art learning methods in multi-agent vehicle routing and scheduling problems, demonstrating strong generalization ability and remarkable computational efficiency.
Conclusion: PARCO provides an effective RL framework for multi-agent combinatorial optimization that addresses coordination, generalization, and efficiency challenges, with publicly available source code to support future research.
Abstract: Combinatorial optimization problems involving multiple agents are notoriously challenging due to their NP-hard nature and the necessity for effective agent coordination. Despite advancements in learning-based methods, existing approaches often face critical limitations, including suboptimal agent coordination, poor generalization, and high computational latency. To address these issues, we propose PARCO (Parallel AutoRegressive Combinatorial Optimization), a general reinforcement learning framework designed to construct high-quality solutions for multi-agent combinatorial tasks efficiently. To this end, PARCO integrates three key novel components: (1) transformer-based communication layers to enable effective agent collaboration during parallel solution construction, (2) a multiple pointer mechanism for low-latency, parallel agent decision-making, and (3) priority-based conflict handlers to resolve decision conflicts via learned priorities. We evaluate PARCO in multi-agent vehicle routing and scheduling problems, where our approach outperforms state-of-the-art learning methods, demonstrating strong generalization ability and remarkable computational efficiency. We make our source code publicly available to foster future research: https://github.com/ai4co/parco.
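To illustrate the third component, here is a toy priority-based conflict handler: agents propose moves in parallel and, when several target the same node, only the highest-priority agent proceeds. The function and its simplifying fallback (losers keep their current position) are assumptions for illustration, not PARCO's implementation.

```python
def resolve_conflicts(proposals, priorities, current_positions):
    """Toy priority-based conflict handler.

    proposals: dict agent -> proposed node
    priorities: dict agent -> learned priority score (higher wins)
    current_positions: dict agent -> current node (fallback on conflict)

    Returns the accepted node per agent. Note this single pass can still leave
    secondary clashes; a real handler would iterate or plan jointly.
    """
    by_target = {}
    for agent, target in proposals.items():
        by_target.setdefault(target, []).append(agent)

    accepted = {}
    for target, agents in by_target.items():
        winner = max(agents, key=lambda a: priorities[a])
        for agent in agents:
            accepted[agent] = target if agent == winner else current_positions[agent]
    return accepted

# Example: agents a and b both want node 5; b has higher priority and wins.
print(resolve_conflicts({"a": 5, "b": 5}, {"a": 0.2, "b": 0.9}, {"a": 1, "b": 2}))
```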
[489] Visual Multi-Agent System: Mitigating Hallucination Snowballing via Visual Flow
Xinlei Yu, Chengming Xu, Guibin Zhang, Yongbo He, Zhangquan Chen, Zhucun Xue, Jiangning Zhang, Yue Liao, Xiaobin Hu, Yu-Gang Jiang, Shuicheng Yan
Main category: cs.MA
TL;DR: ViF is a lightweight method that mitigates multi-agent visual hallucination snowballing in VLM-powered systems by using visual flow with selected relay tokens and attention reallocation.
Details
Motivation: Multi-agent systems using VLMs suffer from visual hallucination snowballing, where hallucinations from one agent get amplified by others due to over-reliance on textual communication.
Method: ViF uses visual flow with selected vision tokens that preserve visual evidence, and applies attention reallocation to amplify this pattern across agent interactions.
Result: The method significantly reduces hallucination snowballing and improves performance across eight benchmarks using four MAS structures and ten base models.
Conclusion: ViF effectively addresses visual hallucination snowballing in multi-agent VLM systems through visual flow communication and attention reallocation.
Abstract: Multi-Agent System (MAS) powered by Visual Language Models (VLMs) enables challenging tasks but suffers from a novel failure term, multi-agent visual hallucination snowballing, where hallucinations are seeded in a single agent and amplified by following ones due to the over-reliance on textual flow to relay visual information. Through turn-, layer-, and token-wise attention analyses, we provide detailed insights into the essence of hallucination snowballing regarding the reduction of visual attention allocation. It leads us to identify a subset of vision tokens with a unimodal attention peak in middle layers that best preserve visual evidence but gradually diminish in deeper agent turns, resulting in the visual hallucination snowballing in MAS. Thus, we propose ViF, a lightweight, plug-and-play mitigation paradigm that relays inter-agent messages with Visual Flow powered by the selected visual relay tokens and applies attention reallocation to amplify this pattern. The experiment results demonstrate that our method markedly reduces hallucination snowballing, consistently improving the performance across eight benchmarks based on four common MAS structures and ten base models. The source code is publicly available at: https://github.com/YU-deep/ViF.git.
[490] Adaptive Coopetition: Leveraging Coarse Verifier Signals for Resilient Multi-Agent LLM Reasoning
Rui Jerry Huang, Wendy Liu, Anastasia Miin, Lei Ding
Main category: cs.MA
TL;DR: AdCo is a novel inference-time framework that uses adaptive coopetition between LLM agents with UCB-based coordination to improve reasoning performance without relying on high-performance verifiers.
Details
Motivation: Existing inference-time methods have limitations: self-correction reinforces biases, Multi-Agent Collaboration lacks efficient coordination, and reliable verifiers require substantial training.
Method: LLM agents use adaptive UCB-based coopetition mechanism with coarse verifier signals to decide between collaboration and competition, iteratively refining reasoning through peer feedback.
Result: Achieves 20% relative improvement over baselines on challenging mathematical reasoning benchmarks, with robust performance across different sample sizes and configurations.
Conclusion: The adaptive coopetition framework enhances reasoning robustness by leveraging model diversity and uncertainty-driven exploration, offering a new approach to resilient multi-agent LLM systems.
Abstract: Inference-time computation is a critical yet challenging paradigm for enhancing the reasoning performance of large language models (LLMs). While existing strategies improve reasoning stability and consistency, they suffer from notable limitations: self-correction often reinforces the model’s initial biases, and Multi-Agent Collaboration (MAC) often fails due to the lack of efficient coordination mechanisms, leading to collective errors. Although high-performing verifiers can detect reasoning errors, making them reliable requires substantial training. To address these challenges, we introduce a novel inference-time framework, Adaptive Coopetition (AdCo), in which LLM agents utilize an adaptive, UCB-based “coopetition” mechanism. At each round, agents leverage coarse verifier signals to determine whether to collaborate or compete, and iteratively refine their reasoning based on peer feedback. Without relying on high-performance verifiers, our adaptive strategy achieves significant performance gains on mathematical reasoning benchmarks, yielding a 20% relative improvement over baselines on the more challenging dataset. Our approach remains robust and consistent in terms of accuracy under different sample sizes and configurations. This adaptive, signal-guided “coopetition” framework enhances reasoning robustness by leveraging both model knowledge diversity and reasoning trace measures, while also promoting uncertainty-driven exploration, especially when participants have comparable capabilities. From this perspective, our work offers a fresh lens on inference-time computation and paves the way for more resilient multi-agent LLM systems. Our code is available at: https://github.com/AdCo-Research/adaptive-coopetition.
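A minimal sketch of the collaborate-versus-compete choice, framed as a two-armed UCB bandit fed by coarse verifier signals, is given below. Class and method names are hypothetical, and AdCo's actual mechanism is richer than this.

```python
import math

class CoopetitionBandit:
    """UCB over two actions, 'collaborate' and 'compete' (illustrative only;
    the actual AdCo mechanism conditions on coarse verifier signals per round)."""

    def __init__(self, c=1.0):
        self.c = c
        self.counts = {"collaborate": 0, "compete": 0}
        self.values = {"collaborate": 0.0, "compete": 0.0}
        self.t = 0

    def choose(self):
        self.t += 1
        for arm, n in self.counts.items():
            if n == 0:                 # play each arm once before using UCB
                return arm

        def ucb(arm):
            n = self.counts[arm]
            return self.values[arm] / n + self.c * math.sqrt(math.log(self.t) / n)

        return max(self.counts, key=ucb)

    def update(self, arm, verifier_signal):
        # verifier_signal: coarse reward in [0, 1] from a weak verifier
        self.counts[arm] += 1
        self.values[arm] += verifier_signal
```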
[491] The Emergence of Complex Behavior in Large-Scale Ecological Environments
Joseph Bejjani, Chase Van Amburg, Chengrui Wang, Chloe Huangyuan Su, Sarah M. Pratt, Yasin Mazloumi, Naeem Khoshnevis, Sham M. Kakade, Kianté Brantley, Aaron Walsman
Main category: cs.MA
TL;DR: The paper investigates how physical scale and population size influence the emergence of complex behaviors in open-ended ecological environments through unsupervised evolution, reproduction, mutation, and natural selection.
Details
Motivation: To understand how complex behaviors naturally emerge in large populations due to competition and environmental pressures, rather than optimizing single policies, and explore ecology as a machine learning instrument in the era of abundant computational resources.
Method: Conducted experiments in large-scale worlds with populations over 60,000 agents, each with evolved neural network policies, using unsupervised evolution through reproduction, mutation, and natural selection in dynamic ecological environments.
Result: Identified emergent behaviors including long-range resource extraction, vision-based foraging, and predation that arise under competitive pressures. Found that larger environments and populations increase behavioral stability and consistency, with some behaviors only appearing at sufficient scales.
Conclusion: Scaling results provide promising directions for exploring ecology as a machine learning approach, demonstrating that complex behaviors naturally emerge in large-scale evolutionary settings through environmental and competitive pressures.
Abstract: We explore how physical scale and population size shape the emergence of complex behaviors in open-ended ecological environments. In our setting, agents are unsupervised and have no explicit rewards or learning objectives but instead evolve over time according to reproduction, mutation, and natural selection. As they act, agents also shape their environment and the population around them in an ongoing dynamic ecology. Our goal is not to optimize a single high-performance policy, but instead to examine how behaviors emerge and evolve across large populations due to natural competition and environmental pressures. In an effort to discover how complex behaviors naturally emerge, we conduct experiments in large-scale worlds that reach populations of more than 60,000 individual agents, each with their own evolved neural network policy. We identify various emergent behaviors such as long-range resource extraction, vision-based foraging, and predation that arise under competitive and survival pressures. We examine how sensing modalities and environmental scale affect the emergence of these behaviors, finding that some appear only in sufficiently large environments and populations, with larger scales increasing behavioral stability and consistency. While there is a rich history of research in evolutionary settings, our scaling results provide promising new directions to explore ecology as an instrument of machine learning in an era of abundant computational resources. Experimental code is available at https://github.com/jbejjani2022/ecological-emergent-behavior.
cs.MM
[492] Step-Aware Residual-Guided Diffusion for EEG Spatial Super-Resolution
Hongjun Liu, Leyu Zhou, Zijianghao Yang, Chao Yao
Main category: cs.MM
TL;DR: SRGDiff is a step-aware residual-guided diffusion model for EEG spatial super-resolution that dynamically generates high-density EEG signals from sparse measurements while maintaining fidelity and consistency.
Details
Motivation: Lightweight EEG systems are cost-effective but suffer from spatial sparsity that limits spatial fidelity and introduces bias. Existing EEG spatial super-resolution methods face challenges with distribution shift and signal distortion.
Method: Uses a diffusion model with dynamic residual conditioning that predicts step-wise temporal and spatial details from low-density input. Features additive fusion with denoiser feature maps and step-dependent affine modulation to guide high-density recovery.
Result: Achieves consistent gains of up to 40% over strong baselines across SEED, SEED-IV, and Localize-MI datasets. Shows superior performance in EEG spatial super-resolution and mitigates spatial-spectral shift between low- and high-density recordings.
Conclusion: SRGDiff effectively addresses EEG spatial super-resolution challenges, providing high-fidelity reconstructions while maintaining consistency between low- and high-density EEG signals.
Abstract: For real-world BCI applications, lightweight Electroencephalography (EEG) systems offer the best cost-deployment balance. However, the spatial sparsity of such EEG systems limits spatial fidelity, hurting learning and introducing bias. EEG spatial super-resolution methods aim to recover high-density EEG signals from sparse measurements, yet are often hindered by distribution shift and signal distortion, which reduce fidelity and usability for EEG analysis and visualization. To overcome these challenges, we introduce SRGDiff, a step-aware residual-guided diffusion model that formulates EEG spatial super-resolution as dynamic conditional generation. Our key idea is to learn a dynamic residual condition from the low-density input that predicts the step-wise temporal and spatial details to add and uses the evolving cue to steer the denoising process toward high-density reconstructions. At each denoising step, the proposed residual condition is additively fused with the previous denoiser feature maps, then a step-dependent affine modulation scales and shifts the activation to produce the current features. This iterative procedure dynamically extracts step-wise temporal rhythms and spatial-topographic cues to steer high-density recovery and maintain a fidelity-consistency balance. We adopt a comprehensive evaluation protocol spanning signal-, feature-, and downstream-level metrics across SEED, SEED-IV, and Localize-MI and multiple upsampling scales. SRGDiff achieves consistent gains of up to 40% over strong baselines, proving its superiority in the task of EEG spatial super-resolution. Moreover, topographic visualization comparisons and substantial EEG-FID gains jointly indicate that our SR EEG mitigates the spatial-spectral shift between low- and high-density recordings.
[493] CDI-DTI: A Strong Cross-domain Interpretable Drug-Target Interaction Prediction Framework Based on Multi-Strategy Fusion
Xiangyu Li, Haojie Yang, Kaimiao Hu, Runzhi Wu, Liangliang Liu, Ran Su
Main category: cs.MM
TL;DR: CDI-DTI is a cross-domain interpretable framework for drug-target interaction prediction that addresses generalization, cold-start, and interpretability challenges through multi-modal feature fusion and attention mechanisms.
Details
Motivation: Existing DTI prediction methods fail to address cross-domain generalization, cold-start prediction, and interpretability challenges, which are crucial for practical drug discovery applications.
Method: Integrates multi-modal features (textual, structural, functional) using multi-strategy fusion, multi-source cross-attention for feature alignment, bidirectional cross-attention for fine-grained interactions, Gram Loss for feature alignment, and deep orthogonal fusion to eliminate redundancy.
Result: Significantly outperforms existing methods on benchmark datasets, particularly in cross-domain and cold-start tasks, while maintaining high interpretability.
Conclusion: CDI-DTI provides a robust and interpretable solution for DTI prediction that effectively addresses key limitations of existing methods, making it suitable for practical drug discovery applications.
Abstract: Accurate prediction of drug-target interactions (DTI) is pivotal for drug discovery, yet existing methods often fail to address challenges like cross-domain generalization, cold-start prediction, and interpretability. In this work, we propose CDI-DTI, a novel cross-domain interpretable framework for DTI prediction, designed to overcome these limitations. By integrating multi-modal features-textual, structural, and functional-through a multi-strategy fusion approach, CDI-DTI ensures robust performance across different domains and in cold-start scenarios. A multi-source cross-attention mechanism is introduced to align and fuse features early, while a bidirectional cross-attention layer captures fine-grained intra-modal drug-target interactions. To enhance model interpretability, we incorporate Gram Loss for feature alignment and a deep orthogonal fusion module to eliminate redundancy. Experimental results on several benchmark datasets demonstrate that CDI-DTI significantly outperforms existing methods, particularly in cross-domain and cold-start tasks, while maintaining high interpretability for practical applications in drug-target interaction prediction.
eess.AS
[494] RIR-Mega: a large-scale simulated room impulse response dataset for machine learning and room acoustics modeling
Mandip Goswami
Main category: eess.AS
TL;DR: RIR-Mega is a large collection of 50,000 simulated room impulse responses with metadata, tools, and a baseline model for RT60 prediction.
Details
Motivation: Room impulse responses are essential for dereverberation, speech recognition, source localization, and room acoustics estimation, but existing datasets lack comprehensive metadata and tools.
Method: Created a dataset with simulated RIRs using a compact metadata schema, distributed with validation tools, Hugging Face loader, and a Random Forest baseline model using time and spectral features.
Result: The baseline model achieved an MAE of 0.013 s and an RMSE of 0.022 s on train/validation splits of 36,000/4,000 examples. The dataset includes 1,000 linear-array and 3,000 circular-array RIRs on Hugging Face, with the full 50,000 RIRs on Zenodo.
Conclusion: RIR-Mega provides a comprehensive, publicly available dataset with tools to support reproducible research in room acoustics and related applications.
Abstract: Room impulse responses are a core resource for dereverberation, robust speech recognition, source localization, and room acoustics estimation. We present RIR-Mega, a large collection of simulated RIRs described by a compact, machine-friendly metadata schema and distributed with simple tools for validation and reuse. The dataset ships with a Hugging Face Datasets loader, scripts for metadata checks and checksums, and a reference regression baseline that predicts RT60-like targets from waveforms. On a train and validation split of 36,000 and 4,000 examples, a small Random Forest on lightweight time and spectral features reaches a mean absolute error near 0.013 s and a root mean square error near 0.022 s. We host a subset with 1,000 linear array RIRs and 3,000 circular array RIRs on Hugging Face for streaming and quick tests, and preserve the complete 50,000 RIR archive on Zenodo. The dataset and code are public to support reproducible studies.
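The baseline described above is a small Random Forest regressor on lightweight features. The sketch below reproduces that recipe on synthetic stand-in features and targets, since the dataset's actual feature extraction is not shown here; numbers printed by this toy example will not match the reported 0.013 s MAE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder features: in the released baseline these would be lightweight
# time and spectral descriptors extracted from each RIR waveform.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 16))      # e.g. energy-decay and spectral statistics
y = 0.3 + 0.5 * rng.random(1000)     # stand-in RT60-like targets, in seconds

X_train, X_val = X[:800], X[800:]
y_train, y_val = y[:800], y[800:]

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_val)

mae = mean_absolute_error(y_val, pred)
rmse = mean_squared_error(y_val, pred) ** 0.5
print(f"MAE={mae:.3f} s  RMSE={rmse:.3f} s")
```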
[495] StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction
Qianheng Xu
Main category: eess.AS
TL;DR: StutterZero and StutterFormer are the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting transcription, achieving significant improvements over existing methods.
Details
Motivation: Over 70 million people worldwide experience stuttering, but current automatic speech systems often misinterpret disfluent utterances. Existing methods rely on handcrafted features or multi-stage pipelines that separate transcription from audio reconstruction and amplify distortions.
Method: StutterZero uses a convolutional-bidirectional LSTM encoder-decoder with attention, while StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both are trained on paired stuttered-fluent data from SEP-28K and LibriStutter corpora.
Result: StutterZero achieved 24% decrease in Word Error Rate and 31% improvement in semantic similarity (BERTScore) compared to Whisper-Medium. StutterFormer performed even better with 28% decrease in WER and 34% improvement in BERTScore.
Conclusion: The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
Abstract: Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero had a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.
[496] Auditory Attention Decoding from Ear-EEG Signals: A Dataset with Dynamic Attention Switching and Rigorous Cross-Validation
Yuanming Zhang, Zeyan Song, Jing Lu, Fei Chen, Zhibin Lin
Main category: eess.AS
TL;DR: This paper introduces a novel cEEGrid ear-EEG dataset for auditory attention decoding (AAD) that captures dynamic attentional switching across multiple speakers in realistic spatial scenarios, using rigorous validation methods and achieving 41-42% accuracy with rule-based models.
Details
Motivation: To address the gap in dynamic attentional state tracking in real-world contexts using portable ear-EEG (cEEGrid) systems, as prior studies often neglect the dynamic nature of attention in realistic scenarios.
Method: Created a novel cEEGrid dataset with three concurrent speakers across five spatial locations, used nested leave-one-out validation for rigor, and evaluated four rule-based models: Wiener filter (WF), canonical component analysis (CCA), common spatial pattern (CSP), and Riemannian Geometry-based classifier (RGC).
Result: WF and CCA achieved 41.5% and 41.4% accuracy with 30-second windows, while CSP and RGC achieved 37.8% and 37.6% with 10-second windows. WF and CCA successfully tracked attentional switches across all tasks, with better performance from upper cEEGrid electrodes and right ear positions.
Conclusion: Dynamic ecological paradigms and rigorous validation are crucial for advancing AAD research with cEEGrid, demonstrating the system’s utility for tracking real-world attentional dynamics.
Abstract: Recent promising results in auditory attention decoding (AAD) using scalp electroencephalography (EEG) have motivated the exploration of cEEGrid, a flexible and portable ear-EEG system. While prior cEEGrid-based studies have confirmed the feasibility of AAD, they often neglect the dynamic nature of attentional states in real-world contexts. To address this gap, a novel cEEGrid dataset featuring three concurrent speakers distributed across three of five distinct spatial locations is introduced. The novel dataset is designed to probe attentional tracking and switching in realistic scenarios. Nested leave-one-out validation, an approach more rigorous than conventional single-loop leave-one-out validation, is employed to reduce biases stemming from EEG's intricate temporal dynamics. Four rule-based models are evaluated: Wiener filter (WF), canonical component analysis (CCA), common spatial pattern (CSP) and Riemannian Geometry-based classifier (RGC). With a 30-second decision window, WF and CCA models achieve decoding accuracies of 41.5% and 41.4%, respectively, while CSP and RGC models yield 37.8% and 37.6% accuracies using a 10-second window. Notably, both WF and CCA successfully track attentional state switches across all experimental tasks. Additionally, higher decoding accuracies are observed for electrodes positioned at the upper cEEGrid layout and near the listener's right ear. These findings underscore the utility of dynamic, ecologically valid paradigms and rigorous validation in advancing AAD research with cEEGrid.
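For readers unfamiliar with CCA-based AAD, the sketch below shows a generic decoder that fits one CCA per candidate speech envelope and picks the speaker with the highest canonical correlation. This is a standard baseline pattern, not the paper's exact pipeline, preprocessing, or nested validation scheme.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def decode_attention_cca(eeg, envelopes):
    """Generic CCA-based auditory attention decoder (illustrative baseline).

    eeg: (n_samples, n_channels) EEG segment within one decision window.
    envelopes: list of (n_samples,) candidate speech envelopes, one per speaker.
    Returns the index of the speaker with the highest canonical correlation.
    """
    scores = []
    for env in envelopes:
        cca = CCA(n_components=1)
        x_scores, y_scores = cca.fit_transform(eeg, env.reshape(-1, 1))
        r = np.corrcoef(x_scores[:, 0], y_scores[:, 0])[0, 1]
        scores.append(r)
    return int(np.argmax(scores)), scores
```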
[497] An Efficient Neural Network for Modeling Human Auditory Neurograms for Speech
Eylon Zohar, Israel Nelken, Boaz Rafaely
Main category: eess.AS
TL;DR: A compact convolutional encoder that approximates the deterministic mean-rate pathway of classical auditory-periphery models, enabling efficient neurogram generation without stochastic spiking effects.
Details
Motivation: Classical auditory-periphery models are computationally demanding and stochastic, limiting large-scale experimentation and low-latency applications. Existing neural encoders rarely reproduce deterministic rate-domain neurograms for direct comparison.
Method: A compact convolutional encoder trained to approximate the Bruce mean-rate pathway, mapping audio to multi-frequency neurograms with deterministic outputs (identical results for identical inputs).
Result: The encoder achieves close correspondence to the reference Bruce model while significantly reducing computational requirements, enabling efficient modeling.
Conclusion: The proposed encoder provides an efficient alternative to classical auditory-periphery models for auditory neuroscience and audio signal processing applications, maintaining accuracy while reducing computational demands.
Abstract: Classical auditory-periphery models, exemplified by Bruce et al., 2018, provide high-fidelity simulations but are stochastic and computationally demanding, limiting large-scale experimentation and low-latency use. Prior neural encoders approximate aspects of the periphery; however, few are explicitly trained to reproduce the deterministic, rate-domain neurogram, hindering like-for-like evaluation. We present a compact convolutional encoder that approximates the Bruce mean-rate pathway and maps audio to a multi-frequency neurogram. We deliberately omit stochastic spiking effects and focus on a deterministic mapping (identical outputs for identical inputs). Using a computationally efficient design, the encoder achieves close correspondence to the reference while significantly reducing computation, enabling efficient modeling and front-end processing for auditory neuroscience and audio signal processing applications.
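As an architectural illustration only (not the authors' model), a tiny convolutional waveform-to-neurogram encoder could look like the following; the layer sizes, hop length, and the Softplus output for non-negative firing rates are all assumptions.

```python
import torch
import torch.nn as nn

class NeurogramEncoder(nn.Module):
    """Tiny convolutional audio-to-neurogram encoder (architectural sketch):
    maps a mono waveform to a deterministic, rate-domain neurogram with
    n_cf frequency channels."""

    def __init__(self, n_cf=32, hidden=64, hop=160):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=hop * 2, stride=hop, padding=hop),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=9, padding=4),
            nn.ReLU(),
            nn.Conv1d(hidden, n_cf, kernel_size=1),
            nn.Softplus(),                      # non-negative firing rates
        )

    def forward(self, wav):                     # wav: (batch, samples)
        return self.net(wav.unsqueeze(1))       # (batch, n_cf, frames)

enc = NeurogramEncoder()
neurogram = enc(torch.randn(2, 16000))          # 1 s of 16 kHz audio
print(neurogram.shape)                          # torch.Size([2, 32, 101])
```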
[498] EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
Tong Zhang, Yihuan Huang, Yanzhen Ren
Main category: eess.AS
TL;DR: EchoFake dataset addresses the performance gap in speech anti-spoofing systems against physical replay attacks, containing 120+ hours of audio from 13,000+ speakers with TTS speech and real-world replay recordings.
Details
Motivation: Existing anti-spoofing systems perform well on lab-generated synthetic speech but fail against practical physical replay attacks used in real-world scenarios like telephone fraud, with accuracy dropping to 59.6% on replayed audio.
Method: Created EchoFake dataset with 120+ hours of audio from 13,000+ speakers, featuring zero-shot TTS speech and physical replay recordings collected under varied devices and real-world environmental settings. Evaluated three baseline detection models.
Result: Models trained on EchoFake achieve lower average Equal Error Rates (EERs) across datasets, indicating better generalization compared to models trained on existing datasets.
Conclusion: EchoFake provides a more realistic foundation for advancing spoofing detection methods by introducing practical challenges relevant to real-world deployment, bridging the gap between lab performance and real-world effectiveness.
Abstract: The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack used in practical settings. Our experiments show that models trained on existing datasets exhibit severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both cutting-edge zero-shot text-to-speech (TTS) speech and physical replay recordings collected under varied devices and real-world environmental settings. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average EERs across datasets, indicating better generalization. By introducing more practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.
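Spoofing detectors on datasets like this are typically compared by equal error rate. A small, self-contained EER computation is sketched below as a reference for how such numbers are obtained; it is generic and not code from the EchoFake release.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Equal error rate: scores are detector outputs where higher means
    'bonafide'; labels are 1 for bonafide speech and 0 for spoof/replay."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    thresholds = np.sort(np.unique(scores))
    fars, frrs = [], []
    for t in thresholds:
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))    # spoofs accepted
        frrs.append(np.mean(~accept[labels == 1]))   # bonafide rejected
    fars, frrs = np.array(fars), np.array(frrs)
    i = np.argmin(np.abs(fars - frrs))               # point where FAR ~= FRR
    return (fars[i] + frrs[i]) / 2

print(equal_error_rate([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # 0.0
```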
[499] Relative Transfer Matrix Estimator using Covariance Subtraction
Wageesha N. Manamperi, Thushara D. Abhayapala
Main category: eess.AS
TL;DR: A flexible method using covariance subtraction to estimate Relative Transfer Matrix (ReTM) for independent sound sources, validated through speaker separation in reverberant conditions.
Details
Motivation: Blind estimation of ReTM from multichannel recordings is beneficial for practical applications in speech enhancement and speaker separation.
Method: Covariance subtraction approach to estimate ReTM for selected independent sound sources.
Result: Validated through a speaker separation application under reverberant conditions, with separation performance evaluated at low SNR levels against existing ReTM-based and RTF-based estimators in both simulated and real-life environments.
Conclusion: The proposed method is versatile and practically viable for ReTM estimation in real-world environments.
Abstract: The Relative Transfer Matrix (ReTM), recently introduced as a generalization of the relative transfer function for multiple receivers and sources, shows promising performance when applied to speech enhancement and speaker separation in noisy environments. Blindly estimating the ReTM of sound sources by exploiting the covariance matrices of multichannel recordings is highly beneficial for practical applications. In this paper, we use covariance subtraction to present a flexible and practically viable method for estimating the ReTM for a select set of independent sound sources. To show the versatility of the method, we validated it through a speaker separation application under reverberant conditions. Separation performance is evaluated at low signal-to-noise ratio levels in comparison with existing ReTM-based and relative transfer function-based estimators, in both simulated and real-life environments.
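A minimal single-reference special case of the covariance-subtraction idea is sketched below: subtracting the noise-only covariance from the noisy covariance leaves an approximately rank-one term whose normalised reference column gives the relative transfer function. The ReTM generalises this to blocks of receivers and multiple sources; the toy signal model here is an assumption for illustration only.

```python
import numpy as np

def covariance(frames):
    """Sample covariance E[x x^H] from STFT frames of shape (n_frames, n_mics)."""
    return frames.T @ frames.conj() / frames.shape[0]

def rtf_covariance_subtraction(frames_noisy, frames_noise_only, ref=0):
    """Single-source relative transfer function via covariance subtraction."""
    c_diff = covariance(frames_noisy) - covariance(frames_noise_only)
    # For one source, c_diff is approximately rank one: h * power * h^H.
    # Its reference column, normalised by the reference element, gives the RTF.
    return c_diff[:, ref] / c_diff[ref, ref]

rng = np.random.default_rng(1)
h = np.array([1.0, 0.8 - 0.3j, 0.5 + 0.6j])             # toy transfer functions
s = rng.normal(size=2000) + 1j * rng.normal(size=2000)  # source STFT coefficients
noise = lambda: 0.1 * (rng.normal(size=(2000, 3)) + 1j * rng.normal(size=(2000, 3)))
frames_noisy = np.outer(s, h) + noise()
frames_noise_only = noise()                              # noise-only segment
print(np.round(rtf_covariance_subtraction(frames_noisy, frames_noise_only), 2))
```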
[500] VBx for End-to-End Neural and Clustering-based Diarization
Petr Pálka, Jiangyu Han, Marc Delcroix, Naohiro Tawara, Lukáš Burget
Main category: eess.AS
TL;DR: Improvements to EEND-VC speaker diarization framework focusing on second-stage clustering by filtering unreliable embeddings, reassigning them after clustering, and integrating VBx clustering for better robustness with large speaker counts and limited speaking durations.
Details
Motivation: To enhance the second stage of EEND-VC framework for better speaker diarization performance, particularly addressing challenges with unreliable embeddings and robustness in scenarios with many speakers and limited speaking durations.
Method: Two-stage approach: 1) Conformer-based EEND model with WavLM features for frame-level speaker activity in short windows, 2) Improved clustering stage with unreliable embedding filtering, reassignment after clustering, and VBx clustering integration for better handling of large speaker counts.
Result: The system generalizes well across multiple domains without fine-tuning or parameter tuning per dataset, matching or exceeding recent state-of-the-art performance on a compound benchmark spanning multiple domains.
Conclusion: The proposed improvements to the second stage clustering in EEND-VC framework effectively enhance speaker diarization performance and generalization across diverse domains without requiring dataset-specific tuning.
Abstract: We present improvements to speaker diarization in the two-stage end-to-end neural diarization with vector clustering (EEND-VC) framework. The first stage employs a Conformer-based EEND model with WavLM features to infer frame-level speaker activity within short windows. The identities and counts of global speakers are then derived in the second stage by clustering speaker embeddings across windows. The focus of this work is to improve the second stage; we filter unreliable embeddings from short segments and reassign them after clustering. We also integrate the VBx clustering to improve robustness when the number of speakers is large and individual speaking durations are limited. Evaluation on a compound benchmark spanning multiple domains is conducted without fine-tuning the EEND model or tuning clustering parameters per dataset. Despite this, the system generalizes well and matches or exceeds recent state-of-the-art performance.
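One plausible reading of the "filter and reassign" step is sketched below: embeddings from short segments are held out of clustering and later assigned to the nearest centroid by cosine similarity. Plain agglomerative clustering stands in for VBx here, so this is illustrative rather than the paper's pipeline.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_with_reassignment(embeddings, durations, n_speakers, min_dur=1.0):
    """Cluster speaker embeddings, holding out short-segment embeddings and
    reassigning them afterwards (sketch; the paper's system uses VBx).

    embeddings: (n_segments, dim); durations: seconds per segment.
    """
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    reliable = durations >= min_dur

    labels = np.full(len(embeddings), -1)
    labels[reliable] = AgglomerativeClustering(
        n_clusters=n_speakers).fit_predict(embeddings[reliable])

    centroids = np.stack([embeddings[labels == k].mean(axis=0)
                          for k in range(n_speakers)])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)

    # Reassign the held-out (unreliable) embeddings by cosine similarity.
    for i in np.where(~reliable)[0]:
        labels[i] = int(np.argmax(centroids @ embeddings[i]))
    return labels
```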
[501] SongBloom: Coherent Song Generation via Interleaved Autoregressive Sketching and Diffusion Refinement
Chenyu Yang, Shuai Wang, Hangting Chen, Wei Tan, Jianwei Yu, Haizhou Li
Main category: eess.AS
TL;DR: SongBloom is a novel framework for full-length song generation that combines autoregressive sketching with diffusion-based refinement to create coherent, high-quality music with balanced global structure and local fidelity.
Details
Motivation: Existing music generation methods struggle with balancing global coherence and local fidelity, often producing outputs that lack musicality or have incoherent progression and mismatched lyrics.
Method: Uses an interleaved paradigm of autoregressive sketching and diffusion-based refinement, gradually extending musical sketches from short to long and refining details from coarse to fine-grained.
Result: Outperforms existing methods across both subjective and objective metrics, achieving performance comparable to state-of-the-art commercial music generation platforms.
Conclusion: SongBloom effectively integrates prior semantic and acoustic context to guide generation, demonstrating superior performance in generating coherent, high-fidelity full-length songs.
Abstract: Generating music with coherent structure, harmonious instrumental and vocal elements remains a significant challenge in song generation. Existing language models and diffusion-based methods often struggle to balance global coherence with local fidelity, resulting in outputs that lack musicality or suffer from incoherent progression and mismatched lyrics. This paper introduces $\textbf{SongBloom}$, a novel framework for full-length song generation that leverages an interleaved paradigm of autoregressive sketching and diffusion-based refinement. SongBloom employs an autoregressive diffusion model that combines the high fidelity of diffusion models with the scalability of language models. Specifically, it gradually extends a musical sketch from short to long and refines the details from coarse to fine-grained. The interleaved generation paradigm effectively integrates prior semantic and acoustic context to guide the generation process. Experimental results demonstrate that SongBloom outperforms existing methods across both subjective and objective metrics and achieves performance comparable to the state-of-the-art commercial music generation platforms. Audio samples are available on our demo page: https://cypress-yang.github.io/SongBloom_demo. The code and model weights have been released on https://github.com/Cypress-Yang/SongBloom .
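A structural sketch of the interleaved paradigm, with `sketch_step` and `refine_step` as placeholder callables (not the released API): each chunk is first extended autoregressively at the coarse sketch level, then refined into fine-grained acoustics by a diffusion step conditioned on everything generated so far.

```python
def generate_song(lyrics, n_chunks, sketch_step, refine_step):
    """Interleave coarse autoregressive sketching with diffusion refinement.
    `sketch_step` and `refine_step` are hypothetical stand-in callables."""
    sketch, audio = [], []
    for _ in range(n_chunks):
        # Autoregressive step: extend the coarse semantic sketch by one chunk,
        # conditioned on the lyrics and on previously refined audio.
        sketch.append(sketch_step(lyrics, sketch, audio))
        # Diffusion step: refine the new chunk into fine-grained acoustics,
        # conditioned on the full sketch and the audio generated so far.
        audio.append(refine_step(sketch, audio))
    return audio
```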
[502] Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?
Rostislav Makarov, Lea Schönherr, Timo Gerkmann
Main category: eess.AS
TL;DR: Advanced speech enhancement models are vulnerable to adversarial attacks where carefully crafted noise can manipulate the enhanced output to convey different semantic meaning, though diffusion models show inherent robustness.
Details
Motivation: As machine learning approaches for speech enhancement become more expressive, they introduce new vulnerabilities to adversarial manipulation of semantic meaning in enhanced speech.
Method: Injecting carefully crafted adversarial noise that is psychoacoustically masked by the original input to manipulate speech enhancement models’ outputs.
Result: Experimental verification shows contemporary predictive speech enhancement models can be manipulated to produce enhanced speech with entirely different semantic meaning.
Conclusion: Diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design, providing a more secure alternative.
Abstract: Machine learning approaches for speech enhancement are becoming increasingly expressive, enabling ever more powerful modifications of input signals. In this paper, we demonstrate that this expressiveness introduces a vulnerability: advanced speech enhancement models can be susceptible to adversarial attacks. Specifically, we show that adversarial noise, carefully crafted and psychoacoustically masked by the original input, can be injected such that the enhanced speech output conveys an entirely different semantic meaning. We experimentally verify that contemporary predictive speech enhancement models can indeed be manipulated in this way. Furthermore, we highlight that diffusion models with stochastic samplers exhibit inherent robustness to such adversarial attacks by design.
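A hedged PyTorch sketch of this style of attack, assuming a differentiable `enhancer` and an attacker-chosen `target` waveform; the per-sample amplitude bound `mask_scale` is a crude stand-in for a real psychoacoustic masking threshold, not the paper's masking model.

```python
import torch

def masked_adversarial_noise(enhancer, x, target, mask_scale, steps=200, lr=1e-3):
    """Gradient-based attack: find additive noise, kept under an (assumed)
    masking envelope, that pushes the enhanced output towards a target signal.

    enhancer : differentiable speech enhancement model, waveform -> waveform
    x        : (1, T) input waveform
    target   : (1, T) waveform carrying the attacker's intended content
    """
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Constrain the perturbation amplitude below the masking envelope.
        noise = mask_scale * torch.tanh(delta)
        out = enhancer(x + noise)
        loss = torch.nn.functional.mse_loss(out, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (mask_scale * torch.tanh(delta)).detach()
```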
eess.IV
[503] TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models
Chen Ma, Jing Jiao, Shuyu Liang, Junhu Fu, Qin Wang, Zeju Li, Yuanyuan Wang, Yi Guo
Main category: eess.IV
TL;DR: TinyUSFM is a lightweight ultrasound foundation model that achieves computational efficiency while maintaining performance through knowledge distillation with curated datasets, domain-separated masked image modeling, and consistency-driven dynamic distillation.
Details
Motivation: Foundation models for medical imaging require substantial computational resources, limiting deployment in resource-constrained clinical environments. There's a need for lightweight models that maintain performance while being computationally efficient.
Method: Proposed TinyUSFM using: 1) Feature-gradient driven coreset selection for high-quality compact training data, 2) Domain-separated masked image modeling assisted consistency-driven dynamic distillation, 3) Knowledge distillation from large foundation models leveraging teacher model consistency across different domain masks.
Result: TinyUSFM matches the original USFM’s performance with only 6.36% of parameters and 6.40% of GFLOPs. It outperforms the vanilla model by 9.45% in classification and 7.72% in segmentation, achieving 84.91% average classification accuracy and 85.78% average segmentation Dice score across diverse medical devices and centers.
Conclusion: TinyUSFM successfully demonstrates that lightweight foundation models can maintain superior performance while being computationally efficient, enabling deployment in resource-constrained clinical settings without sacrificing organ versatility and task adaptability.
Abstract: Foundation models for medical imaging demonstrate superior generalization capabilities across diverse anatomical structures and clinical applications. Their outstanding performance relies on substantial computational resources, limiting deployment in resource-constrained clinical environments. This paper presents TinyUSFM, the first lightweight ultrasound foundation model that maintains superior organ versatility and task adaptability of our large-scale Ultrasound Foundation Model (USFM) through knowledge distillation with strategically curated small datasets, delivering significant computational efficiency without sacrificing performance. Considering the limited capacity and representation ability of lightweight models, we propose a feature-gradient driven coreset selection strategy to curate high-quality compact training data, avoiding training degradation from low-quality redundant images. To preserve the essential spatial and frequency domain characteristics during knowledge transfer, we develop domain-separated masked image modeling assisted consistency-driven dynamic distillation. This novel framework adaptively transfers knowledge from large foundation models by leveraging teacher model consistency across different domain masks, specifically tailored for ultrasound interpretation. For evaluation, we establish the UniUS-Bench, the largest publicly available ultrasound benchmark comprising 8 classification and 10 segmentation datasets across 15 organs. Using only 200K images in distillation, TinyUSFM matches USFM’s performance with just 6.36% of parameters and 6.40% of GFLOPs. TinyUSFM significantly outperforms the vanilla model by 9.45% in classification and 7.72% in segmentation, surpassing all state-of-the-art lightweight models, and achieving 84.91% average classification accuracy and 85.78% average segmentation Dice score across diverse medical devices and centers.
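One way to read "feature-gradient driven coreset selection" is to score each candidate image by the gradient signal its loss induces on the encoder features and keep the top-scoring, non-redundant images. The sketch below illustrates that reading only; `model.encode`/`model.decode` and the loss are hypothetical placeholders, and the paper's actual criterion may differ.

```python
import torch

def coreset_scores(model, images, loss_fn):
    """Score candidate images by the feature-gradient norm of their
    reconstruction loss (illustrative reading of feature-gradient driven
    coreset selection; `encode`/`decode` are assumed placeholder methods)."""
    scores = []
    for img in images:                      # img: (1, C, H, W) tensor
        feats = model.encode(img)
        feats.retain_grad()                 # keep gradients on the features
        loss = loss_fn(model.decode(feats), img)
        model.zero_grad()
        loss.backward()
        scores.append(feats.grad.norm().item())
    # Downstream: keep the highest-scoring, mutually diverse images
    # as the compact distillation set.
    return scores
```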
[504] Automated Morphological Analysis of Neurons in Fluorescence Microscopy Using YOLOv8
Banan Alnemri, Arwa Basbrain
Main category: eess.IV
TL;DR: A pipeline for automated neuron instance segmentation and morphological analysis using YOLOv8 trained on stem-cell-derived neuron microscopy images, achieving high segmentation accuracy and enabling scalable neuron morphology quantification.
Details
Motivation: Manual segmentation and morphological analysis of neuronal cells in fluorescence microscopy images is labor-intensive and time-consuming, requiring significant manual effort and expertise.
Method: Uses YOLOv8 trained on manually annotated microscopy images from a high-resolution dataset of stem-cell-derived neurons, with pipeline extracting biologically significant features from both ground truth and predicted masks.
Result: Achieved segmentation accuracy exceeding 97%; the overall accuracy of the extracted morphological measurements (cell length, width, area, grayscale intensity) reached 75.32%.
Conclusion: The integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing manual annotation needs and enabling scalable, precise quantification of neuron morphology.
Abstract: Accurate segmentation and precise morphological analysis of neuronal cells in fluorescence microscopy images are crucial steps in neuroscience and biomedical imaging applications. However, this process is labor-intensive and time-consuming, requiring significant manual effort and expertise to ensure reliable outcomes. This work presents a pipeline for neuron instance segmentation and measurement based on a high-resolution dataset of stem-cell-derived neurons. The proposed method uses YOLOv8, trained on manually annotated microscopy images. The model achieved high segmentation accuracy, exceeding 97%. In addition, the pipeline utilized both ground truth and predicted masks to extract biologically significant features, including cell length, width, area, and grayscale intensity values. The overall accuracy of the extracted morphological measurements reached 75.32%, further supporting the effectiveness of the proposed approach. This integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing the need for manual annotation and enabling scalable, precise quantification of neuron morphology.
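A short OpenCV/numpy sketch of the kind of per-instance measurements the pipeline reports (area, length, width, mean grayscale intensity), computed from one predicted binary mask; the exact feature definitions in the paper may differ.

```python
import numpy as np
import cv2

def neuron_morphology(mask, image):
    """Extract simple morphological features from a single-neuron mask.

    mask  : (H, W) uint8 binary mask of one predicted neuron instance
    image : (H, W) grayscale fluorescence image
    """
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    cnt = max(contours, key=cv2.contourArea)

    area = cv2.contourArea(cnt)
    (_, _), (w, h), _ = cv2.minAreaRect(cnt)   # rotated bounding-box sides
    length, width = max(w, h), min(w, h)
    mean_intensity = float(image[mask > 0].mean())

    return {"area": area, "length": length,
            "width": width, "intensity": mean_intensity}
```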
[505] Discretized Gaussian Representation for Tomographic Reconstruction
Shaokai Wu, Yuxiang Lu, Yapan Guo, Wei Ji, Suizhi Huang, Fengyu Yang, Shalayiding Sirejiding, Qichen He, Jing Tong, Yanbiao Ji, Yue Ding, Hongtao Lu
Main category: eess.IV
TL;DR: Proposes Discretized Gaussian Representation (DGR) for efficient 3D CT volume reconstruction using discretized Gaussian functions with Fast Volume Reconstruction for parallel processing.
Details
Motivation: Address challenges in balancing CT reconstruction quality and computational efficiency, overcoming limitations of deep learning methods (large data requirements) and scene reconstruction approaches (unsuitable for direct volumetric CT).
Method: DGR framework reconstructs 3D volume directly using discretized Gaussian functions in end-to-end manner, enhanced by Fast Volume Reconstruction - a parallelized technique that aggregates Gaussian contributions into voxel grid with minimal overhead.
Result: Extensive experiments on real-world and synthetic datasets show DGR achieves superior reconstruction quality and runtime performance across various CT reconstruction scenarios.
Conclusion: DGR provides an effective solution for high-quality, efficient CT volume reconstruction with publicly available implementation.
Abstract: Computed Tomography (CT) enables detailed cross-sectional imaging but continues to face challenges in balancing reconstruction quality and computational efficiency. While deep learning-based methods have significantly improved image quality and noise reduction, they typically require large-scale training data and intensive computation. Recent advances in scene reconstruction, such as Neural Radiance Fields and 3D Gaussian Splatting, offer alternative perspectives but are not well-suited for direct volumetric CT reconstruction. In this work, we propose Discretized Gaussian Representation (DGR), a novel framework that reconstructs the 3D volume directly using a set of discretized Gaussian functions in an end-to-end manner. To further enhance efficiency, we introduce Fast Volume Reconstruction, a highly parallelized technique that aggregates Gaussian contributions into the voxel grid with minimal overhead. Extensive experiments on both real-world and synthetic datasets demonstrate that DGR achieves superior reconstruction quality and runtime performance across various CT reconstruction scenarios. Our code is publicly available at https://github.com/wskingdom/DGR.
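The core of the Fast Volume Reconstruction step is aggregating Gaussian contributions into a voxel grid. Below is a minimal numpy sketch for isotropic Gaussians, each splatted only within a local cutoff window so the splats are independent and parallelizable; DGR's actual kernels and parameterization are more elaborate.

```python
import numpy as np

def splat_gaussians(centers, sigmas, weights, grid_shape,
                    voxel_size=1.0, cutoff=3.0):
    """Accumulate isotropic Gaussians into a voxel grid (illustrative sketch).

    centers : iterable of (z, y, x) world coordinates
    sigmas, weights : per-Gaussian scalars
    grid_shape : (D, H, W) voxel grid size
    """
    vol = np.zeros(grid_shape, dtype=np.float32)
    zs, ys, xs = [np.arange(n) * voxel_size for n in grid_shape]
    for c, s, w in zip(centers, sigmas, weights):
        c = np.asarray(c, dtype=float)
        # Evaluate only inside a cutoff*sigma window around the center.
        lo = np.maximum(((c - cutoff * s) / voxel_size).astype(int), 0)
        hi = np.minimum(((c + cutoff * s) / voxel_size).astype(int) + 1,
                        grid_shape)
        Z, Y, X = np.meshgrid(zs[lo[0]:hi[0]], ys[lo[1]:hi[1]],
                              xs[lo[2]:hi[2]], indexing="ij")
        d2 = (Z - c[0])**2 + (Y - c[1])**2 + (X - c[2])**2
        vol[lo[0]:hi[0], lo[1]:hi[1], lo[2]:hi[2]] += w * np.exp(-0.5 * d2 / s**2)
    return vol
```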
[506] Fast MRI for All: Bridging Access Gaps by Training without Raw Data
Yaşar Utku Alçalar, Merve Gülle, Mehmet Akçakaya
Main category: eess.IV
TL;DR: CUPID enables physics-driven deep learning for fast MRI reconstruction using only routine clinical images instead of raw k-space data, making advanced MRI accessible to rural and under-resourced areas.
Details
Motivation: Current PD-DL methods require raw k-space data only available at specialized centers, limiting access for rural/under-resourced areas that only have reconstructed images. This creates generalization challenges for rare pathologies and different populations.
Method: CUPID uses compressibility-inspired unsupervised learning via parallel imaging fidelity, evaluating output quality with compressibility-based approach while ensuring consistency with clinical parallel imaging through well-designed perturbations.
Result: CUPID achieves similar quality to established PD-DL training requiring k-space data, outperforms compressed sensing and diffusion-based methods, and works effectively in zero-shot training for retrospective and prospective sub-sampled acquisitions.
Conclusion: CUPID presents a radical approach that can provide broader access to fast MRI for remote populations by eliminating the need for raw k-space data, reducing obstacles associated with expensive MRI imaging.
Abstract: Physics-driven deep learning (PD-DL) approaches have become popular for improved reconstruction of fast magnetic resonance imaging (MRI) scans. Though PD-DL offers higher acceleration rates than existing clinical fast MRI techniques, their use has been limited outside specialized MRI centers. A key challenge is generalization to rare pathologies or different populations, noted in multiple studies, with fine-tuning on target populations suggested for improvement. However, current approaches for PD-DL training require access to raw k-space measurements, which is typically only available at specialized MRI centers that have research agreements for such data access. This is especially an issue for rural and under-resourced areas, where commercial MRI scanners only provide access to a final reconstructed image. To tackle these challenges, we propose Compressibility-inspired Unsupervised Learning via Parallel Imaging Fidelity (CUPID) for high-quality PD-DL training using only routine clinical reconstructed images exported from an MRI scanner. CUPID evaluates output quality with a compressibility-based approach while ensuring that the output stays consistent with the clinical parallel imaging reconstruction through well-designed perturbations. Our results show CUPID achieves similar quality to established PD-DL training that requires k-space data while outperforming compressed sensing (CS) and diffusion-based generative methods. We further demonstrate its effectiveness in a zero-shot training setup for retrospectively and prospectively sub-sampled acquisitions, attesting to its minimal training burden. As an approach that radically deviates from existing strategies, CUPID presents an opportunity to provide broader access to fast MRI for remote and rural populations in an attempt to reduce the obstacles associated with this expensive imaging modality.
[507] Advancing Image Super-resolution Techniques in Remote Sensing: A Comprehensive Survey
Yunliang Qi, Meng Lou, Yimin Liu, Lu Li, Zhen Yang, Wen Nie
Main category: eess.IV
TL;DR: This paper provides a comprehensive review of remote sensing image super-resolution (RSISR) methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, and identifies limitations in preserving textures and geometric structures.
Details
Motivation: Despite the growing number of RSISR methods, there is a lack of a systematic and comprehensive review of these methods, making it difficult for researchers to understand current trends and challenges in the field.
Method: The paper conducts a thorough review of RSISR algorithms by analyzing methodologies, datasets, and evaluation metrics, categorizing existing methods into supervised, unsupervised, and quality evaluation approaches.
Result: The review reveals significant limitations in existing RSISR methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation scenarios.
Conclusion: Future research should focus on developing domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.
Abstract: Remote sensing image super-resolution (RSISR) is a crucial task in remote sensing image processing, aiming to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Despite the growing number of RSISR methods proposed in recent years, a systematic and comprehensive review of these methods is still lacking. This paper presents a thorough review of RSISR algorithms, covering methodologies, datasets, and evaluation metrics. We provide an in-depth analysis of RSISR methods, categorizing them into supervised, unsupervised, and quality evaluation approaches, to help researchers understand current trends and challenges. Our review also discusses the strengths, limitations, and inherent challenges of these techniques. Notably, our analysis reveals significant limitations in existing methods, particularly in preserving fine-grained textures and geometric structures under large-scale degradation. Based on these findings, we outline future research directions, highlighting the need for domain-specific architectures and robust evaluation protocols to bridge the gap between synthetic and real-world RSISR scenarios.
[508] Team Westwood Solution for MIDOG 2025 Challenge: An Ensemble-CNN-Based Approach For Mitosis Detection And Classification
Tengyou Xu, Haochen Yang, Xiang ‘Anthony’ Chen, Hongyan Gu, Mohammad Haeri
Main category: eess.IV
TL;DR: Team Westwood’s solution for the MIDOG 2025 challenge uses nnUNetV2 for initial mitosis candidate screening and random forest classifiers that ensemble multiple CNN models for both the mitosis detection and atypical mitosis classification tasks.
Details
Motivation: To develop an effective solution for mitosis detection and atypical mitosis classification in the MIDOG 2025 challenge, addressing the need for accurate automated analysis of cell division processes in medical imaging.
Method: Two-stage approach: 1) nnUNetV2 for initial high-sensitivity mitosis candidate screening, 2) Random forest classifiers ensembling predictions from multiple CNNs (EfficientNet-b3, EfficientNet-b5, EfficientNetV2-s for detection; EfficientNet-b3, EfficientNet-b5, InceptionV3 for classification).
Result: Preliminary test: F1 score 0.7450 for mitosis detection, balanced accuracy 0.8722 for atypical classification. Final test: F1 score 0.6972 for mitosis detection, balanced accuracy 0.8242 for atypical classification.
Conclusion: The proposed ensemble approach combining nnUNetV2 with random forest classifiers and multiple CNN models demonstrates competitive performance in both mitosis detection and atypical mitosis classification tasks on the MIDOG 2025 challenge datasets.
Abstract: This abstract presents our solution (Team Westwood) for mitosis detection and atypical mitosis classification in the MItosis DOmain Generalization (MIDOG) 2025 challenge. For mitosis detection, we trained an nnUNetV2 for initial mitosis candidate screening with high sensitivity, followed by a random forest classifier ensembling predictions of three convolutional neural networks (CNNs): EfficientNet-b3, EfficientNet-b5, and EfficientNetV2-s. For the atypical mitosis classification, we trained another random forest classifier ensembling the predictions of three CNNs: EfficientNet-b3, EfficientNet-b5, and InceptionV3. On the preliminary test set, our solution achieved an F1 score of 0.7450 for track 1 mitosis detection, and a balanced accuracy of 0.8722 for track 2 atypical mitosis classification. On the final test set, our solution achieved an F1 score of 0.6972 for track 1 mitosis detection, and a balanced accuracy of 0.8242 for track 2 atypical mitosis classification.
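The second-stage ensembling amounts to stacking the CNN class probabilities and fitting a random forest on top. A hedged sklearn sketch follows; the `probs_*` arrays are assumed to be (N, K) softmax outputs collected on the training candidates, and the forest hyperparameters are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_ensemble(probs_cnn_a, probs_cnn_b, probs_cnn_c, labels):
    """Fit a random forest on concatenated per-candidate CNN probabilities
    (illustrative stand-in for the challenge submission's ensembling step)."""
    features = np.hstack([probs_cnn_a, probs_cnn_b, probs_cnn_c])
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    rf.fit(features, labels)
    return rf

# At inference, candidates proposed by the high-sensitivity detector are
# scored the same way: rf.predict(np.hstack([p_a, p_b, p_c]))
```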
[509] Variable Rate Image Compression via N-Gram Context based Swin-transformer
Priyanka Mudgal
Main category: eess.IV
TL;DR: An N-gram context-based Swin Transformer for learned image compression that achieves variable-rate compression with a single model and improves high-resolution reconstruction quality.
Details
Motivation: To overcome the Swin Transformer's limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field, and to improve variable-rate compression performance.
Method: Incorporates N-gram context into the Swin Transformer architecture to expand the regions considered for pixel restoration and increase context awareness across neighboring windows.
Result: Achieves -5.86% improvement in BD-Rate over existing variable-rate learned image compression techniques and improves the quality of regions of interest (ROI) in images.
Conclusion: The proposed method effectively enhances high-resolution image reconstruction quality and is particularly beneficial for object-focused applications in manufacturing and industrial vision systems.
Abstract: This paper presents an N-gram context-based Swin Transformer for learned image compression. Our method achieves variable-rate compression with a single model. By incorporating N-gram context into the Swin Transformer, we overcome its limitation of neglecting larger regions during high-resolution image reconstruction due to its restricted receptive field. This enhancement expands the regions considered for pixel restoration, thereby improving the quality of high-resolution reconstructions. Our method increases context awareness across neighboring windows, leading to a -5.86% improvement in BD-Rate over existing variable-rate learned image compression techniques. Additionally, our model improves the quality of regions of interest (ROI) in images, making it particularly beneficial for object-focused applications in fields such as manufacturing and industrial vision systems.
[510] Untangling Vascular Trees for Surgery and Interventional Radiology
Guillaume Houry, Tom Boeken, Stéphanie Allassonnière, Jean Feydy
Main category: eess.IV
TL;DR: A method for creating 2D planar representations of 3D vascular networks that preserves topology, length, and curvature for catheter navigation assistance.
Details
Motivation: The diffusion of minimally invasive endovascular interventions requires visualization methods for complex vascular networks to assist in catheter navigation.
Method: Optimized morphological filters and a new recursive embedding algorithm that preserves global orientation, producing 2D maps from 3D digital angiography within seconds.
Result: Successfully applied to peroperative images of brain, pelvic and knee artery networks, simplifying device choice and reducing navigation risks.
Conclusion: The method enables large population studies on the branching patterns and tortuosity of blood vessels, with the code released as open source in the scikit-shapes library.
Abstract: The diffusion of minimally invasive, endovascular interventions motivates the development of visualization methods for complex vascular networks. We propose a planar representation of blood vessel trees which preserves the properties that are most relevant to catheter navigation: topology, length and curvature. Taking as input a three-dimensional digital angiography, our algorithm produces a faithful two-dimensional map of the patient’s vessels within a few seconds. To this end, we propose optimized implementations of standard morphological filters and a new recursive embedding algorithm that preserves the global orientation of the vascular network. We showcase our method on peroperative images of the brain, pelvic and knee artery networks. On the clinical side, our method simplifies the choice of devices prior to and during the intervention. This lowers the risk of failure during navigation or device deployment and may help to reduce the gap between expert and common intervention centers. From a research perspective, our method simulates the cadaveric display of artery trees from anatomical dissections. This opens the door to large population studies on the branching patterns and tortuosity of fine human blood vessels. Our code is released under the permissive MIT license as part of the scikit-shapes Python library (https://scikit-shapes.github.io ).
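To illustrate the flavor of such a recursive planar embedding, here is a simplified numpy sketch that unrolls each 3D centerline into the plane while preserving segment lengths and turning angles, then fans child branches out around the parent's final heading. It is a toy version under those assumptions, not the paper's algorithm, which additionally preserves the global orientation of the network.

```python
import numpy as np

def embed_branch(points3d, start2d, heading):
    """Unroll one 3D centerline into 2D, keeping segment lengths and the
    unsigned turning angle between consecutive segments (sign is a free choice)."""
    pts2d = [np.asarray(start2d, dtype=float)]
    prev = None
    for a, b in zip(points3d[:-1], points3d[1:]):
        seg = np.asarray(b, dtype=float) - np.asarray(a, dtype=float)
        if prev is not None:
            cosang = np.clip(np.dot(prev, seg) /
                             (np.linalg.norm(prev) * np.linalg.norm(seg)), -1, 1)
            heading += np.arccos(cosang)
        step = np.linalg.norm(seg) * np.array([np.cos(heading), np.sin(heading)])
        pts2d.append(pts2d[-1] + step)
        prev = seg
    return np.stack(pts2d), heading

def embed_tree(tree, node, start2d=(0.0, 0.0), heading=np.pi / 2, spread=0.6):
    """Recursively lay out a vessel tree in the plane.
    `tree` maps a branch id to (points3d, list_of_child_ids)."""
    points3d, children = tree[node]
    layout = {}
    layout[node], end_heading = embed_branch(points3d, start2d, heading)
    for k, child in enumerate(children):
        # Fan the children symmetrically around the parent's final heading.
        offset = spread * (k - (len(children) - 1) / 2)
        layout.update(embed_tree(tree, child, layout[node][-1],
                                 end_heading + offset, spread))
    return layout
```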