Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 165]
cs.CV [Total: 140]
cs.AI [Total: 47]
cs.SD [Total: 9]
cs.LG [Total: 179]
cs.MA [Total: 3]
cs.MM [Total: 0]
eess.AS [Total: 7]
eess.IV [Total: 11]

cs.CL

[1] OpenStaxQA: A multilingual dataset based on open-source college textbooks

Pranav Gupta

Main category: cs.CL

TL;DR: OpenStaxQA is a college-level educational evaluation benchmark using 43 open-source textbooks in multiple languages, with LLM fine-tuning and evaluation using QLoRa.

Details

Motivation: To create a specialized evaluation benchmark for college-level educational applications using open-source textbooks and assess LLM performance on educational tasks.

Method: Used 43 open-source college textbooks in English, Spanish, and Polish; fine-tuned ~7B parameter LLMs with quantized low rank adapters (QLoRa); performed zero-shot evaluation on AI2 reasoning challenge.

Result: Developed the OpenStaxQA benchmark and evaluated LLM performance, with additional testing on other reasoning tasks to assess transfer learning capabilities.

Conclusion: OpenStaxQA provides a valuable educational evaluation benchmark and the paper discusses broader impacts of such datasets for educational AI applications.

Abstract: We present OpenStaxQA, an evaluation benchmark specific to college-level educational applications based on 43 open-source college textbooks in English, Spanish, and Polish, available under a permissive Creative Commons license. We finetune and evaluate large language models (LLMs) with approximately 7 billion parameters on this dataset using quantized low rank adapters (QLoRa). Additionally we also perform a zero-shot evaluation on the AI2 reasoning challenge dev dataset in order to check if OpenStaxQA can lead to an improved performance on other tasks. We also discuss broader impacts relevant to datasets such as OpenStaxQA.

[2] Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets

Jiqun Pan, Zhenke Duan, Jiani Tu, Anzhi Cheng, Yanqing Wang

Main category: cs.CL

TL;DR: KG-MASD is a method that distills multi-agent reasoning capabilities into lightweight models using knowledge graphs for verification, improving industrial QA system reliability and accuracy.

Details

Motivation: Industrial QA systems need higher safety than general dialogue models, but current multi-agent LLMs have uncontrolled iterations and unverifiable outputs, while conventional distillation fails to transfer collaborative reasoning to deployable models.

Method: Formulates distillation as Markov Decision Process, incorporates knowledge graph as verifiable structured prior to enrich state representation and ensure convergence, integrates collaborative reasoning with knowledge grounding to generate high-confidence instruction-tuning data.

Result: Improves accuracy by 2.4% to 20.1% over baselines on industrial QA dataset, significantly enhances reliability for safety-critical industrial scenarios.

Conclusion: KG-MASD enables trustworthy AI deployment in safety-critical industrial scenarios by jointly distilling reasoning depth and verifiability into compact student models suitable for edge deployment.

Abstract: Industrial question-answering (QA) systems require higher safety and reliability than general-purpose dialogue models, as errors in high-risk scenarios such as equipment fault diagnosis can have severe consequences. Although multi-agent large language models enhance reasoning depth, they suffer from uncontrolled iterations and unverifiable outputs, and conventional distillation methods struggle to transfer collaborative reasoning capabilities to lightweight, deployable student models. To address these challenges, we propose Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD). Our approach formulates distillation as a Markov Decision Process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and ensure convergence. By integrating collaborative reasoning with knowledge grounding, KG-MASD generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability into compact student models suitable for edge deployment. Experiments on an industrial QA dataset show that KG-MASD improves accuracy by 2.4 per cent to 20.1 per cent over baselines and significantly enhances reliability, enabling trustworthy AI deployment in safety-critical industrial scenarios. Code and data are available at https://github.com/erwinmsmith/KG-MAD/.

[3] Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

Subin An, Yugyeong Ji, Junyoung Kim, Heejin Kook, Yang Lu, Josh Seltzer

Main category: cs.CL

TL;DR: Proposes a two-stage framework for evaluating human-written survey responses, addressing limitations of existing methods designed for LLM-generated text.

Details

Motivation: Low-quality open-ended survey responses burden researchers and risk misleading conclusions, while existing automatic evaluation methods inadequately assess human-written responses with their distinct characteristics.

Method: Two-stage framework: 1) gibberish filtering to remove nonsensical responses, 2) evaluation across three dimensions (effort, relevance, completeness) using LLM capabilities, grounded in empirical analysis of real-world survey data.

Result: Validation on English and Korean datasets shows the framework outperforms existing metrics, demonstrates high practical applicability for response quality prediction and rejection, and shows strong correlations with expert assessment.

Conclusion: The proposed framework effectively addresses the unique characteristics of human survey responses and provides reliable automatic evaluation for real-world applications.

Abstract: Open-ended survey responses provide valuable insights in marketing research, but low-quality responses not only burden researchers with manual filtering but also risk leading to misleading conclusions, underscoring the need for effective evaluation. Existing automatic evaluation methods target LLM-generated text and inadequately assess human-written responses with their distinct characteristics. To address such characteristics, we propose a two-stage evaluation framework specifically designed for human survey responses. First, gibberish filtering removes nonsensical responses. Then, three dimensions-effort, relevance, and completeness-are evaluated using LLM capabilities, grounded in empirical analysis of real-world survey data. Validation on English and Korean datasets shows that our framework not only outperforms existing metrics but also demonstrates high practical applicability for real-world applications such as response quality prediction and response rejection, showing strong correlations with expert assessment.

[4] CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

Qihua Dong, Luis Figueroa, Handong Zhao, Kushal Kafle, Jason Kuen, Zhihong Ding, Scott Cohen, Yun Fu

Main category: cs.CL

TL;DR: Proposes CoT Referring, a chain-of-thought approach that enhances multimodal reasoning for referring expression comprehension and segmentation, achieving 2.5%+ improvement over baselines.

Details

Motivation: Referring Expression Comprehension and Segmentation are critical benchmarks for assessing Multimodal Large Language Models' integration of language understanding and image comprehension.

Method: Restructures training data with chain-of-thought format, parses textual structures to sequential referring steps, integrates detection and segmentation in unified MLLM framework with adaptive weighted loss.

Result: Experimental results show 2.5%+ improvement over baseline models on curated benchmark and RefCOCO/+/g datasets.

Conclusion: CoT Referring effectively enhances multimodal reasoning through structured chain-of-thought training, improving performance in complex referring scenarios.

Abstract: Referring Expression Comprehension and Segmentation are critical tasks for assessing the integration of language understanding and image comprehension, serving as benchmarks for Multimodal Large Language Models (MLLMs) capabilities. To address these challenges, we propose a new strategy, CoT Referring, which enhances model reasoning across modalities through a structured, chain-of-thought training data structure. Our approach systematically parses textual structures to a sequential referring step, where in each step it identifies relationships and ensures consistent reference alignment, thereby improving accuracy in complex query scenarios. We restructure the training data to enforce a new output form, providing new annotations for existing datasets and compiling an evaluation benchmark from existing resources. This benchmark is designed explicitly for complex referring cases. We also integrate detection and segmentation capabilities into a unified MLLM framework, training it with a novel adaptive weighted loss to optimize performance. Experimental results on our curated benchmark and RefCOCO/+/g demonstrate the effectiveness of our approach, with a notable increase of 2.5%+ over baseline models.

[5] Evaluating Embedding Frameworks for Scientific Domain

Nouman Ahmed, Ronin Wu, Victor Botev

Main category: cs.CL

TL;DR: This paper focuses on finding optimal word representation and tokenization methods for scientific domain NLP, building a comprehensive evaluation suite to test various algorithms.

Details

Motivation: Domain-specific data requires optimal word representations as words can have different meanings across domains. While transformer models generate good contextual embeddings, they are computationally expensive to pre-train from scratch.

Method: Built an evaluation suite with multiple downstream tasks and relevant datasets for scientific domain NLP. Tested various word representation and tokenization algorithms using this evaluation framework.

Result: Developed a comprehensive evaluation suite for scientific domain word representations and tested multiple algorithms, though specific performance results are not detailed in the abstract.

Conclusion: The research provides both optimal word representation/tokenization methods for scientific NLP tasks and a reusable evaluation framework for future algorithm comparisons in the scientific domain.

Abstract: Finding an optimal word representation algorithm is particularly important in terms of domain specific data, as the same word can have different meanings and hence, different representations depending on the domain and context. While Generative AI and transformer architecture does a great job at generating contextualized embeddings for any given work, they are quite time and compute extensive, especially if we were to pre-train such a model from scratch. In this work, we focus on the scientific domain and finding the optimal word representation algorithm along with the tokenization method that could be used to represent words in the scientific domain. The goal of this research is two fold: 1) finding the optimal word representation and tokenization methods that can be used in downstream scientific domain NLP tasks, and 2) building a comprehensive evaluation suite that could be used to evaluate various word representation and tokenization algorithms (even as new ones are introduced) in the scientific domain. To this end, we build an evaluation suite consisting of several downstream tasks and relevant datasets for each task. Furthermore, we use the constructed evaluation suite to test various word representation and tokenization algorithms.

[6] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B

Toshiki Nakai, Ravi Kiran Chikkala, Lena Sophie Oberkircher, Nicholas Jennings, Natalia Skachkova, Tatiana Anikina, Jesujoba Oluwadara Alabi

Main category: cs.CL

TL;DR: TRepLiNa method combines CKA and REPINA to improve low-resource language translation by enforcing cross-lingual similarity in mid-level layers of multilingual LLMs.

Details

Motivation: Address the linguistic gap for India's diverse low-resource languages by improving translation quality from LRL to high-resource languages.

Method: Combine Centered Kernel Alignment (CKA) with REPINA regularization to enforce cross-lingual similarity in specific internal layers of decoder-only multilingual LLMs (Aya-23 8B with QLoRA), tested in zero-shot, few-shot, and fine-tuning settings.

Result: Aligning mid-level layers using TRepLiNa (CKA+REPINA) improves LRL translation quality, especially in data-scarce settings, as a low-cost practical approach.

Conclusion: TRepLiNa provides an effective method for enhancing low-resource language translation through targeted cross-lingual alignment in multilingual models.

Abstract: The 2025 Multimodal Models for Low-Resource Contexts and Social Impact (MMLoSo) Language Challenge addresses one of India’s most pressing linguistic gaps: the lack of resources for its diverse low-resource languages (LRLs). In this study, we investigate whether enforcing cross-lingual similarity in specific internal layers of a decoder-only multilingual large language model (LLM) can improve translation quality from LRL to high-resource language (HRL). Specifically, we combine Centered Kernel Alignment (CKA), a similarity metric that encourages representations of different languages to align, with REPINA, a regularization method that constrains parameter updates to remain close to the pretrained model, into a joint method we call TRepLiNa. In this research project, we experiment with zero-shot, few-shot, and fine-tuning settings using Aya-23 8B with QLoRA across MMLoSo shared task language pairs (Mundari, Santali, Bhili) with Hindi/English pivots. Our results show that aligning mid-level layers using TRepLiNa (CKA+REPINA) is a low-cost, practical approach to improving LRL translation, especially in data-scarce settings.

[7] Scalable multilingual PII annotation for responsible AI in LLMs

Bharti Meena, Joanna Skubisz, Harshit Rajgarhia, Nand Dave, Kiran Ganesh, Shivali Dalmia, Abhishek Mukherji, Vasudevan Sundarababu, Olga Pospelova

Main category: cs.CL

TL;DR: A scalable multilingual data curation framework for high-quality PII annotation across 13 underrepresented locales, improving recall and reducing false positives through human-in-the-loop methodology and analytics-driven pipelines.

Details

Motivation: As LLMs gain wider adoption, ensuring reliable handling of Personally Identifiable Information (PII) across diverse regulatory contexts has become essential.

Method: Phased, human-in-the-loop annotation methodology combining linguistic expertise with rigorous quality assurance, using inter-annotator agreement metrics and root-cause analysis to resolve annotation inconsistencies.

Result: Substantial improvements in recall and false positive rates from pilot, training, and production phases, creating high-fidelity datasets suitable for supervised LLM fine-tuning.

Conclusion: Iterative, analytics-driven pipelines can enhance both annotation quality and downstream model reliability for multilingual PII labeling.

Abstract: As Large Language Models (LLMs) gain wider adoption, ensuring their reliable handling of Personally Identifiable Information (PII) across diverse regulatory contexts has become essential. This work introduces a scalable multilingual data curation framework designed for high-quality PII annotation across 13 underrepresented locales, covering approximately 336 locale-specific PII types. Our phased, human-in-the-loop annotation methodology combines linguistic expertise with rigorous quality assurance, leading to substantial improvements in recall and false positive rates from pilot, training, and production phases. By leveraging inter-annotator agreement metrics and root-cause analysis, the framework systematically uncovers and resolves annotation inconsistencies, resulting in high-fidelity datasets suitable for supervised LLM fine-tuning. Beyond reporting empirical gains, we highlight common annotator challenges in multilingual PII labeling and demonstrate how iterative, analytics-driven pipelines can enhance both annotation quality and downstream model reliability.

[8] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

Haotian Wu, Shufan Jiang, Chios Chen, Yiyang Feng, Hehai Lin, Heqing Zou, Yao Shu, Yanran Li, Chengwei Qin

Main category: cs.CL

TL;DR: FURINA-Builder is a multi-agent pipeline that automatically creates customizable role-playing benchmarks, addressing limitations of existing benchmarks. It enables evaluation of arbitrary characters across diverse scenarios and formats.

Details

Motivation: Existing role-playing benchmarks are becoming obsolete due to narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios.

Method: FURINA-Builder uses a multi-agent collaboration pipeline that simulates dialogues between test characters and other characters from a character-scene pool. An LLM judge selects evaluation dimensions and adjusts test character responses into final test utterances.

Result: The pipeline built FURINA-Bench with comprehensive evaluation. o3 and DeepSeek-R1 achieved best performance on English and Chinese RP tasks respectively. Established characters consistently outperformed synthesized ones. Reasoning capabilities amplified performance disparities. Model scale didn’t monotonically reduce hallucinations. Reasoning LLMs showed a trade-off: improved RP performance but increased hallucinations.

Conclusion: FURINA-Builder effectively addresses benchmark limitations and FURINA-Bench poses significant challenges, revealing a Pareto frontier between RP performance and reliability across all LLMs, particularly highlighting the novel trade-off for reasoning models.

Abstract: As large language models (LLMs) advance in role-playing (RP) tasks, existing benchmarks quickly become obsolete due to their narrow scope, outdated interaction paradigms, and limited adaptability across diverse application scenarios. To address this gap, we introduce FURINA-Builder, a novel multi-agent collaboration pipeline that automatically constructs fully customizable RP benchmarks at any scale. It enables evaluation of arbitrary characters across diverse scenarios and prompt formats, as the first benchmark builder in RP area for adaptable assessment. FURINA-Builder simulates dialogues between a test character and other characters drawn from a well-constructed character-scene pool, while an LLM judge selects fine-grained evaluation dimensions and adjusts the test character’s responses into final test utterances. Using this pipeline, we build FURINA-Bench, a new comprehensive role-playing benchmark featuring both established and synthesized test characters, each assessed with dimension-specific evaluation criteria. Human evaluation and preliminary separability analysis justify our pipeline and benchmark design. We conduct extensive evaluations of cutting-edge LLMs and find that o3 and DeepSeek-R1 achieve the best performance on English and Chinese RP tasks, respectively. Across all models, established characters consistently outperform synthesized ones, with reasoning capabilities further amplifying this disparity. Interestingly, we observe that model scale does not monotonically reduce hallucinations. More critically, for reasoning LLMs, we uncover a novel trade-off: reasoning improves RP performance but simultaneously increases RP hallucinations. This trade-off extends to a broader Pareto frontier between RP performance and reliability for all LLMs. These findings demonstrate the effectiveness of FURINA-Builder and the challenge posed by FURINA-Bench.

[9] Prakriti200: A Questionnaire-Based Dataset of 200 Ayurvedic Prakriti Assessments

Aryan Kumar Singh, Janvi Singh

Main category: cs.CL

TL;DR: A bilingual English-Hindi Prakriti Assessment Questionnaire dataset for evaluating Ayurvedic physical, physiological, and psychological characteristics using 24 multiple-choice items.

Details

Motivation: To provide a structured platform for computational intelligence, Ayurvedic studies, and personalized health analytics research based on classical Ayurvedic principles.

Method: Developed a standardized 24-item questionnaire following AYUSH/CCRAS guidelines, collected data via Google Forms with hidden dosha labels and automated scoring to map individual traits to dosha-specific scores.

Result: Created a comprehensive dataset enabling analysis of trait distributions, correlations, and predictive modeling for Prakriti-based studies.

Conclusion: The dataset serves as a valuable reference for future Ayurvedic research and development of intelligent health applications supporting personalized health analytics.

Abstract: This dataset provides responses to a standardized, bilingual (English-Hindi) Prakriti Assessment Questionnaire designed to evaluate the physical, physiological, and psychological characteristics of individuals according to classical Ayurvedic principles. The questionnaire consists of 24 multiple-choice items covering body features, appetite, sleep patterns, energy levels, and temperament. It was developed following AYUSH/CCRAS guidelines to ensure comprehensive and accurate data collection. All questions are mandatory and neutrally phrased to minimize bias, and dosha labels (Vata, Pitta, Kapha) are hidden from participants. Data were collected via a Google Forms deployment, enabling automated scoring of responses to map individual traits to dosha-specific scores. The resulting dataset provides a structured platform for research in computational intelligence, Ayurvedic studies, and personalized health analytics, supporting analysis of trait distributions, correlations, and predictive modeling. It can also serve as a reference for future Prakriti-based studies and the development of intelligent health applications.

[10] Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks

Qinhao Zhou, Xiang Xiang, Kun He, John E. Hopcroft

Main category: cs.CL

TL;DR: A novel prompt optimization method for machine translation that uses small-parameter models with back-translation training, focusing on optimizing the input component rather than instruction.

Details

Motivation: Existing prompt engineering methods focus on optimizing instructions for general tasks using large LLMs, but are limited for machine translation where the input component is more critical.

Method: Uses small-parameter models trained with back-translation strategy to optimize prompts, specifically targeting the input component for machine translation tasks.

Result: Significantly reduces training overhead for single-task optimization while delivering highly effective performance.

Conclusion: The method effectively addresses limitations of existing approaches for machine translation and can be adapted for other downstream tasks.

Abstract: In recent years, the growing interest in Large Language Models (LLMs) has significantly advanced prompt engineering, transitioning from manual design to model-based optimization. Prompts for LLMs generally comprise two components: the \textit{instruction}, which defines the task or objective, and the \textit{input}, which is tailored to the instruction type. In natural language generation (NLG) tasks such as machine translation, the \textit{input} component is particularly critical, while the \textit{instruction} component tends to be concise. Existing prompt engineering methods primarily focus on optimizing the \textit{instruction} component for general tasks, often requiring large-parameter LLMs as auxiliary tools. However, these approaches exhibit limited applicability for tasks like machine translation, where the \textit{input} component plays a more pivotal role. To address this limitation, this paper introduces a novel prompt optimization method specifically designed for machine translation tasks. The proposed approach employs a small-parameter model trained using a back-translation-based strategy, significantly reducing training overhead for single-task optimization while delivering highly effective performance. With certain adaptations, this method can also be extended to other downstream tasks.

[11] Dual-stage and Lightweight Patient Chart Summarization for Emergency Physicians

Jiajun Wu, Swaleh Zaidi, Braden Teitge, Henry Leung, Jiayu Zhou, Jessalyn Holodinsky, Steve Drew

Main category: cs.CL

TL;DR: A two-stage offline EHR summarization system using dual Jetson Nano devices for privacy-preserving clinical data processing, achieving useful summaries in under 30 seconds.

Details

Motivation: Emergency physicians are overwhelmed by extensive unstructured EHR data and need quick access to critical information while maintaining patient privacy.

Method: Dual-device architecture: Jetson Nano-R retrieves relevant EHR sections, Jetson Nano-S generates structured summaries using small language models, with LLM-as-Judge evaluation for quality assessment.

Result: System effectively produces useful summaries in under 30 seconds on MIMIC-IV and real EHR data, with good factual accuracy, completeness, and clarity.

Conclusion: The offline dual-device system successfully enables privacy-preserving clinical summarization that can assist emergency physicians in critical information extraction.

Abstract: Electronic health records (EHRs) contain extensive unstructured clinical data that can overwhelm emergency physicians trying to identify critical information. We present a two-stage summarization system that runs entirely on embedded devices, enabling offline clinical summarization while preserving patient privacy. In our approach, a dual-device architecture first retrieves relevant patient record sections using the Jetson Nano-R (Retrieve), then generates a structured summary on another Jetson Nano-S (Summarize), communicating via a lightweight socket link. The summarization output is two-fold: (1) a fixed-format list of critical findings, and (2) a context-specific narrative focused on the clinician’s query. The retrieval stage uses locally stored EHRs, splits long notes into semantically coherent sections, and searches for the most relevant sections per query. The generation stage uses a locally hosted small language model (SLM) to produce the summary from the retrieved text, operating within the constraints of two NVIDIA Jetson devices. We first benchmarked six open-source SLMs under 7B parameters to identify viable models. We incorporated an LLM-as-Judge evaluation mechanism to assess summary quality in terms of factual accuracy, completeness, and clarity. Preliminary results on MIMIC-IV and de-identified real EHRs demonstrate that our fully offline system can effectively produce useful summaries in under 30 seconds.

[12] SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Cheng-Han Chiang, Xiaofei Wang, Linjie Li, Chung-Ching Lin, Kevin Lin, Shujie Liu, Zhendong Wang, Zhengyuan Yang, Hung-yi Lee, Lijuan Wang

Main category: cs.CL

TL;DR: SHANKS is a framework that enables spoken language models to generate unspoken reasoning while listening to user input, allowing real-time interruption and tool calls during speech interaction.

Details

Motivation: Current LLMs/SLMs only think after users finish speaking, causing high latency that's unsuitable for speech-to-speech interaction where real-time exchange is crucial.

Method: Streams input speech in chunks, generates unspoken chain-of-thought reasoning after each chunk while user continues speaking, uses reasoning to decide interruptions and tool calls.

Result: 37.1% higher interruption accuracy in math problem scenarios and 56.9% of tool calls completed before user finishes turn in tool-augmented dialogues.

Conclusion: SHANKS enables models to think throughout conversations rather than only after turns end, improving real-time interaction capabilities.

Abstract: Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting during the user’s turn and can lead to high response latency while it waits to think. Consequently, thinking after receiving the full input is not suitable for speech-to-speech interaction, where real-time, low-latency exchange is important. We address this by noting that humans naturally “think while listening.” In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to the user input. SHANKS streams the input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses this unspoken reasoning to decide whether to interrupt the user and to make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user-SLM interaction in two scenarios: (1) when the user is presenting a step-by-step solution to a math problem, SHANKS can listen, reason, and interrupt when the user makes a mistake, achieving 37.1% higher interruption accuracy than a baseline that interrupts without thinking; and (2) in a tool-augmented dialogue, SHANKS can complete 56.9% of the tool calls before the user finishes their turn. Overall, SHANKS moves toward models that keep thinking throughout the conversation, not only after a turn ends. Animated illustrations of Shanks can be found at https://d223302.github.io/SHANKS/

[13] Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

Vaibhav Srivastav, Steven Zheng, Eric Bezzam, Eustache Le Bihan, Nithin Koluguri, Piotr Żelasko, Somshubra Majumdar, Adel Moumen, Sanchit Gandhi

Main category: cs.CL

TL;DR: The Open ASR Leaderboard is a reproducible benchmark comparing 60+ ASR systems across 11 datasets with standardized evaluation metrics including WER and efficiency (RTFx), revealing trade-offs between accuracy and speed across different decoder architectures.

Details

Motivation: Current ASR evaluation lacks standardization, focuses mainly on short-form English, and rarely reports efficiency metrics, making fair comparisons difficult.

Method: Created a fully reproducible benchmark with standardized text normalization, evaluating 60+ open-source and proprietary systems across 11 datasets including multilingual and long-form tracks, reporting both WER and inverse real-time factor.

Result: Conformer encoders with LLM decoders achieve best WER but are slower; CTC and TDT decoders offer better efficiency for long-form/offline use; Whisper-derived encoders improve English accuracy but reduce multilingual coverage.

Conclusion: The benchmark enables transparent ASR evaluation with both accuracy and efficiency metrics, revealing important trade-offs between different architectures for various use cases.

Abstract: Despite rapid progress, ASR evaluation remains saturated with short-form English, and efficiency is rarely reported. We present the Open ASR Leaderboard, a fully reproducible benchmark and interactive leaderboard comparing 60+ open-source and proprietary systems across 11 datasets, including dedicated multilingual and long-form tracks. We standardize text normalization and report both word error rate (WER) and inverse real-time factor (RTFx), enabling fair accuracy-efficiency comparisons. For English transcription, Conformer encoders paired with LLM decoders achieve the best average WER but are slower, while CTC and TDT decoders deliver much better RTFx, making them attractive for long-form and offline use. Whisper-derived encoders fine-tuned for English improve accuracy but often trade off multilingual coverage. All code and dataset loaders are open-sourced to support transparent, extensible evaluation.

[14] A Comprehensive Survey of Hallucination in Large Language Models: Causes, Detection, and Mitigation

Aisha Alansari, Hamzah Luqman

Main category: cs.CL

TL;DR: This survey comprehensively reviews hallucination in large language models (LLMs), covering causes, detection methods, and mitigation strategies to address the problem of LLMs generating fluent but factually inaccurate information.

Details

Motivation: LLMs often produce false or fabricated information (hallucinations) despite their fluency, which undermines their reliability and trustworthiness, especially in domains requiring factual accuracy.

Method: The survey presents taxonomies of hallucination types, analyzes root causes across the LLM development lifecycle, examines detection approaches and mitigation strategies, and reviews evaluation benchmarks and metrics.

Result: The paper provides a structured framework for understanding, detecting, and mitigating hallucinations in LLMs, analyzing the strengths and limitations of current approaches.

Conclusion: The survey outlines key open challenges and promising research directions to develop more truthful and trustworthy LLMs, establishing a foundation for future work in this critical area.

Abstract: Large language models (LLMs) have transformed natural language processing, achieving remarkable performance across diverse tasks. However, their impressive fluency often comes at the cost of producing false or fabricated information, a phenomenon known as hallucination. Hallucination refers to the generation of content by an LLM that is fluent and syntactically correct but factually inaccurate or unsupported by external evidence. Hallucinations undermine the reliability and trustworthiness of LLMs, especially in domains requiring factual accuracy. This survey provides a comprehensive review of research on hallucination in LLMs, with a focus on causes, detection, and mitigation. We first present a taxonomy of hallucination types and analyze their root causes across the entire LLM development lifecycle, from data collection and architecture design to inference. We further examine how hallucinations emerge in key natural language generation tasks. Building on this foundation, we introduce a structured taxonomy of detection approaches and another taxonomy of mitigation strategies. We also analyze the strengths and limitations of current detection and mitigation approaches and review existing evaluation benchmarks and metrics used to quantify LLMs hallucinations. Finally, we outline key open challenges and promising directions for future research, providing a foundation for the development of more truthful and trustworthy LLMs.

[15] Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

Zhu Li, Yuqing Zhang, Xiyuan Gao, Shekhar Nayak, Matt Coler

Main category: cs.CL

TL;DR: Proposes an LLM-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis, combining semantic embeddings from fine-tuned LLaMA 3 with prosodic exemplars via RAG, integrated in VITS backbone.

Details

Motivation: Sarcasm poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues, and remains largely unexplored compared to broad emotional categories.

Method: Combines semantic embeddings from LoRA-fine-tuned LLaMA 3 (capturing pragmatic incongruity and discourse-level cues) with prosodic exemplars retrieved via RAG module, integrated within VITS backbone for dual conditioning.

Result: Outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.

Conclusion: The proposed framework enables more natural and contextually appropriate sarcastic speech synthesis by effectively capturing both semantic and prosodic aspects of sarcasm.

Abstract: Sarcasm is a subtle form of non-literal language that poses significant challenges for speech synthesis due to its reliance on nuanced semantic, contextual, and prosodic cues. While existing speech synthesis research has focused primarily on broad emotional categories, sarcasm remains largely unexplored. In this paper, we propose a Large Language Model (LLM)-enhanced Retrieval-Augmented framework for sarcasm-aware speech synthesis. Our approach combines (1) semantic embeddings from a LoRA-fine-tuned LLaMA 3, which capture pragmatic incongruity and discourse-level cues of sarcasm, and (2) prosodic exemplars retrieved via a Retrieval Augmented Generation (RAG) module, which provide expressive reference patterns of sarcastic delivery. Integrated within a VITS backbone, this dual conditioning enables more natural and contextually appropriate sarcastic speech. Experiments demonstrate that our method outperforms baselines in both objective measures and subjective evaluations, yielding improvements in speech naturalness, sarcastic expressivity, and downstream sarcasm detection.

[16] Language models for longitudinal analysis of abusive content in Billboard Music Charts

Rohitash Chandra, Yathin Suresh, Divyansh Raj Sinha, Sanchit Jindal

Main category: cs.CL

TL;DR: Deep learning analysis of Billboard music charts over 70 years shows significant increase in explicit content (profanity, sexual references) in popular music since 1990.

Details

Motivation: Lack of validated studies on increasing abusive/sexually explicit content in music despite evidence of harmful effects on children and youth, requiring data-driven policy development.

Method: Used deep learning and language models to analyze lyrics from Billboard Charts over 7 decades, employing sentiment analysis and abuse detection including sexually explicit content detection.

Result: Significant rise in explicit content from 1990 onwards, with increasing prevalence of profane, sexually explicit, and inappropriate language in popular music lyrics.

Conclusion: Language models effectively capture nuanced patterns in lyrical content evolution, reflecting societal norm shifts over time, confirming the trend of increasing explicit content in popular music.

Abstract: There is no doubt that there has been a drastic increase in abusive and sexually explicit content in music, particularly in Billboard Music Charts. However, there is a lack of studies that validate the trend for effective policy development, as such content has harmful behavioural changes in children and youths. In this study, we utilise deep learning methods to analyse songs (lyrics) from Billboard Charts of the United States in the last seven decades. We provide a longitudinal study using deep learning and language models and review the evolution of content using sentiment analysis and abuse detection, including sexually explicit content. Our results show a significant rise in explicit content in popular music from 1990 onwards. Furthermore, we find an increasing prevalence of songs with lyrics containing profane, sexually explicit, and otherwise inappropriate language. The longitudinal analysis of the ability of language models to capture nuanced patterns in lyrical content, reflecting shifts in societal norms and language use over time.

[17] Reproducibility Study of “XRec: Large Language Models for Explainable Recommendation”

Ranjan Mishra, Julian I. Bibo, Quinten van Engelen, Henk Schaapman

Main category: cs.CL

TL;DR: This paper reproduces XRec framework using Llama 3 instead of GPT-3.5-turbo, finding it generates personalized explanations effectively but doesn’t consistently outperform all baselines. Modifications to Mixture of Experts embeddings reveal their importance in explanation structures.

Details

Motivation: To replicate the results of the original XRec paper using a different LLM (Llama 3) and extend the analysis by modifying the Mixture of Experts module embeddings to understand their role in explanation generation.

Method: Built on the original XRec source code, using Llama 3 for evaluation instead of GPT-3.5-turbo. Extended analysis by modifying input embeddings and deleting output embeddings of XRec’s Mixture of Experts module.

Result: XRec effectively generates personalized explanations and its stability improves with collaborative information. However, it did not consistently outperform all baseline models in every metric. Modifications to Mixture of Experts embeddings significantly impact explanation structures.

Conclusion: The Mixture of Experts embeddings play a crucial role in shaping explanation structures, demonstrating how collaborative signals interact with language modeling. The study provides an open-source implementation to enhance accessibility for researchers.

Abstract: In this study, we reproduced the work done in the paper “XRec: Large Language Models for Explainable Recommendation” by Ma et al. (2024). The original authors introduced XRec, a model-agnostic collaborative instruction-tuning framework that enables large language models (LLMs) to provide users with comprehensive explanations of generated recommendations. Our objective was to replicate the results of the original paper, albeit using Llama 3 as the LLM for evaluation instead of GPT-3.5-turbo. We built on the source code provided by Ma et al. (2024) to achieve our goal. Our work extends the original paper by modifying the input embeddings or deleting the output embeddings of XRec’s Mixture of Experts module. Based on our results, XRec effectively generates personalized explanations and its stability is improved by incorporating collaborative information. However, XRec did not consistently outperform all baseline models in every metric. Our extended analysis further highlights the importance of the Mixture of Experts embeddings in shaping the explanation structures, showcasing how collaborative signals interact with language modeling. Through our work, we provide an open-source evaluation implementation that enhances accessibility for researchers and practitioners alike. Our complete code repository can be found at https://github.com/julianbibo/xrec-reproducibility.

[18] Type and Complexity Signals in Multilingual Question Representations

Robin Kokot, Wessel Poelman

Main category: cs.CL

TL;DR: This paper investigates how multilingual transformer models represent morphosyntactic properties of questions using a new QTC dataset across 7 languages, comparing neural probes against statistical baselines.

Details

Motivation: To understand how multilingual transformer models represent question morphosyntax and evaluate when contextual representations outperform statistical baselines.

Method: Created QTC dataset with 7 languages annotated with question type and complexity metrics; used layer-wise probing on frozen Glot500-m representations with regression labels and selectivity controls; compared against subword TF-IDF baselines and fine-tuned models.

Result: Statistical features classify questions effectively in languages with explicit marking, while neural probes capture fine-grained structural complexity patterns better.

Conclusion: Contextual representations outperform statistical baselines for capturing structural complexity, and parameter updates may reduce availability of pre-trained linguistic information.

Abstract: This work investigates how a multilingual transformer model represents morphosyntactic properties of questions. We introduce the Question Type and Complexity (QTC) dataset with sentences across seven languages, annotated with type information and complexity metrics including dependency length, tree depth, and lexical density. Our evaluation extends probing methods to regression labels with selectivity controls to quantify gains in generalizability. We compare layer-wise probes on frozen Glot500-m (Imani et al., 2023) representations against subword TF-IDF baselines, and a fine-tuned model. Results show that statistical features classify questions effectively in languages with explicit marking, while neural probes capture fine-grained structural complexity patterns better. We use these results to evaluate when contextual representations outperform statistical baselines and whether parameter updates reduce the availability of pre-trained linguistic information.

[19] LLM Bias Detection and Mitigation through the Lens of Desired Distributions

Ingroj Shrestha, Padmini Srinivasan

Main category: cs.CL

TL;DR: The paper proposes a weighted adaptive loss fine-tuning method to align LLM’s gender-profession outputs with desired distributions (either equal or real-world), achieving significant bias reduction while preserving language modeling capability.

Details

Motivation: Prior bias mitigation work focused on social equality, but less attention was given to aligning LLMs with desired distributions like real-world data for factual grounding. The paper defines bias as deviation from desired distributions.

Method: A weighted adaptive loss based fine-tuning method that aligns LLM’s gender-profession output distribution with desired distributions. Uses 3 profession sets (male/female-dominated, gender-balanced) from U.S. labor statistics.

Result: Masked language models showed bias under both distributions. Achieved near-complete mitigation under equality and 30-75% reduction under real-world settings. Autoregressive LLMs showed no bias under equality but notable bias under real-world settings, with Llama Instruct models achieving 50-62% reduction.

Conclusion: The proposed method effectively reduces bias in LLM outputs by aligning them with desired distributions while maintaining language modeling performance, demonstrating applicability for both equality and real-world alignment goals.

Abstract: Although prior work on bias mitigation has focused on promoting social equality and demographic parity, less attention has been given to aligning LLM’s outputs to desired distributions. For example, we might want to align a model with real-world distributions to support factual grounding. Thus, we define bias as deviation from a desired distribution, which may be an equal or real-world distribution, depending on application goals. We propose a weighted adaptive loss based fine-tuning method that aligns LLM’s gender-profession output distribution with the desired distribution, while preserving language modeling capability. Using 3 profession sets – male-dominated, female-dominated, and gender-balanced – derived from U.S. labor statistics (2024), we assess both our adaptive method for reflecting reality and a non-adaptive variant for equality. Across three masked language models, bias is observed under both distributions. We achieve near-complete mitigation under equality and 30-75% reduction under real-world settings. Autoregressive LLMs show no bias under equality but notable bias under real-world settings, with the Llama Instruct models (3.2-3B, 3.1-8B) achieving a 50-62% reduction.

[20] AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Zhuoyun Yu, Shuofei Qiao, Jintian Zhang, Da Zheng, Yuren Mao, Yunjun Gao, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: AutoMind is an adaptive LLM-agent framework that improves automated data science by incorporating expert knowledge, strategic solution exploration, and dynamic code generation.

Details

Motivation: Existing LLM-driven data science agents are limited by rigid workflows and coding strategies, failing to capture human expertise for complex, innovative tasks.

Method: Three key advances: curated expert knowledge base, agentic knowledgeable tree search algorithm, and self-adaptive coding strategy that dynamically tailors code generation to task complexity.

Result: Superior performance on automated data science benchmarks versus state-of-the-art baselines, with favorable effectiveness, efficiency, and qualitative solution quality.

Conclusion: AutoMind represents an efficient and robust step toward fully automated data science by overcoming limitations of existing frameworks.

Abstract: Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science. Code is at https://github.com/innovatingAI/AutoMind.

[21] EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preference

Kshitish Ghate, Andy Liu, Devansh Jain, Taylor Sorensen, Atoosa Kasirzadeh, Aylin Caliskan, Mona T. Diab, Maarten Sap

Main category: cs.CL

TL;DR: EVALUESTEER is a benchmark for evaluating LLMs’ and reward models’ ability to align with diverse user value and style preferences, showing current models struggle with complex preference profiles.

Details

Motivation: To address the need for pluralistic AI systems that can accommodate diverse global user preferences and values, and fill the gap in existing datasets for controlled evaluation of reward model steering.

Method: Created synthetic dataset of 165,888 preference pairs systematically varying 4 value dimensions and 4 style dimensions, then evaluated 6 LLMs/RMs under 16 prompting conditions and 6 preference comparison scenarios.

Result: Best models achieved <75% accuracy with full user profiles vs >99% accuracy when only relevant preferences were provided, highlighting current limitations in adapting to complex user profiles.

Conclusion: EVALUESTEER reveals significant limitations in current reward models’ ability to identify and adapt to relevant user profile information, providing a challenging testbed for developing more steerable AI systems.

Abstract: As large language models (LLMs) are deployed globally, creating pluralistic systems that can accommodate the diverse preferences and values of users worldwide becomes essential. We introduce EVALUESTEER, a benchmark to measure LLMs’ and reward models’ (RMs) steerability towards users’ value and stylistic preference profiles grounded in psychology and human-LLM interaction literature. To address the gap in existing datasets that do not support controlled evaluations of RM steering, we synthetically generated 165,888 preference pairs – systematically varying pairs along 4 value dimensions (traditional, secular-rational, survival, and self-expression) and 4 style dimensions (verbosity, readability, confidence, and warmth). We use EVALUESTEER to evaluate whether, given a user profile and a pair of candidate value-laden and style-laden responses, LLMs and RMs are able to select the output that aligns with the user’s preferences. We evaluate six open-source and proprietary LLMs and RMs under sixteen systematic prompting conditions and six preference comparison scenarios. Notably, our results show that, when given the user’s full profile of values and stylistic preferences, the best models achieve <75% accuracy at choosing the correct response, in contrast to >99% accuracy when only relevant style and value preferences are provided. EVALUESTEER thus highlights the limitations of current RMs at identifying and adapting to relevant user profile information, and provides a challenging testbed for developing RMs that can be steered towards diverse human values and preferences.

[22] PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

Shufan Li, Aditya Grover

Main category: cs.CL

TL;DR: Predictive Generation (PredGen) reduces LLM latency in voice chat by speculatively generating responses while users are still speaking, enabling faster TTS processing.

Details

Motivation: LLMs in voice chat applications suffer from noticeable latency between user input and audio output, especially on consumer hardware, primarily due to the time needed to generate the first sentence for TTS systems.

Method: Proposes Predictive Generation (PredGen) framework using speculative decoding at input time to generate candidate responses while users are still speaking.

Result: Simulated experiments on Lmsys and MT-Bench datasets show PredGen reduces latency by around 2x across various use cases with minimal additional computation cost.

Conclusion: PredGen effectively mitigates LLM latency in voice chat applications by leveraging otherwise unused computation during user speech to pre-generate responses.

Abstract: Large Language Models (LLMs) are widely used in real-time voice chat applications, typically in combination with text-to-speech (TTS) systems to generate audio responses. However, their large size often leads to noticeable latency between the end of user input and the start of audio output, resulting in suboptimal user experiences. This latency is particularly evident when LLMs are deployed as single-user voice assistants on consumer-grade hardware with limited computing capacity. We discovered that this latency is primarily dominated by the time it takes for the LLMs to generate the first sentence, which is required as input by the TTS systems that synthesize audio responses on a sentence-by-sentence basis. To address this bottleneck, we propose Predictive Generation (PredGen), a novel framework that mitigates-or even eliminates-this delay through speculative decoding at input time. PredGen generates candidate responses while the user is still speaking, enabling the system to begin TTS processing with minimal delay. Simulated experiments on the Lmsys and MT-Bench datasets show that the proposed method can effectively reduce the latency by around 2x across a wide range of use cases, while incurring only minimal additional computation cost at input time-computation that would otherwise go unused.

[23] The Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology

Fagun Patel, Duc Q. Nguyen, Sang T. Truong, Jody Vaynshtok, Sanmi Koyejo, Nick Haber

Main category: cs.CL

TL;DR: This paper presents the first comprehensive benchmark for evaluating multimodal language models (MLMs) in speech-language pathology applications, revealing performance disparities and limitations while showing potential through domain-specific fine-tuning.

Details

Motivation: There is a significant shortage of speech-language pathologists (20:1 ratio to affected children) and limited understanding of MLM performance in clinical settings, creating a need for technological support and systematic evaluation.

Method: Developed a taxonomy of real-world MLM use cases in collaboration with domain experts, created a benchmark with 5,000 manually annotated data points across five core use cases, and evaluated 15 state-of-the-art MLMs with robustness tests under various conditions.

Result: No single MLM consistently outperformed others across all tasks; systematic disparities were found (better performance on male speakers); chain-of-thought prompting degraded performance on certain tasks; domain-specific fine-tuning achieved over 10% improvement over base models.

Conclusion: Current MLMs show both potential and limitations for speech-language pathology applications, highlighting the need for further research and targeted development to address performance disparities and optimize for clinical use.

Abstract: According to the U.S. National Institutes of Health, more than 3.4 million children experience speech disorders that require clinical intervention. The number of speech-language pathologists (SLPs) is roughly 20 times fewer than the number of affected children, highlighting a significant gap in children’s care and a pressing need for technological support that improves the productivity of SLPs. State-of-the-art multimodal language models (MLMs) show promise for supporting SLPs, but their use remains underexplored largely due to a limited understanding of their performance in high-stakes clinical settings. To address this gap, we collaborate with domain experts to develop a taxonomy of real-world use cases of MLMs in speech-language pathologies. Building on this taxonomy, we introduce the first comprehensive benchmark for evaluating MLM across five core use cases, each containing 1,000 manually annotated data points. This benchmark includes robustness and sensitivity tests under various settings, including background noise, speaker gender, and accent. Our evaluation of 15 state-of-the-art MLMs reveals that no single model consistently outperforms others across all tasks. Notably, we find systematic disparities, with models performing better on male speakers, and observe that chain-of-thought prompting can degrade performance on classification tasks with large label spaces and narrow decision boundaries. Furthermore, we study fine-tuning MLMs on domain-specific data, achieving improvements of over 10% compared to base models. These findings highlight both the potential and limitations of current MLMs for speech-language pathology applications, underscoring the need for further research and targeted development.

[24] EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

Firoj Alam, Ali Ezzat Shahroor, Md. Arid Hasan, Zien Sheikh Ali, Hunzalah Hassan Bhatti, Mohamed Bayan Kmainasi, Shammur Absar Chowdhury, Basel Mousi, Fahim Dalvi, Nadir Durrani, Natasa Milic-Frayling

Main category: cs.CL

TL;DR: The paper introduces EverydayMMQA framework and OASIS dataset for culturally-grounded multimodal QA, addressing limitations of current models in low-resource languages and cultural contexts.

Details

Motivation: Large multimodal models fail on culturally grounded knowledge in low-resource languages, requiring datasets that reflect diverse real-world situations and cultural contexts.

Method: Developed EverydayMMQA framework to create large-scale culturally-grounded datasets, resulting in OASIS dataset with 0.92M images, 14.8M QA pairs, and 3.7M spoken questions across speech, text, and image modalities.

Result: Created OASIS dataset covering English and Arabic varieties from 18 countries, enabling four input combinations and testing models on pragmatic, commonsense, and culturally aware reasoning tasks.

Conclusion: EverydayMMQA and OASIS provide benchmark and training data for building multimodal LLMs that handle everyday tasks within cultural contexts, with framework and dataset to be publicly released.

Abstract: Large-scale multimodal models achieve strong results on tasks like Visual Question Answering (VQA), but they often fail when queries require culturally grounded, everyday knowledge, particularly in low-resource and underrepresented languages. To bridge this gap, we introduce Everyday Multimodal and Multilingual QA (EverydayMMQA), a framework for creating large-scale, culturally-grounded datasets for spoken and visual question answering (SVQA). Using this framework, we developed OASIS, a multimodal dataset integrating speech, images, and text. With over ~0.92M images and 14.8M QA pairs, OASIS contains 3.7M spoken questions, enabling four unique input combinations: speech-only, text-only, speech+image, and text+image. Focused on English and Arabic varieties, 18 countries, the dataset content is curated to reflect diverse, real-world situations. OASIS tests models on tasks beyond object recognition that involve pragmatic, commonsense, and culturally aware reasoning. We benchmarked four closed-source models, three open-source models, and one fine-tuned model. EverydayMMQA and OASIS together provide a benchmark and training dataset for building multimodal LLMs for a comprehensive set of everyday tasks within cultural contexts. The framework and dataset will be made publicly available to the community.

[25] Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

Angie Boggust, Donghao Ren, Yannick Assogba, Dominik Moritz, Arvind Satyanarayan, Fred Hohman

Main category: cs.CL

TL;DR: Semantic regexes provide structured language descriptions for LLM features, offering more precise and consistent descriptions than natural language while matching accuracy and enabling new types of model-wide analyses.

Details

Motivation: Natural language descriptions of LLM features are often vague, inconsistent, and require manual relabeling, creating a need for more structured and precise feature description methods.

Method: Semantic regexes combine linguistic and semantic primitives with modifiers for contextualization, composition, and quantification to create structured feature descriptions.

Result: Semantic regexes match natural language accuracy while providing more concise and consistent descriptions, enable quantification of feature complexity across layers, and help users build accurate mental models of feature activations.

Conclusion: Semantic regexes offer a structured alternative to natural language for automated interpretability, improving precision and consistency while enabling scalable analysis from individual features to model-wide patterns.

Abstract: Automated interpretability aims to translate large language model (LLM) features into human understandable descriptions. However, these natural language feature descriptions are often vague, inconsistent, and require manual relabeling. In response, we introduce semantic regexes, structured language descriptions of LLM features. By combining primitives that capture linguistic and semantic feature patterns with modifiers for contextualization, composition, and quantification, semantic regexes produce precise and expressive feature descriptions. Across quantitative benchmarks and qualitative analyses, we find that semantic regexes match the accuracy of natural language while yielding more concise and consistent feature descriptions. Moreover, their inherent structure affords new types of analyses, including quantifying feature complexity across layers, scaling automated interpretability from insights into individual features to model-wide patterns. Finally, in user studies, we find that semantic regex descriptions help people build accurate mental models of LLM feature activations.

[26] Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, Yunpu Ma

Main category: cs.CL

TL;DR: Memory-R1 is an RL framework that enables LLMs to actively manage external memory through specialized agents for memory operations and reasoning, achieving strong performance with minimal training data.

Details

Motivation: LLMs are stateless with limited context windows, and existing memory augmentation approaches are static and heuristic-driven without learned mechanisms for memory management decisions.

Method: Uses reinforcement learning (PPO and GRPO) to train two specialized agents: a Memory Manager that learns structured operations (ADD, UPDATE, DELETE, NOOP) and an Answer Agent that pre-selects and reasons over relevant memory entries.

Result: Outperforms strong baselines with only 152 training QA pairs, generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).

Conclusion: The RL framework successfully enables adaptive memory management with minimal supervision, demonstrating effective learned memory operations for long-horizon reasoning.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of NLP tasks, but they remain fundamentally stateless, constrained by limited context windows that hinder long-horizon reasoning. Recent efforts to address this limitation often augment LLMs with an external memory bank, yet most existing pipelines are static and heuristic-driven, lacking a learned mechanism for deciding what to store, update, or retrieve. We present Memory-R1, a reinforcement learning (RL) framework that equips LLMs with the ability to actively manage and utilize external memory through two specialized agents: a Memory Manager that learns structured operations, including ADD, UPDATE, DELETE, and NOOP; and an Answer Agent that pre-selects and reasons over relevant entries. Both agents are fine-tuned with outcome-driven RL (PPO and GRPO), enabling adaptive memory management with minimal supervision. With only 152 training QA pairs, Memory-R1 outperforms strong baselines and generalizes across diverse question types, three benchmarks (LoCoMo, MSC, LongMemEval), and multiple model scales (3B-14B).

[27] Protecting De-identified Documents from Search-based Linkage Attacks

Pierre Lison, Mark Anderson

Main category: cs.CL

TL;DR: This paper presents a method to prevent search-based linkage attacks on de-identified documents by identifying rare N-grams and using LLM-based rewriting to reformulate them while preserving semantic integrity.

Details

Motivation: Current de-identification models fail to address linkage risks, where attackers can map de-identified text back to its source by searching for unique phrases in the original dataset.

Method: Two-step approach: 1) Build inverted index of N-grams to identify those appearing in fewer than k documents, 2) Use iterative LLM-based rewriting to reformulate these rare spans until linkage becomes impossible.

Result: Experimental results on court cases show the method effectively prevents search-based linkages while maintaining faithfulness to original content.

Conclusion: The proposed method successfully counters search-based linkage attacks on de-identified documents by systematically identifying and rewriting rare phrases, preserving both privacy and semantic integrity.

Abstract: While de-identification models can help conceal the identity of the individual(s) mentioned in a document, they fail to address linkage risks, defined as the potential to map the de-identified text back to its source. One straightforward way to perform such linkages is to extract phrases from the de-identified document and then check their presence in the original dataset. This paper presents a method to counter search-based linkage attacks while preserving the semantic integrity of the text. The method proceeds in two steps. We first construct an inverted index of the N-grams occurring in the document collection, making it possible to efficiently determine which N-grams appear in less than $k$ documents (either alone or in combination with other N-grams). An LLM-based rewriter is then iteratively queried to reformulate those spans until linkage is no longer possible. Experimental results on a collection of court cases show that the method is able to effectively prevent search-based linkages while remaining faithful to the original content.

[28] Controllable Stylistic Text Generation with Train-Time Attribute-Regularized Diffusion

Fan Zhou, Chang Tian, Tim Van de Cruys

Main category: cs.CL

TL;DR: RegDiff is a regularized diffusion framework for controllable text generation that uses attribute features without requiring pretrained classifiers during sampling, achieving better attribute control with lower computational costs.

Details

Motivation: Existing diffusion methods for text generation have limitations - classifier-free guidance preserves semantics but lacks effective attribute control, while classifier guidance enables better attribute alignment but has high computational costs and classifier generalization issues.

Method: Uses VAE-based encoder-decoder architecture for reconstruction fidelity and latent diffusion model trained with attribute supervision. Attribute information is injected only during training, eliminating need for classifiers during sampling.

Result: Outperforms strong baselines on five datasets spanning multiple stylistic attributes, demonstrating effective generation of stylistic texts.

Conclusion: RegDiff provides an efficient solution for attribute-controllable text diffusion that balances computational efficiency with effective attribute control.

Abstract: Generating stylistic text with specific attributes is a key problem in controllable text generation. Recently, diffusion models have emerged as a powerful paradigm for both visual and textual generation. Existing approaches can be broadly categorized into classifier-free guidance (CFG) and classifier guidance (CG) methods. While CFG effectively preserves semantic content, it often fails to provide effective attribute control. In contrast, CG modifies the denoising trajectory using classifier gradients, enabling better attribute alignment but incurring high computational costs during sampling and suffering from classifier generalization issues. In this work, we propose RegDiff, a regularized diffusion framework that leverages attribute features without requiring a pretrained classifier during sampling, thereby achieving controllable generation with reduced computational costs. Specifically, RegDiff employs a VAE-based encoder–decoder architecture to ensure reconstruction fidelity and a latent diffusion model trained with attribute supervision to enable controllable text generation. Attribute information is injected only during training. Experiments on five datasets spanning multiple stylistic attributes demonstrate that RegDiff outperforms strong baselines in generating stylistic texts. These results validate the effectiveness of RegDiff as an efficient solution for attribute-controllable text diffusion. Our code, datasets, and resources will be released upon publication at https://github.com/xxxx.

[29] Reward Model Perspectives: Whose Opinions Do Reward Models Reward?

Elle

Main category: cs.CL

TL;DR: RMs often misalign with demographic groups and reinforce harmful stereotypes; steering prompts are insufficient to fix these biases.

Details

Motivation: To understand RM behavior, measure their alignment with human opinions, and investigate sociodemographic biases in RMs.

Method: Formalize a framework for measuring RM opinion alignment, study biases across demographics, and explore prompting effects to steer rewards.

Result: RMs show poor alignment with several demographic groups and systematically reward harmful stereotypes; steering alone cannot overcome these limitations.

Conclusion: Careful consideration of RM behavior is needed in model alignment to prevent propagation of unwanted social biases in language technologies.

Abstract: Reward models (RMs) are central to the alignment of language models (LMs). An RM often serves as a proxy for human preferences to guide downstream LM behavior. However, our understanding of RM behavior is limited. Our work (i) formalizes a framework for measuring the alignment of opinions captured by RMs, (ii) investigates the extent to which RMs demonstrate sociodemographic biases, and (iii) explores the effects of prompting to steer rewards towards the preferences of a target group. We study the subjective and diverse perspectives on controversial topics, which allows us to quantify RM perspectives in terms of their opinions, attitudes, and values. We show that RMs are poorly aligned with several demographic groups and can systematically reward harmful stereotypes, and steering alone is not enough to overcome these limitations. Our findings underscore the need for more careful consideration of RM behavior in model alignment during preference learning to prevent the propagation of unwanted social biases in the language technologies that we use.

[30] Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?

R. Alexander Knipper, Indrani Dey, Souvika Sarkar, Hari Narayanan, Sadhana Puntambekar, Santu Karmaker

Main category: cs.CL

TL;DR: A framework using LLMs to help teachers generate instructional goal-aligned questions for virtual labs through natural language interaction, improving question quality and alignment with pedagogical goals.

Details

Motivation: Teachers struggle to adapt virtual labs to their instructional goals due to misaligned third-party materials and the difficulty of developing custom resources.

Method: An alignment framework with four components: instructional goal understanding via teacher-LLM dialogue, lab understanding via knowledge analysis, question taxonomy for cognitive intent, and TELeR taxonomy for prompt control.

Result: The framework improved question quality by 0.29-0.39 points, achieved 80% parsability and >90% format adherence, with larger models showing the strongest gains (+37.1% parsability, +25.7% adherence, +0.8 quality points).

Conclusion: LLMs can effectively generate pedagogically meaningful, simulation-aligned questions for virtual labs when guided by proper instructional frameworks and taxonomies.

Abstract: Virtual Labs offer valuable opportunities for hands-on, inquiry-based science learning, yet teachers often struggle to adapt them to fit their instructional goals. Third-party materials may not align with classroom needs, and developing custom resources can be time-consuming and difficult to scale. Recent advances in Large Language Models (LLMs) offer a promising avenue for addressing these limitations. In this paper, we introduce a novel alignment framework for instructional goal-aligned question generation, enabling teachers to leverage LLMs to produce simulation-aligned, pedagogically meaningful questions through natural language interaction. The framework integrates four components: instructional goal understanding via teacher-LLM dialogue, lab understanding via knowledge unit and relationship analysis, a question taxonomy for structuring cognitive and pedagogical intent, and the TELeR taxonomy for controlling prompt detail. Early design choices were informed by a small teacher-assisted case study, while our final evaluation analyzed over 1,100 questions from 19 open-source LLMs. With goal and lab understanding grounding questions in teacher intent and simulation context, the question taxonomy elevates cognitive demand (open-ended formats and relational types raise quality by 0.29-0.39 points), and optimized TELeR prompts enhance format adherence (80% parsability, >90% adherence). Larger models yield the strongest gains: parsability +37.1%, adherence +25.7%, and average quality +0.8 Likert points.

[31] FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

Yitao Long, Tiansheng Hu, Yilun Zhao, Arman Cohan, Chen Zhao

Main category: cs.CL

TL;DR: FinLFQA is a benchmark for evaluating LLMs’ ability to generate long-form financial answers with reliable attributions, assessing evidence extraction, numerical reasoning, and domain knowledge.

Details

Motivation: Existing benchmarks focus on simple attribution with textual evidence, but real-world financial applications require more nuanced attribution including numerical reasoning and domain knowledge.

Method: Created FinLFQA benchmark with human annotations evaluating three attribution aspects: supporting evidence from financial reports, intermediate numerical reasoning steps, and domain-specific financial knowledge. Also developed automatic evaluation framework for answer and attribution quality.

Result: Experiments on eight LLMs across multiple attribution paradigms showed that fine-grained metrics distinguish model capabilities, end-to-end generation performs comparably to post-hoc approaches, and iterative refinement only helps with external feedback.

Conclusion: The FinLFQA benchmark provides comprehensive evaluation of LLM attribution in financial contexts, revealing important insights about model performance and refinement strategies.

Abstract: Large Language Models (LLMs) frequently hallucinate to long-form questions, producing plausible yet factually incorrect answers. A common mitigation strategy is to provide attribution to LLM outputs. However, existing benchmarks primarily focus on simple attribution that retrieves supporting textual evidence as references. We argue that in real-world scenarios such as financial applications, attribution goes beyond reference retrieval. We introduce FinLFQA, a benchmark designed to evaluate the ability of LLMs to generate long-form answers to complex financial questions with reliable and nuanced attributions. FinLFQA evaluates three critical aspects of attribution through human annotations: (1) supporting evidence extracted from financial reports, (2) intermediate numerical reasoning steps, and (3) domain-specific financial knowledge that informs the reasoning process. We further provide an automatic evaluation framework covering both answer quality and attribution quality. Through extensive experiments on eight LLMs across multiple attribution-generation paradigms, we find that fine-grained metrics are important to distinguish model capabilities, that end-to-end generation achieves comparable performance to post-hoc approaches, and that iterative refinement only helps when guided by external feedback.

[32] Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser

Elena Chistova

Main category: cs.CL

TL;DR: UniRST is the first unified RST-style discourse parser that handles 18 treebanks in 11 languages without modifying their relation inventories, using parameter-efficient training strategies.

Details

Motivation: To overcome inventory incompatibilities across different RST treebanks and enable unified multilingual discourse parsing with a single model.

Method: Proposed two training strategies: Multi-Head (separate relation classification layers per inventory) and Masked-Union (shared parameter training with selective label masking). Also used augmentation for low-resource settings.

Result: Masked-Union approach was the strongest and most parameter-efficient. UniRST outperformed 16 of 18 mono-treebank baselines, demonstrating superior performance across diverse multilingual resources.

Conclusion: A single unified model for multilingual discourse parsing is feasible and advantageous, with the Masked-Union strategy providing the best balance of performance and efficiency.

Abstract: We introduce UniRST, the first unified RST-style discourse parser capable of handling 18 treebanks in 11 languages without modifying their relation inventories. To overcome inventory incompatibilities, we propose and evaluate two training strategies: Multi-Head, which assigns separate relation classification layer per inventory, and Masked-Union, which enables shared parameter training through selective label masking. We first benchmark monotreebank parsing with a simple yet effective augmentation technique for low-resource settings. We then train a unified model and show that (1) the parameter efficient Masked-Union approach is also the strongest, and (2) UniRST outperforms 16 of 18 mono-treebank baselines, demonstrating the advantages of a single-model, multilingual end-to-end discourse parsing across diverse resources.

[33] MathRobust-LV: Evaluation of Large Language Models’ Robustness to Linguistic Variations in Mathematical Reasoning

Neeraja Kirtane, Yuvraj Khanna, Peter Relan

Main category: cs.CL

TL;DR: The paper introduces MathRobust-LV, a test set for evaluating LLM robustness to linguistic variation in math problems, showing accuracy drops when problems are rephrased while keeping numerical structure constant.

Details

Motivation: To assess math reasoning robustness in real educational settings where instructors rephrase problems with varied linguistic expressions while maintaining identical concepts and difficulty levels.

Method: Created MathRobust-LV test set by changing surface details (names, contexts, variables) while preserving numerical structure and answers, then evaluated 34 models on baseline vs. rephrased variants.

Result: Accuracy declined when moving from baseline to variants, with severe drops for smaller models (9-11%) and measurable degradation for stronger models. Frontier models like GPT-5 and Gemini-2.5pro remained comparatively stable.

Conclusion: Robustness to linguistic variation is a fundamental challenge that exposes reasoning vulnerabilities in models, even when MATH data benchmarking appears saturated.

Abstract: Large language models excel on math benchmarks, but their math reasoning robustness to linguistic variation is underexplored. While recent work increasingly treats high-difficulty competitions like the IMO as the gold standard for evaluating reasoning, we believe in comprehensive benchmarking of high school-level math problems in real educational settings. We introduce MathRobust-LV, a test set and evaluation methodology that mirrors how instructors rephrase problems across assessments while keeping difficulty constant: we change surface details (names, contexts, variables) while preserving numerical structure and answers. In contrast to prior efforts that alter problem content or emphasize IMO-level tasks, we focus on high-school-level dataset problems at the difficulty level where models are currently deployed in educational settings: tutoring and assessment systems. In these applications, instructors rephrase identical concepts in varied ways, making linguistic robustness essential for reliable deployment. Although MATH data benchmarking is often regarded as saturated, our experiment on 34 models reveals that accuracy declines when moving from the baseline to the variants. These drops are severe for smaller models (9-11%) while stronger models also show measurable degradation. Frontier models like GPT-5, Gemini-2.5pro remain comparatively stable. Our results highlight that robustness to linguistic variation is a fundamental challenge, exposing reasoning vulnerabilities in models.

[34] A Survey on Agentic Security: Applications, Threats and Defenses

Asif Shahriar, Md Nafiu Rahman, Sadif Ahmed, Farig Sadeque, Md Rizwan Parvez

Main category: cs.CL

TL;DR: First comprehensive survey of LLM agent security landscape covering applications, threats, and defenses with analysis of 150+ papers.

Details

Motivation: The shift from passive LLMs to autonomous agents introduces new security risks in cybersecurity that need systematic study.

Method: Holistic survey structured around three pillars: Applications, Threats, and Defenses, with comprehensive taxonomy of over 150 papers.

Result: Detailed cross-cutting analysis reveals emerging trends in agent architecture and identifies critical research gaps in model and modality coverage.

Conclusion: The survey provides foundational understanding of agentic security landscape and highlights areas needing further research.

Abstract: The rapid shift from passive LLMs to autonomous LLM-agents marks a new paradigm in cybersecurity. While these agents can act as powerful tools for both offensive and defensive operations, the very agentic context introduces a new class of inherent security risks. In this work we present the first holistic survey of the agentic security landscape, structuring the field around three interdependent pillars: Applications, Threats, and Defenses. We provide a comprehensive taxonomy of over 150 papers, explaining how agents are used, the vulnerabilities they possess, and the countermeasures designed to protect them. A detailed cross-cutting analysis shows emerging trends in agent architecture while revealing critical research gaps in model and modality coverage.

[35] Linguistically Informed Tokenization Improves ASR for Underresourced Languages

Massimo Daul, Alessio Tosolini, Claire Bowern

Main category: cs.CL

TL;DR: Fine-tuning wav2vec2 ASR model on Yan-nhangu language shows phonemic tokenization outperforms orthographic tokenization, and ASR-assisted transcription is faster than manual transcription for underresourced languages.

Details

Motivation: Modern ASR systems require large datasets, making them unsuitable for underresourced languages like Yan-nhangu, an Indigenous Australian language. The study aims to adapt ASR for language documentation of such languages.

Method: Fine-tuned wav2vec2 ASR model on Yan-nhangu language data, comparing phonemic vs orthographic tokenization strategies, and evaluated ASR’s role in language documentation pipeline.

Result: Phonemic tokenization significantly improved Word Error Rate (WER) and Character Error Rate (CER) compared to orthographic tokenization. ASR-assisted transcription was much faster than manual transcription from scratch.

Conclusion: ASR can be effectively adapted for underresourced languages using appropriate tokenization strategies, making it a viable tool for language documentation pipelines despite data limitations.

Abstract: Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems use data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec2 ASR model on Yan-nhangu, a dormant Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR’s viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves WER and CER compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can work for underresourced languages.

[36] Test-Time Scaling of Reasoning Models for Machine Translation

Zihao Li, Shaoxiong Ji, Jörg Tiedemann

Main category: cs.CL

TL;DR: Test-time scaling (TTS) provides limited benefits for direct machine translation with general-purpose models but becomes effective with domain-specific fine-tuning and in post-editing workflows.

Details

Motivation: To investigate whether increased inference-time computation improves translation quality, as TTS has shown success in other reasoning tasks but remains underexplored in machine translation.

Method: Evaluated 12 reasoning models across diverse MT benchmarks using three scenarios: direct translation, forced-reasoning extrapolation, and post-editing.

Result: TTS shows limited benefits for direct translation with general models (performance plateaus quickly), but domain-specific fine-tuning enables consistent improvements up to optimal reasoning depth. Forcing models beyond natural stopping point degrades quality, while post-editing reliably improves translations.

Conclusion: The value of inference-time computation in MT lies not in single-pass translation with general models, but in multi-step self-correction workflows and task-specialized models.

Abstract: Test-time scaling (TTS) has enhanced the performance of Reasoning Models (RMs) on various tasks such as math and coding, yet its efficacy in machine translation (MT) remains underexplored. This paper investigates whether increased inference-time computation improves translation quality. We evaluate 12 RMs across a diverse suite of MT benchmarks spanning multiple domains, examining three scenarios: direct translation, forced-reasoning extrapolation, and post-editing. Our findings show that for general-purpose RMs, TTS provides limited and inconsistent benefits for direct translation, with performance quickly plateauing. However, the effectiveness of TTS is unlocked by domain-specific fine-tuning, which aligns a model’s reasoning process with task requirements, leading to consistent improvements up to an optimal, self-determined reasoning depth. We also find that forcing a model to reason beyond its natural stopping point consistently degrades translation quality. In contrast, TTS proves highly effective in a post-editing context, reliably turning self-correction into a beneficial process. These results indicate that the value of inference-time computation in MT lies not in enhancing single-pass translation with general models, but in targeted applications like multi-step, self-correction workflows and in conjunction with task-specialized models.

[37] Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao

Main category: cs.CL

TL;DR: Webscale-RL pipeline converts pre-training documents into diverse QA pairs for RL, achieving 100x efficiency over continual pre-training with 1.2M examples across 9 domains.

Details

Motivation: Address the training-generation gap in LLMs and overcome the data bottleneck in RL applications by creating web-scale RL datasets comparable to pre-training corpora.

Method: Developed Webscale-RL pipeline that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for reinforcement learning.

Result: Created Webscale-RL dataset with 1.2 million examples across 9+ domains. RL training with this dataset significantly outperforms continual pre-training and baselines, achieving same performance with up to 100x fewer tokens.

Conclusion: Webscale-RL presents a viable path to scale RL to pre-training levels, enabling more capable and efficient language models by bridging the data gap between imitation learning and reinforcement learning.

Abstract: Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100$\times$ fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.

[38] From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

Seng Pei Liew, Takuya Kato

Main category: cs.CL

TL;DR: Bootstrapped pretraining becomes less effective as base models are more overtrained, with scaling efficiency diminishing logarithmically based on the number of tokens used in initial pretraining.

Details

Motivation: To understand the effectiveness of bootstrapped pretraining (reusing pretrained models for further training) and how it scales, especially when applied to overtrained base models, to reduce training costs.

Method: Empirical study of bootstrapped pretraining scaling behavior, analyzing how scaling exponent decreases with base model pretraining tokens and developing a simple scaling law model.

Result: Found that scaling efficiency diminishes predictably - the scaling exponent decreases logarithmically with the number of tokens used to pretrain the base model, revealing a saturation effect.

Conclusion: There’s a fundamental trade-off in multi-stage pretraining: more extensively pretrained models provide less additional benefit from bootstrapping, offering practical insights for efficient training and considerations for reusing overtrained models.

Abstract: Bootstrapped pretraining, i.e., the reuse of a pretrained base model for further pretraining, such as continual pretraining or model growth, is promising at reducing the cost of training language models from scratch. However, its effectiveness remains unclear, especially when applied to overtrained base models. In this work, we empirically study the scaling behavior of bootstrapped pretraining and find that its scaling efficiency diminishes in a predictable manner: The scaling exponent with respect to second-stage pretraining tokens decreases logarithmically with the number of tokens used to pretrain the base model. The joint dependence on first- and second-stage tokens is accurately modeled by a simple scaling law. Such saturation effect reveals a fundamental trade-off in multi-stage pretraining strategies: the more extensively a model is pretrained, the less additional benefit bootstrapping provides. Our findings provide practical insights for efficient language model training and raise important considerations for the reuse of overtrained models.

[39] Flipping the Dialogue: Training and Evaluating User Language Models

Tarek Naous, Philippe Laban, Wei Xu, Jennifer Neville

Main category: cs.CL

TL;DR: Assistant LMs make poor user simulators - better assistants yield worse simulators. Purpose-built User LMs better simulate human behavior and reveal that realistic user simulations cause assistant performance to drop significantly.

Details

Motivation: To evaluate LM performance in realistic multi-turn conversations, since current methods use assistant LMs as user simulators which don't accurately reflect how real users interact (with imperfect, evolving requests).

Method: Introduce purpose-built User Language Models (User LMs) - models post-trained specifically to simulate human users in multi-turn conversations, rather than using assistant LMs as simulators.

Result: User LMs align better with human behavior and achieve better simulation robustness. When used to simulate coding and math conversations, GPT-4o’s performance drops from 74.6% to 57.4%, showing assistants struggle with realistic user nuances.

Conclusion: Realistic user simulation environments reveal that current assistants fail to cope with the nuances of real users in multi-turn conversations, highlighting the need for purpose-built user simulators rather than repurposed assistant models.

Abstract: Conversations with LMs involve two participants: a human user leading the conversation, and an LM assistant responding to the user’s request. To satisfy this specific role, LMs are post-trained to be helpful assistants – optimized to produce exhaustive and well-structured responses, free of ambiguity and grammar errors. User utterances, on the other hand, are rarely perfected, with each user phrasing requests in unique ways, sometimes putting in partial effort at each turn and refining on the fly. To evaluate LM performance in realistic settings, prior work simulated users in multi-turn conversations, often prompting an LLM originally trained to be a helpful assistant to act as a user. However, we show that assistant LMs make for poor user simulators, with the surprising finding that better assistants yield worse simulators. Instead, we introduce purpose-built User Language Models (User LMs) - models post-trained to simulate human users in multi-turn conversations. Through various evaluations, we show how User LMs align better with human behavior and achieve better simulation robustness than existing simulation methods. When leveraging User LMs to simulate coding and math conversations, the performance of a strong assistant (GPT-4o) drops from 74.6% to 57.4%, confirming that more realistic simulation environments lead to assistant struggles as they fail to cope with the nuances of users in multi-turn setups.

[40] The Algebra of Meaning: Why Machines Need Montague More Than Moore’s Law

Cheonkam Jeong, Sungdo Kim, Jewoo Park

Main category: cs.CL

TL;DR: The paper proposes Savassan, a neuro-symbolic architecture that treats language model alignment as a parsing problem, compiling utterances into typed logical forms to address hallucination and compliance issues through type-theoretic semantics.

Details

Motivation: Current language models frequently hallucinate and struggle with compliance because they lack proper type-theoretic semantics to handle different meaning types (descriptive, normative, legal) in a compositional way.

Method: Savassan uses neural components to extract candidate structures from inputs and symbolic components for type checking, constraint reasoning, and cross-jurisdiction mapping. It compiles utterances into Montague-style logical forms mapped to typed ontologies with deontic operators.

Result: The system enables “parse once, project many” capability where a single parsed structure can be evaluated across multiple legal jurisdictions (e.g., defamation risk in Korea/Japan, protected opinion in US, GDPR checks in EU) to produce compliance-aware guidance.

Conclusion: Trustworthy AI autonomy requires compositional typing of meaning to enable systems to reason about descriptive content, normative prescriptions, and legal liabilities within a unified algebra of meaning.

Abstract: Contemporary language models are fluent yet routinely mis-handle the types of meaning their outputs entail. We argue that hallucination, brittle moderation, and opaque compliance outcomes are symptoms of missing type-theoretic semantics rather than data or scale limitations. Building on Montague’s view of language as typed, compositional algebra, we recast alignment as a parsing problem: natural-language inputs must be compiled into structures that make explicit their descriptive, normative, and legal dimensions under context. We present Savassan, a neuro-symbolic architecture that compiles utterances into Montague-style logical forms and maps them to typed ontologies extended with deontic operators and jurisdictional contexts. Neural components extract candidate structures from unstructured inputs; symbolic components perform type checking, constraint reasoning, and cross-jurisdiction mapping to produce compliance-aware guidance rather than binary censorship. In cross-border scenarios, the system “parses once” (e.g., defect claim(product x, company y)) and projects the result into multiple legal ontologies (e.g., defamation risk in KR/JP, protected opinion in US, GDPR checks in EU), composing outcomes into a single, explainable decision. This paper contributes: (i) a diagnosis of hallucination as a type error; (ii) a formal Montague-ontology bridge for business/legal reasoning; and (iii) a production-oriented design that embeds typed interfaces across the pipeline. We outline an evaluation plan using legal reasoning benchmarks and synthetic multi-jurisdiction suites. Our position is that trustworthy autonomy requires compositional typing of meaning, enabling systems to reason about what is described, what is prescribed, and what incurs liability within a unified algebra of meaning.

[41] TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents

Haofei Yu, Keyang Xuan, Fenghai Li, Kunlun Zhu, Zijie Lei, Jiaxun Zhang, Ziheng Qi, Kyle Richardson, Jiaxuan You

Main category: cs.CL

TL;DR: TinyScientist is an interactive, extensible framework that simplifies building and maintaining complex automatic research workflows using LLMs, addressing the growing complexity of multi-agent systems and research automation.

Details

Motivation: As automatic research with LLMs becomes more important, the complexity of multi-agent systems, planning, tool usage, and human-agent interaction creates significant challenges for extending and maintaining these workflows as algorithms advance.

Method: Identifies essential components of automatic research workflow and proposes an interactive, extensible, and controllable framework that easily adapts to new tools and supports iterative growth.

Result: Provides an open-source codebase, interactive web demonstration, and PyPI Python package to make state-of-the-art auto-research pipelines broadly accessible to researchers and developers.

Conclusion: TinyScientist successfully addresses the complexity challenges in automatic research workflows by providing an adaptable framework that supports easy extension and maintenance while making advanced research automation accessible to the broader community.

Abstract: Automatic research with Large Language Models (LLMs) is rapidly gaining importance, driving the development of increasingly complex workflows involving multi-agent systems, planning, tool usage, code execution, and human-agent interaction to accelerate research processes. However, as more researchers and developers begin to use and build upon these tools and platforms, the complexity and difficulty of extending and maintaining such agentic workflows have become a significant challenge, particularly as algorithms and architectures continue to advance. To address this growing complexity, TinyScientist identifies the essential components of the automatic research workflow and proposes an interactive, extensible, and controllable framework that easily adapts to new tools and supports iterative growth. We provide an open-source codebase, an interactive web demonstration, and a PyPI Python package to make state-of-the-art auto-research pipelines broadly accessible to every researcher and developer.

[42] Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?

Sri Durga Sai Sowmya Kadali, Evangelos E. Papalexakis

Main category: cs.CL

TL;DR: This paper investigates jailbreaking in LLMs by analyzing internal representations, focusing on how hidden layers respond differently to jailbreak vs benign prompts in GPT-J and Mamba2 models.

Details

Motivation: Jailbreaking LLMs is a critical security concern as adversarial users exploit carefully engineered prompts to elicit restricted outputs, and existing defenses are insufficient against evolving attack techniques.

Method: Analyzed internal representations of LLMs by examining how hidden layers respond to jailbreak versus benign prompts, specifically studying GPT-J and Mamba2 models.

Result: Found distinct layer-wise behaviors in how models process jailbreak vs benign prompts, with preliminary findings suggesting internal dynamics differ significantly between attack and normal inputs.

Conclusion: The distinct layer-wise behaviors observed provide promising directions for future research on leveraging internal model dynamics for more robust jailbreak detection and defense mechanisms.

Abstract: Jailbreaking large language models (LLMs) has emerged as a pressing concern with the increasing prevalence and accessibility of conversational LLMs. Adversarial users often exploit these models through carefully engineered prompts to elicit restricted or sensitive outputs, a strategy widely referred to as jailbreaking. While numerous defense mechanisms have been proposed, attackers continuously develop novel prompting techniques, and no existing model can be considered fully resistant. In this study, we investigate the jailbreak phenomenon by examining the internal representations of LLMs, with a focus on how hidden layers respond to jailbreak versus benign prompts. Specifically, we analyze the open-source LLM GPT-J and the state-space model Mamba2, presenting preliminary findings that highlight distinct layer-wise behaviors. Our results suggest promising directions for further research on leveraging internal model dynamics for robust jailbreak detection and defense.

[43] A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures

Nhat M. Hoang, Do Xuan Long, Cong-Duy Nguyen, Min-Yen Kan, Luu Anh Tuan

Main category: cs.CL

TL;DR: First unified analysis of representation propagation in State Space Models (SSMs) vs Transformer-Based Models (TBMs), revealing divergent patterns: TBMs rapidly homogenize tokens then re-diversify later, while SSMs preserve uniqueness early but converge to homogenization deeper.

Details

Motivation: SSMs have emerged as efficient alternatives to TBMs for long sequences, but how contextual information flows across layers and tokens remains understudied.

Method: Used centered kernel alignment, stability metrics, probing, theoretical analysis, and parameter randomization to analyze representation evolution within and across layers.

Result: Found key divergence: TBMs rapidly homogenize token representations with diversity reemerging later, while SSMs preserve token uniqueness early but converge to homogenization deeper. Oversmoothing in TBMs stems from architectural design, while in SSMs it arises from training dynamics.

Conclusion: Insights clarify inductive biases of both architectures and inform future model and training designs for long-context reasoning.

Abstract: State Space Models (SSMs) have recently emerged as efficient alternatives to Transformer-Based Models (TBMs) for long-sequence processing, offering linear scaling and lower memory use. Yet, how contextual information flows across layers and tokens in these architectures remains understudied. We present the first unified, token- and layer-level analysis of representation propagation in SSMs and TBMs. Using centered kernel alignment, stability metrics, and probing, we characterize how representations evolve within and across layers. We find a key divergence: TBMs rapidly homogenize token representations, with diversity reemerging only in later layers, while SSMs preserve token uniqueness early but converge to homogenization deeper. Theoretical analysis and parameter randomization further reveal that oversmoothing in TBMs stems from architectural design, whereas in SSMs it arises mainly from training dynamics. These insights clarify the inductive biases of both architectures and inform future model and training designs for long-context reasoning.

[44] Aligning Large Language Models via Fully Self-Synthetic Data

Shangjian Yin, Zhepei Wei, Xinyu Zhu, Wei-Lin Chen, Yu Meng

Main category: cs.CL

TL;DR: SAO is a fully self-synthetic framework for LLM alignment that generates all training data (prompts, responses, preferences) using the model itself through persona role-play and self-evaluation.

Details

Motivation: Traditional RLHF requires expensive human-annotated datasets, and RLAIF also incurs significant costs for collecting prompts, responses, and using external reward models or proprietary models for preference annotation.

Method: SAO instructs the LLM to engage in persona role-play to generate diverse prompts and responses, then self-evaluates these for preference optimization, creating a fully self-synthetic training framework.

Result: SAO effectively enhances the model’s chat capabilities on standard benchmarks like AlpacaEval 2.0 while maintaining strong performance on downstream objective tasks such as question-answering and math reasoning.

Conclusion: SAO provides a practical solution for self-improvement in aligning LLMs, offering a cost-effective alternative to traditional RLHF and RLAIF methods.

Abstract: Traditional reinforcement learning from human feedback (RLHF) for large language models (LLMs) relies on expensive human-annotated datasets, while Reinforcement Learning from AI Feedback (RLAIF) also incurs significant costs, requiring the collection of diverse prompts and corresponding responses, often necessitating external reward models or proprietary models like GPT-4 to annotate preference pairs. In this work, we introduce Self-Alignment Optimization (SAO), a fully self-synthetic framework for LLM alignment, where all training data, including prompts (i.e., user queries), responses, and preferences, are generated by the model itself. Specifically, SAO first instructs the LLM to engage in persona role-play and generate diverse prompts and responses, which are then self-evaluated for preference optimization. Extensive experiments demonstrate that SAO effectively enhances the model’s chat capabilities on standard benchmarks like AlpacaEval~2.0, while maintaining strong performance on downstream objective tasks (e.g., question-answering, math reasoning). Our work provides a practical solution for self-improvement in aligning LLMs, and the code for reproducing our results is available at: https://github.com/SJY8460/SAO.

[45] ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory

Yunzhong Xiao, Yangmin Li, Hewei Wang, Yunlong Tang, Zora Zhiruo Wang

Main category: cs.CL

TL;DR: ToolMem enables AI agents to develop memory of tool capabilities from previous interactions, allowing them to select optimal tools for specific tasks rather than relying on fixed tools.

Details

Motivation: Current AI agents typically use fixed tools, limiting their ability to select the most suitable tool for specific tasks, unlike humans who learn tool capabilities through interaction.

Method: ToolMem enables agents to develop memories of tool capabilities by summarizing strengths and weaknesses from previous interactions and storing them in memory for retrieval during inference.

Result: ToolMem-augmented agents predicted tool performance 14.8% and 28.7% more accurately across text and multimodal generation scenarios, and improved optimal tool selection by 21% and 24% in respective scenarios.

Conclusion: ToolMem successfully enables agents to learn from tool interactions and apply this knowledge to select better tools, significantly improving performance in both text and multimodal generation tasks.

Abstract: Agents utilizing tools powered by large language models (LLMs) or vision-language models (VLMs) have demonstrated remarkable progress in diverse tasks across text and visual modalities. Unlike traditional tools such as calculators, which give deterministic outputs, neural tools perform uncertainly across task scenarios. While different tools for a task may excel in varied scenarios, existing agents typically rely on fixed tools, thus limiting the flexibility in selecting the most suitable tool for specific tasks. In contrast, humans snowball their understanding of the capabilities of different tools by interacting with them, and apply this knowledge to select the optimal tool when solving a future task. To build agents that similarly benefit from this process, we propose ToolMem that enables agents to develop memories of tool capabilities from previous interactions, by summarizing their strengths and weaknesses and storing them in memory; at inference, the agent can retrieve relevant entries from ToolMem, and select the best tool to solve individual tasks more accurately. We evaluate ToolMem on learning varied text generation and text-to-image generation neural tools. Compared to no-memory, generic agents, we find ToolMem-augmented agents predict tool performance 14.8% and 28.7% more accurately across text and multimodal generation scenarios. Moreover, ToolMem facilitates optimal tool selection among multiple choices by 21% and 24% absolute increases in respective scenarios.

[46] PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

Shangjian Yin, Shining Liang, Wenbiao Ding, Yuli Qian, Zhouxing Shi, Hongzhi Li, Yutao Xie

Main category: cs.CL

TL;DR: PiKa introduces data-efficient alignment datasets that achieve expert-level performance with only 30k SFT examples, outperforming models trained on much larger datasets and even proprietary models.

Details

Motivation: Existing alignment datasets are either private or require costly human annotation, limiting reproducibility and scalability. Current approaches use over 300k examples but still underperform proprietary models, creating barriers for academic and resource-limited communities.

Method: Developed PiKa, a family of expert-level alignment datasets, with PiKa-SFT using only 30k supervised fine-tuning examples. Evaluated by fine-tuning Llama-3-8B-Base and Qwen2.5 series models on PiKa and other public datasets.

Result: PiKa-SFT outperforms models trained on much larger data and even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples on AlpacaEval 2.0 and Arena-Hard benchmarks. Consistent gains achieved across Qwen2.5 series models (0.5B to 7B).

Conclusion: High-quality alignment can be achieved with significantly less data, offering a scalable path for open-source LLM alignment and reducing barriers for academic and resource-limited communities.

Abstract: Reinforcement Learning from Human Feedback (RLHF) has become a cornerstone for aligning large language models (LLMs). However, its effectiveness depends on high-quality instruction data. Most existing alignment datasets are either private or require costly human annotation, which limits reproducibility and scalability. Even with Reinforcement Learning from AI Feedback (RLAIF), concerns about data quality remain. Moreover, it is unclear how much data is actually required to fine-tune a base model into a strong instruction-following model. Current approaches often rely on over 300k examples even at the supervised fine-tuning (SFT) stage, yet they still underperform compared to proprietary models, creating barriers for academic and resource-limited communities. To address this gap, we introduce PiKa, a data-efficient family of expert-level alignment datasets. In particular, the PiKa-SFT dataset uses only 30k SFT examples, far fewer than state-of-the-art datasets like Magpie. Through evaluations by fine-tuning Llama-3-8B-Base on PiKa and other public datasets, we show that PiKa-SFT outperforms models trained on much larger data. On AlpacaEval 2.0 and Arena-Hard benchmarks, PiKa-SFT fine-tuning even surpasses the official Llama-3-8B-Instruct model trained on over 10 million proprietary examples. We further extend our study by training the Qwen2.5 series (0.5B to 7B) on PiKa-SFT, achieving consistent gains. These findings demonstrate that high-quality alignment can be achieved with significantly less data, offering a scalable path for open-source LLM alignment. Code and data: https://github.com/SJY8460/PiKa.

[47] WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

Yongan Yu, Xianda Du, Qingchen Hu, Jiahao Liang, Jingwei Ni, Dan Qiang, Kaiyu Huang, Grant McKenzie, Renee Sieber, Fengran Mo

Main category: cs.CL

TL;DR: WeatherArchive-Bench is the first benchmark for evaluating RAG systems on historical weather archives, featuring retrieval and assessment tasks to measure system performance on locating relevant passages and classifying societal vulnerability/resilience indicators from extreme weather narratives.

Details

Motivation: Historical weather archives contain rich qualitative accounts of societal responses to extreme weather events, but their vast scale, noisy digitization, and archaic language make them difficult to transform into structured knowledge for climate research.

Method: Created WeatherArchive-Bench with two tasks: WeatherArchive-Retrieval (locating relevant passages from over 1M archival news segments) and WeatherArchive-Assessment (evaluating LLMs’ ability to classify vulnerability and resilience indicators). Tested sparse, dense, and re-ranking retrievers plus various LLMs.

Result: Dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings reveal limitations in reasoning about complex societal indicators.

Conclusion: The benchmark highlights key challenges in processing historical weather archives and provides insights for designing more robust climate-focused RAG systems. The dataset and framework are publicly available.

Abstract: Historical archives on weather events are collections of enduring primary source records that offer rich, untapped narratives of how societies have experienced and responded to extreme weather events. These qualitative accounts provide insights into societal vulnerability and resilience that are largely absent from meteorological records, making them valuable for climate scientists to understand societal responses. However, their vast scale, noisy digitized quality, and archaic language make it difficult to transform them into structured knowledge for climate research. To address this challenge, we introduce WeatherArchive-Bench, the first benchmark for evaluating retrieval-augmented generation (RAG) systems on historical weather archives. WeatherArchive-Bench comprises two tasks: WeatherArchive-Retrieval, which measures a system’s ability to locate historically relevant passages from over one million archival news segments, and WeatherArchive-Assessment, which evaluates whether Large Language Models (LLMs) can classify societal vulnerability and resilience indicators from extreme weather narratives. Extensive experiments across sparse, dense, and re-ranking retrievers, as well as a diverse set of LLMs, reveal that dense retrievers often fail on historical terminology, while LLMs frequently misinterpret vulnerability and resilience concepts. These findings highlight key limitations in reasoning about complex societal indicators and provide insights for designing more robust climate-focused RAG systems from archival contexts. The constructed dataset and evaluation framework are publicly available at https://anonymous.4open.science/r/WeatherArchive-Bench/.

[48] Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback

Yisha Wu, Cen, Zhao, Yuanpei Cao, Xiaoqing Su, Yashar Mehdad, Mindy Ji, Claire Na Cheng

Main category: cs.CL

TL;DR: Incremental summarization system for customer support that generates concise bullet notes during conversations using fine-tuned Mixtral-8x7B model and DeBERTa classifier, achieving 3% reduction in case handling time.

Details

Motivation: To reduce customer support agents' context-switching effort and redundant review by providing timely, concise summaries during conversations rather than bulk summarization.

Method: Combines fine-tuned Mixtral-8x7B model for continuous note generation with DeBERTa-based classifier to filter trivial content, and uses agent edits to refine online generation and inform offline model retraining.

Result: 3% reduction in case handling time compared to bulk summarization (up to 9% in highly complex cases), with high agent satisfaction ratings from surveys.

Conclusion: Incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale in production environments.

Abstract: We introduce an incremental summarization system for customer support agents that intelligently determines when to generate concise bullet notes during conversations, reducing agents’ context-switching effort and redundant review. Our approach combines a fine-tuned Mixtral-8x7B model for continuous note generation with a DeBERTa-based classifier to filter trivial content. Agent edits refine the online notes generation and regularly inform offline model retraining, closing the agent edits feedback loop. Deployed in production, our system achieved a 3% reduction in case handling time compared to bulk summarization (with reductions of up to 9% in highly complex cases), alongside high agent satisfaction ratings from surveys. These results demonstrate that incremental summarization with continuous feedback effectively enhances summary quality and agent productivity at scale.

[49] How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

Leonardo Bertolazzi, Sandro Pezzelle, Raffaelle Bernardi

Main category: cs.CL

TL;DR: LLMs exhibit content effects where semantic plausibility biases logical validity judgments, similar to humans. The study shows validity and plausibility are linearly represented and aligned in LLMs’ internal representations, causing conflation between the two concepts.

Details

Motivation: To understand the mechanisms behind content effects in LLMs and investigate how validity and plausibility concepts are encoded in their internal representations, since this phenomenon in humans is explained by dual-process theory but remains unclear in LLMs.

Method: Analyzed how LLMs encode validity and plausibility concepts, used steering vectors to demonstrate causal influence between plausibility and validity judgments, measured alignment between concepts, and constructed debiasing vectors to disentangle them.

Result: Found that validity and plausibility are linearly represented and strongly aligned in LLMs’ representations, leading to conflation. Plausibility vectors can causally bias validity judgments and vice versa. Degree of alignment predicts behavioral content effects. Debiasing vectors successfully reduced content effects and improved reasoning accuracy.

Conclusion: The study advances understanding of how abstract logical concepts are represented in LLMs and demonstrates that representational interventions can reduce content effects, providing a path toward more logical AI systems.

Abstract: Both humans and large language models (LLMs) exhibit content effects: biases in which the plausibility of the semantic content of a reasoning problem influences judgments regarding its logical validity. While this phenomenon in humans is best explained by the dual-process theory of reasoning, the mechanisms behind content effects in LLMs remain unclear. In this work, we address this issue by investigating how LLMs encode the concepts of validity and plausibility within their internal representations. We show that both concepts are linearly represented and strongly aligned in representational geometry, leading models to conflate plausibility with validity. Using steering vectors, we demonstrate that plausibility vectors can causally bias validity judgements, and vice versa, and that the degree of alignment between these two concepts predicts the magnitude of behavioral content effects across models. Finally, we construct debiasing vectors that disentangle these concepts, reducing content effects and improving reasoning accuracy. Our findings advance understanding of how abstract logical concepts are represented in LLMs and highlight representational interventions as a path toward more logical systems.

[50] Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

Miao Lu, Weiwei Sun, Weihua Du, Zhan Ling, Xuesong Yao, Kang Liu, Jiecao Chen

Main category: cs.CL

TL;DR: SUPO introduces summarization-based context management for RL fine-tuning of LLM agents, enabling long-horizon multi-turn tool use beyond fixed context limits by optimizing both tool-use behaviors and summarization strategies end-to-end.

Details

Motivation: Address context length bottleneck in RL fine-tuning of LLM agents for long-horizon multi-turn tool use, overcoming degraded instruction following, excessive rollout costs, and strict context limits in existing pipelines.

Method: Periodically compress tool-using history with LLM-generated summaries to retain task-relevant information, enabling compact context and scaling beyond fixed context windows. Derive policy gradient representation for end-to-end optimization of both tool-use behaviors and summarization strategies.

Result: SUPO significantly improves success rate while maintaining same or lower working context length compared to baselines on interactive function calling and searching tasks. For complex searching tasks, performance improves further when scaling test-time summarization beyond training time.

Conclusion: Summarization-based context management is a principled and scalable approach for training RL agents beyond fixed context length limits.

Abstract: We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and most importantly, strict context limits. To address these challenges, we introduce summarization-based context management to training. In specific, it periodically compresses the tool using history by LLM-generated summaries that retain task-relevant information to keep a compact context while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that seamlessly enables standard LLM RL infrastructures to optimize both tool-use behaviors as well as summarization strategies in an end-to-end fashion. We instantiate this framework with \underline{SU}mmarization augmented \underline{P}olicy \underline{O}ptimization (\texttt{SUPO}), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function calling and searching tasks demonstrate that \texttt{SUPO} significantly improves the success rate while maintaining the same or even lower working context length compared to baselines. We also demonstrate that for complex searching tasks, \texttt{SUPO} can further improve the evaluation performance when scaling test-time maximum round of summarization beyond that of training time. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.

[51] PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

Manuel Frank, Haithem Afli

Main category: cs.CL

TL;DR: PTEB introduces a dynamic evaluation protocol using LLM-generated paraphrases to test sentence embedding robustness, showing performance sensitivity to token variations despite preserved semantics.

Details

Motivation: Static benchmarks like MTEB can inflate performance metrics and obscure real-world robustness due to repeated tuning on fixed test sets.

Method: Uses LLM-based method to generate semantically preserving paraphrases at evaluation time, aggregating results across multiple stochastic runs.

Result: Shows sentence encoders are sensitive to token space changes even when semantics are fixed, with smaller models not disproportionately affected compared to larger ones.

Conclusion: Proposes shifting NLP evaluation from static benchmarks to dynamic, stochastic evaluation leveraging eval-time compute for more robust performance assessment.

Abstract: Current evaluations of sentence embedding models typically rely on static test beds such as the Massive Text Embedding Benchmark (MTEB). While invaluable, repeated tuning on a fixed suite can inflate reported performance and obscure real-world robustness. We introduce the Paraphrasing Text Embedding Benchmark (PTEB), a dynamic protocol that stochastically generates meaning-preserving paraphrases at evaluation time and aggregates results across multiple runs. Using a cost-efficient LLM-based method grounded in semantic textual similarity gold ratings, we show that LLMs generate token-diverse but semantically preserving, paraphrases. Across 7 MTEB tasks, we validate our hypothesis that the performance of sentence encoders is sensitive to changes in token space even when semantics remain fixed. We also observe that smaller models are not disproportionately affected relative to larger ones. Our results are statistically robust over multiple runs and we extended our experiments to 3 multilingual datasets covering 10 languages. More generally, we aim to propose a new evaluation paradigm in NLP that relies less on static, pre-defined benchmarks but shifts towards dynamic, stochastic evaluation leveraging eval-time compute.

[52] Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

Tiancheng Xing, Jerry Li, Yixuan Du, Xiyang Hu

Main category: cs.CL

TL;DR: RAF is a two-stage token optimization method that crafts natural-sounding prompts to manipulate LLM-based reranking systems, consistently promoting target items while remaining hard to detect.

Details

Motivation: To expose the vulnerability of LLM-based reranking systems, which can be easily manipulated by small textual perturbations, highlighting security concerns for modern retrieval systems.

Method: Two-stage token optimization: Stage 1 uses Greedy Coordinate Gradient to shortlist candidate tokens combining gradient of rank-target with readability score; Stage 2 evaluates candidates under exact ranking and readability losses using entropy-based dynamic weighting and temperature-controlled sampling.

Result: RAF significantly boosts target item ranks using naturalistic language across multiple LLMs, with greater robustness than existing methods in both promotion effectiveness and naturalness preservation.

Conclusion: LLM-based reranking is inherently susceptible to adversarial manipulation, raising critical security implications for the trustworthiness and robustness of modern retrieval systems.

Abstract: Large language models (LLMs) are increasingly used as rerankers in information retrieval, yet their ranking behavior can be steered by small, natural-sounding prompts. To expose this vulnerability, we present Rank Anything First (RAF), a two-stage token optimization method that crafts concise textual perturbations to consistently promote a target item in LLM-generated rankings while remaining hard to detect. Stage 1 uses Greedy Coordinate Gradient to shortlist candidate tokens at the current position by combining the gradient of the rank-target with a readability score; Stage 2 evaluates those candidates under exact ranking and readability losses using an entropy-based dynamic weighting scheme, and selects a token via temperature-controlled sampling. RAF generates ranking-promoting prompts token-by-token, guided by dual objectives: maximizing ranking effectiveness and preserving linguistic naturalness. Experiments across multiple LLMs show that RAF significantly boosts the rank of target items using naturalistic language, with greater robustness than existing methods in both promoting target items and maintaining naturalness. These findings underscore a critical security implication: LLM-based reranking is inherently susceptible to adversarial manipulation, raising new challenges for the trustworthiness and robustness of modern retrieval systems. Our code is available at: https://github.com/glad-lab/RAF.

[53] AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

Boyi Zeng, Lin Chen, Ziwei He, Xinbing Wang, Zhouhan Lin

Main category: cs.CL

TL;DR: A training-free fingerprinting method using weight matrices and Linear Assignment Problem with unbiased Centered Kernel Alignment to identify if LLMs are derived from base models, achieving perfect classification metrics with near-zero false positives.

Details

Motivation: Protecting intellectual property of LLMs is crucial due to substantial training resources, and current post-training processes make reliable identification challenging.

Method: Training-free fingerprinting based on weight matrices using Linear Assignment Problem and unbiased Centered Kernel Alignment similarity to neutralize parameter manipulation effects.

Result: Achieved perfect classification metrics on 60 positive and 90 negative model pairs, with exceptional robustness against all six post-training categories and near-zero false positive risk.

Conclusion: The method establishes a strong basis for reliable model lineage verification and completes computation within 30s on an NVIDIA 3090 GPU.

Abstract: Protecting the intellectual property of large language models (LLMs) is crucial, given the substantial resources required for their training. Consequently, there is an urgent need for both model owners and third parties to determine whether a suspect LLM is trained from scratch or derived from an existing base model. However, the intensive post-training processes that models typically undergo-such as supervised fine-tuning, extensive continued pretraining, reinforcement learning, multi-modal extension, pruning, and upcycling-pose significant challenges to reliable identification. In this work, we propose a training-free fingerprinting method based on weight matrices. We leverage the Linear Assignment Problem (LAP) and an unbiased Centered Kernel Alignment (CKA) similarity to neutralize the effects of parameter manipulations, yielding a highly robust and high-fidelity similarity metric. On a comprehensive testbed of 60 positive and 90 negative model pairs, our method demonstrates exceptional robustness against all six aforementioned post-training categories while exhibiting a near-zero risk of false positives. By achieving perfect scores on all classification metrics, our approach establishes a strong basis for reliable model lineage verification. Moreover, the entire computation completes within 30s on an NVIDIA 3090 GPU. The code is available at https://github.com/LUMIA-Group/AWM.

[54] TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs

I-Fan Lin, Faegheh Hasibi, Suzan Verberne

Main category: cs.CL

TL;DR: A training-free, label-free method for short text clustering that works with any embedder, using iterative vector updating with LLM guidance to achieve competitive results without needing labeled data or prior cluster knowledge.

Details

Motivation: Companies using customer-facing chatbots need to cluster large amounts of user utterances by intent, but typically have no labeled data and unknown number of clusters in commercial settings.

Method: Iterative vector updating: constructs sparse vectors based on representative texts and refines them through LLM guidance, working with any embedder and requiring no training or labels.

Result: Achieves comparable or superior results to state-of-the-art contrastive learning methods, works with diverse datasets and smaller LLMs, scales to large datasets while reducing computational costs.

Conclusion: The method is model-agnostic, scalable, and better aligned with real-world scenarios than existing clustering methods due to its low-resource requirements and adaptability.

Abstract: In this paper, we propose a training-free and label-free method for short text clustering that can be used on top of any existing embedder. In the context of customer-facing chatbots, companies are dealing with large amounts of user utterances that need to be clustered according to their intent. In these commercial settings, no labeled data is typically available, and the number of clusters is not known. Our method is based on iterative vector updating: it constructs sparse vectors based on representative texts, and then iteratively refines them through LLM guidance. Our method achieves comparable or superior results to state-of-the-art methods that use contrastive learning, but without assuming prior knowledge of clusters or labels. Experiments on diverse datasets and smaller LLMs show that our method is model agnostic and can be applied to any embedder, with relatively small LLMs, and different clustering methods. We also show that our method scales to large datasets, reducing the computational cost of the LLM. These low-resource, adaptable settings and the scalability of our method make it more aligned with real-world scenarios than existing clustering methods.

[55] A Formal Framework for Fluency-based Multi-Reference Evaluation in Grammatical Error Correction

Eitan Klinger, Zihao Huang, Tran Minh Nguyen, Emma Jayeon Park, Yige Chen, Yang Gu, Qingyu Gao, Siliang Liu, Mengyang Qiu, Jungyeul Park

Main category: cs.CL

TL;DR: The paper introduces a fluency-based multi-reference evaluation framework for grammatical error correction that uses n-gram similarity with different aggregation strategies to handle diverse valid corrections.

Details

Motivation: Existing evaluation metrics for grammatical error correction are edit-based, English-centric, and rely on rigid alignments, limiting their applicability in multilingual and generative settings where multiple valid corrections exist.

Method: Proposes a formal framework for fluency-based multi-reference evaluation using n-gram similarity with four aggregation strategies: select-best, simple-average, weighted-average, and merged-counts.

Result: Empirical evaluation on Czech, Estonian, Ukrainian, and Chinese corpora shows the strategies capture complementary aspects of fluency and coverage, providing a principled approach to handle linguistic diversity.

Conclusion: The framework unifies multi-reference evaluation into a principled, fluency-oriented approach that incorporates linguistic diversity without penalizing legitimate variation in corrections.

Abstract: Evaluating grammatical error correction requires metrics that reflect the diversity of valid human corrections rather than privileging a single reference. Existing frameworks, largely edit-based and English-centric, rely on rigid alignments between system and reference edits, limiting their applicability in multilingual and generative settings. This paper introduces a formal framework for \textit{fluency-based multi-reference evaluation}, framing $n$-gram similarity as an aggregation problem over multiple legitimate corrections. Within this formulation, we instantiate GLEU through four aggregation strategies–\textsc{select-best}, \textsc{simple-average}, \textsc{weighted-average}, and \textsc{merged-counts}–and analyze their properties of boundedness, monotonicity, and sensitivity to reference variation. Empirical results on Czech, Estonian, Ukrainian, and Chinese corpora show that these strategies capture complementary aspects of fluency and coverage. The framework unifies multi-reference evaluation into a principled, fluency-oriented approach that incorporates linguistic diversity without penalizing legitimate variation.

[56] Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs

Jaeseong Lee, Dayoung Kwon, seung-won hwang

Main category: cs.CL

TL;DR: Proposes a training-free regulation method to prevent overthinking in Large Reasoning Models by selectively unlearning from LRM at inference using low-rank projections based on singular value energy analysis.

Details

Motivation: Large Reasoning Models suffer from overthinking which degrades performance and wastes computational resources, while deploying multiple models (LLM + LRM) for routing is costly and impractical.

Method: Superposed deployment strategy with lightweight, training-free regulation that switches models on/off. Uses cumulative energy of singular values to identify optimal low-rank projections for selective unlearning from LRM at inference.

Result: Enables scaling down computation while preserving reasoning capabilities by adjusting reasoning ‘just right’ without the need for multiple model deployments.

Conclusion: The proposed method provides an efficient alternative to multi-model routing by optimizing inference through selective unlearning and low-rank projections, addressing overthinking issues in Large Reasoning Models.

Abstract: Large Reasoning Models (LRMs) excel in structured tasks by emulating deliberate human reasoning but often suffer from overthinking, degrading performance and wasting resources. One possible baseline is to deploy both LLM and LRM, then route input by predicting whether it requires reasoning and may cause overthinking. However, deploying multiple models can be costly or impractical. We propose a superposed deployment strategy with a lightweight, training-free regulation to optimize inference by switching one model on and off. Instead of routing, we selectively unlearn from LRM at inference, scaling down computation while preserving reasoning. By analyzing the cumulative energy of singular values, we identify optimal low-rank projections to adjust reasoning just right.

[57] Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition

Lei Xu, Pierre Beckmann, Marco Valentino, André Freitas

Main category: cs.CL

TL;DR: An adaptive neuro-symbolic framework that automatically identifies formal reasoning strategies from natural language problems and dynamically selects specialized logical solvers, outperforming state-of-the-art models.

Details

Motivation: Current neuro-symbolic NLP methods are static with predetermined solver integration, limiting the ability to employ diverse formal inference strategies for different reasoning tasks.

Method: An adaptive multi-paradigm neuro-symbolic inference framework that: (1) automatically identifies formal reasoning strategies from natural language problems, and (2) dynamically selects and applies specialized formal logical solvers via autoformalization interfaces.

Result: LLMs achieve over 90% accuracy in predicting formal reasoning strategies. The framework outperforms GPT-4o by 27% and DeepSeek-V3.1 by 6%. Adaptive reasoning also improves pure LLM methods by 10%, 5%, and 6% on zero-shot, CoT, and symbolic CoT settings respectively.

Conclusion: This work establishes foundations for adaptive LLM-symbolic reasoning, offering a path to unify material and formal inferences on heterogeneous reasoning challenges, with post-training helping smaller models improve.

Abstract: Neuro-symbolic NLP methods aim to leverage the complementary strengths of large language models and formal logical solvers. However, current approaches are mostly static in nature, i.e., the integration of a target solver is predetermined at design time, hindering the ability to employ diverse formal inference strategies. To address this, we introduce an adaptive, multi-paradigm, neuro-symbolic inference framework that: (1) automatically identifies formal reasoning strategies from problems expressed in natural language; and (2) dynamically selects and applies specialized formal logical solvers via autoformalization interfaces. Extensive experiments on individual and multi-paradigm reasoning tasks support the following conclusions: LLMs are effective at predicting the necessary formal reasoning strategies with an accuracy above 90 percent. This enables flexible integration with formal logical solvers, resulting in our framework outperforming competing baselines by 27 percent and 6 percent compared to GPT-4o and DeepSeek-V3.1, respectively. Moreover, adaptive reasoning can even positively impact pure LLM methods, yielding gains of 10, 5, and 6 percent on zero-shot, CoT, and symbolic CoT settings with GPT-4o. Finally, although smaller models struggle with adaptive neuro-symbolic reasoning, post-training offers a viable path to improvement. Overall, this work establishes the foundations for adaptive LLM-symbolic reasoning, offering a path forward for unifying material and formal inferences on heterogeneous reasoning challenges.

[58] Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

Luca Giordano, Simon Razniewski

Main category: cs.CL

TL;DR: Systematic study of LLM knowledge materialization using miniGPTKBs shows high termination rates, mixed reproducibility, and varying robustness across different perturbation types.

Details

Motivation: To measure and systematize the factual knowledge encoded in LLMs, particularly focusing on whether extraction can terminate, outputs are reproducible, and how robust they are to variations.

Method: Used miniGPTKBs (domain-specific subcrawls) to analyze termination, reproducibility, and robustness across yield, lexical similarity, and semantic similarity metrics. Experimented with four variations (seed, language, randomness, model) across three domains (history, entertainment, finance).

Result: Found (i) high termination rates (model-dependent), (ii) mixed reproducibility, and (iii) robustness varying by perturbation type: high for seeds and temperature, lower for languages and models.

Conclusion: LLM knowledge materialization can reliably surface core knowledge but has important limitations in reproducibility and robustness to certain types of variations.

Abstract: Large Language Models (LLMs) encode substantial factual knowledge, yet measuring and systematizing this knowledge remains challenging. Converting it into structured format, for example through recursive extraction approaches such as the GPTKB methodology (Hu et al., 2025b), is still underexplored. Key open questions include whether such extraction can terminate, whether its outputs are reproducible, and how robust they are to variations. We systematically study LLM knowledge materialization using miniGPTKBs (domain-specific, tractable subcrawls), analyzing termination, reproducibility, and robustness across three categories of metrics: yield, lexical similarity, and semantic similarity. We experiment with four variations (seed, language, randomness, model) and three illustrative domains (from history, entertainment, and finance). Our findings show (i) high termination rates, though model-dependent; (ii) mixed reproducibility; and (iii) robustness that varies by perturbation type: high for seeds and temperature, lower for languages and models. These results suggest that LLM knowledge materialization can reliably surface core knowledge, while also revealing important limitations.

[59] Overview of the Plagiarism Detection Task at PAN 2025

André Greiner-Petter, Maik Fröbe, Jan Philip Wahle, Terry Ruas, Bela Gipp, Akiko Aizawa, Martin Potthast

Main category: cs.CL

TL;DR: The PAN 2025 task focuses on detecting AI-generated plagiarism in scientific articles using LLMs (Llama, DeepSeek-R1, Mistral) and aligning them with sources. Current methods show promise but lack generalizability.

Details

Motivation: To address the growing challenge of automatically generated textual plagiarism in scientific articles using modern large language models.

Method: Created a novel large-scale dataset using three LLMs (Llama, DeepSeek-R1, Mistral), evaluated participant approaches and four baselines, and compared performance with PAN 2015 dataset.

Result: Naive semantic similarity approaches achieved up to 0.8 recall and 0.5 precision on the new dataset, but most approaches significantly underperformed on the 2015 dataset, showing poor generalizability.

Conclusion: Current approaches for detecting AI-generated plagiarism show promising results on specific datasets but lack robustness and generalizability across different datasets and time periods.

Abstract: The generative plagiarism detection task at PAN 2025 aims at identifying automatically generated textual plagiarism in scientific articles and aligning them with their respective sources. We created a novel large-scale dataset of automatically generated plagiarism using three large language models: Llama, DeepSeek-R1, and Mistral. In this task overview paper, we outline the creation of this dataset, summarize and compare the results of all participants and four baselines, and evaluate the results on the last plagiarism detection task from PAN 2015 in order to interpret the robustness of the proposed approaches. We found that the current iteration does not invite a large variety of approaches as naive semantic similarity approaches based on embedding vectors provide promising results of up to 0.8 recall and 0.5 precision. In contrast, most of these approaches underperform significantly on the 2015 dataset, indicating a lack in generalizability.

[60] BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

Philipp Mondorf, Mingyang Wang, Sebastian Gerstner, Ahmad Dawar Hakimi, Yihong Liu, Leonor Veloso, Shijia Zhou, Hinrich Schütze, Barbara Plank

Main category: cs.CL

TL;DR: This paper investigates ensembling circuit localization methods to improve performance on the Mechanistic Interpretability Benchmark, exploring both parallel and sequential approaches that yield notable gains.

Details

Motivation: To improve circuit localization performance in large language models by combining multiple methods through ensembling, as single methods may have limitations in identifying subnetworks responsible for specific task behaviors.

Method: Two ensembling variants: parallel (combining attribution scores from different methods via averaging, min/max) and sequential (using EAP-IG scores as warm start for edge pruning). Also combines both approaches in a final ensemble.

Result: Both parallel and sequential ensembling approaches yield notable gains on benchmark metrics, with the combined ensemble achieving the best results compared to official baselines across multiple model-task combinations.

Conclusion: Ensembling circuit localization methods, particularly combining parallel and sequential approaches, provides a more precise circuit identification method and achieves superior performance on the MIB benchmark.

Abstract: The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine attribution scores assigned to each edge by different methods-e.g., by averaging or taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to a more precise circuit identification approach. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to official baselines across multiple model-task combinations.

[61] Adaptive Tool Generation with Models as Tools and Reinforcement Learning

Chenpeng Wang, Xiaojie Cheng, Chunye Wang, Linfeng Yang, Lei Zhang

Main category: cs.CL

TL;DR: MTR is a simulation-first training framework that enables tool-augmented language models to learn effective reasoning from structured traces without live API access, achieving competitive performance on multi-hop QA benchmarks.

Details

Motivation: Current tool-augmented language models face scalability and reliability challenges due to their dependence on live API access during training and deployment.

Method: Multi-agent architecture with ToolMaker, AutoAgent, and ToolActor components; two-stage training with SFT for trace grammar and GRPO with composite trace reward balancing answer correctness and internal consistency.

Result: Achieves competitive Exact Match scores to live-API systems on four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle) and excels on reasoning-intensive tasks.

Conclusion: Effective tool reasoning can be learned from structured traces without live interactions, providing a scalable and reliable alternative to live-API systems.

Abstract: Tool-augmented language models have demonstrated strong capabilities, but their reliance on live API access creates scalability and reliability challenges during training and deployment. We propose MTR, a simulation-first training framework for tool-augmented reasoning. Instead of relying on live APIs, MTR learns from complete ReAct traces with schema-validated, simulated observations. Our approach operates through a multi-agent architecture where a ToolMaker generates task-specific, OpenAI-compatible tool interfaces, an AutoAgent produces structured think-act-observe sequences, and a ToolActor simulates realistic responses. Training proceeds in two stages: Stage-1 Supervised Fine-Tuning (SFT) teaches ’trace grammar’ from complete reasoning sequences; Stage-2 Group Relative Policy Optimization (GRPO) optimizes strategy with a composite trace reward that balances answer correctness and internal consistency. Across four multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA, Bamboogle), MTR attains competitive Exact Match (EM) scores to live-API systems and excels on reasoning-intensive tasks, suggesting that effective tool reasoning can be learned from structured traces without live interactions.

[62] Mid-Training of Large Language Models: A Survey

Kaixiang Mo, Yuxin Shi, Weiwei Weng, Zhiqiang Zhou, Shuman Liu, Haibo Zhang, Anxiang Zeng

Main category: cs.CL

TL;DR: This paper introduces the first comprehensive survey of mid-training as a unified paradigm in LLM development, proposing a taxonomy and analyzing its effectiveness through gradient noise scale, information bottleneck, and curriculum learning principles.

Details

Motivation: Despite widespread use in state-of-the-art LLM systems, there has been no prior survey of mid-training as a unified paradigm, creating a gap in understanding this crucial intermediate stage between pre-training and fine-tuning.

Method: The authors introduce the first taxonomy of LLM mid-training spanning data distribution, learning-rate scheduling, and long-context extension. They compile evaluation benchmarks and report gains to enable structured comparisons across models.

Result: The paper demonstrates that mid-training mitigates diminishing returns from noisy tokens, stabilizes convergence, and expands model capability in late training through multiple annealing-style phases that refine data quality, adapt optimization schedules, and extend context length.

Conclusion: Mid-training is an effective intermediate stage that promotes generalization and abstraction in LLMs. The paper identifies open challenges and proposes avenues for future research and practice in this emerging paradigm.

Abstract: Large language models (LLMs) are typically developed through large-scale pre-training followed by task-specific fine-tuning. Recent advances highlight the importance of an intermediate mid-training stage, where models undergo multiple annealing-style phases that refine data quality, adapt optimization schedules, and extend context length. This stage mitigates diminishing returns from noisy tokens, stabilizes convergence, and expands model capability in late training. Its effectiveness can be explained through gradient noise scale, the information bottleneck, and curriculum learning, which together promote generalization and abstraction. Despite widespread use in state-of-the-art systems, there has been no prior survey of mid-training as a unified paradigm. We introduce the first taxonomy of LLM mid-training spanning data distribution, learning-rate scheduling, and long-context extension. We distill practical insights, compile evaluation benchmarks, and report gains to enable structured comparisons across models. We also identify open challenges and propose avenues for future research and practice.

[63] GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics

Giorgos Filandrianos, Orfeas Menis Mastromichalakis, Wafaa Mohammed, Giuseppe Attanasio, Chrysoula Zerva

Main category: cs.CL

TL;DR: This paper introduces a large-scale challenge set to evaluate gender bias in automatic quality estimation (QE) metrics for machine translation, focusing on gender-ambiguous occupational terms across 33 language pairs.

Details

Motivation: Gender bias in machine translation systems is well-documented, but bias in automatic quality estimation metrics remains underexplored, with existing studies limited by small datasets, narrow occupational coverage, and restricted language variety.

Method: The authors create a large-scale challenge set based on the GAMBIT corpus, extending coverage to three source languages (genderless or natural-gendered) and eleven target languages with grammatical gender, resulting in 33 language pairs. Each source text is paired with two target versions differing only in grammatical gender of occupational terms.

Result: The dataset enables fine-grained bias analysis by occupation and systematic comparisons across languages, providing a comprehensive framework to test whether QE metrics assign equal scores to masculine and feminine versions of translations.

Conclusion: This work addresses the gap in gender bias research for QE metrics by providing a scalable, parallel dataset that can systematically evaluate bias across multiple languages and occupations.

Abstract: Gender bias in machine translation (MT) systems has been extensively documented, but bias in automatic quality estimation (QE) metrics remains comparatively underexplored. Existing studies suggest that QE metrics can also exhibit gender bias, yet most analyses are limited by small datasets, narrow occupational coverage, and restricted language variety. To address this gap, we introduce a large-scale challenge set specifically designed to probe the behavior of QE metrics when evaluating translations containing gender-ambiguous occupational terms. Building on the GAMBIT corpus of English texts with gender-ambiguous occupations, we extend coverage to three source languages that are genderless or natural-gendered, and eleven target languages with grammatical gender, resulting in 33 source-target language pairs. Each source text is paired with two target versions differing only in the grammatical gender of the occupational term(s) (masculine vs. feminine), with all dependent grammatical elements adjusted accordingly. An unbiased QE metric should assign equal or near-equal scores to both versions. The dataset’s scale, breadth, and fully parallel design, where the same set of texts is aligned across all languages, enables fine-grained bias analysis by occupation and systematic comparisons across languages.

[64] SID: Multi-LLM Debate Driven by Self Signals

Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu

Main category: cs.CL

TL;DR: SID introduces a self-signals driven multi-LLM debate framework that uses model-level confidence and token-level semantic focus to optimize debate efficiency and accuracy.

Details

Motivation: Existing Multi-LLM Agent Debate methods focus on external structures and neglect internal self signals like token logits and attention, leading to redundant computation and performance degradation.

Method: Leverages two self-signals: model-level confidence for early exit of high-confidence agents and token-level semantic focus using attention mechanism to compress redundant debate content.

Result: Outperforms existing MAD techniques in accuracy while reducing token consumption across various LLMs and Multimodal LLMs on multiple challenging benchmarks.

Conclusion: Utilizing self signals effectively enhances both performance and efficiency of multi-agent debate systems.

Abstract: Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at~\href{https://github.com/xuhang2019/SID}{\texttt{https://github.com/xuhang2019/SID}}.

[65] OpenJAI-v1.0: An Open Thai Large Language Model

Pontakorn Trakuekul, Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Sumana Sumanakul

Main category: cs.CL

TL;DR: OpenJAI-v1.0 is an open-source Thai-English LLM built on Qwen3-14B, optimized for instruction following, long-context understanding, and tool use, outperforming other Thai models while avoiding catastrophic forgetting.

Details

Motivation: To create an improved open-source NLP resource for the Thai AI community that enhances performance on practical tasks in both Thai and English languages.

Method: Developed from Qwen3-14B model using carefully curated data across three key use cases: instruction following, long-context understanding, and tool use.

Result: OpenJAI-v1.0 improves on base model capabilities and outperforms other leading open-source Thai models on diverse benchmarks while avoiding catastrophic forgetting.

Conclusion: The model is publicly released as an alternative NLP resource for the Thai AI community, demonstrating enhanced performance across multiple practical tasks.

Abstract: We introduce OpenJAI-v1.0, an open-source large language model for Thai and English, developed from the Qwen3-14B model. Our work focuses on boosting performance on practical tasks through carefully curated data across three key use cases: instruction following, long-context understanding, and tool use. Evaluation results show that OpenJAI-v1.0 improves on the capabilities of its base model and outperforms other leading open-source Thai models on a diverse suite of benchmarks, while avoiding catastrophic forgetting. OpenJAI-v1.0 is publicly released as another alternative NLP resource for the Thai AI community.

[66] Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding

Wafaa Mohammed, Vlad Niculae, Chrysoula Zerva

Main category: cs.CL

TL;DR: LLMs struggle with discourse phenomena in machine translation. The paper shows that discourse knowledge exists in LLMs and proposes quality-aware decoding (QAD) to extract this knowledge, improving translation quality and human alignment.

Details

Motivation: LLMs perform well in machine translation but struggle with discourse phenomena like pronoun resolution and lexical cohesion at document level, limiting their context-aware translation capabilities.

Method: Propose quality-aware decoding (QAD) to effectively extract the discourse knowledge encoded within LLMs, comparing it against other decoding approaches.

Result: QAD demonstrates superiority over other decoding methods, enhances semantic richness of translations, and aligns translations more closely with human preferences.

Conclusion: Discourse knowledge is encoded in LLMs and can be effectively extracted using quality-aware decoding to improve context-aware machine translation performance.

Abstract: Large language models (LLMs) have emerged as strong contenders in machine translation.Yet, they still struggle to adequately handle discourse phenomena, such as pronoun resolution and lexical cohesion at the document level. In this study, we thoroughly investigate the discourse phenomena performance of LLMs in context-aware translation. We demonstrate that discourse knowledge is encoded within LLMs and propose the use of quality-aware decoding (QAD) to effectively extract this knowledge, showcasing its superiority over other decoding approaches through comprehensive analysis. Furthermore, we illustrate that QAD enhances the semantic richness of translations and aligns them more closely with human preferences.

[67] $λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

Yining Wang, Jinman Zhao, Chuangxin Zhao, Shuhao Guan, Gerald Penn, Shinan Liu

Main category: cs.CL

TL;DR: The paper introduces λ-GRPO, a method that learns adaptive token-level weighting to address length bias in RLHF approaches for LLM reasoning, achieving consistent improvements over existing methods without additional computational cost.

Details

Motivation: Existing RLHF methods like GRPO suffer from length bias where longer responses disproportionately influence gradient updates, and current solutions like DAPO and Dr. GRPO are heuristic with limited interpretability.

Method: Proposes λ-GRPO which unifies existing frameworks and introduces a learnable parameter λ that adaptively controls token-level weighting during optimization, allowing the model to learn its own token preferences.

Result: λ-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks, with +1.9%, +1.0%, and +1.7% average accuracy improvements on Qwen2.5 models with 1.5B, 3B, and 7B parameters respectively.

Conclusion: Learning token preferences through adaptive weighting is an effective and practical approach that improves reasoning capabilities without requiring training data modifications or additional computational costs.

Abstract: Reinforcement Learning with Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We use $\lambda$-GRPO to denote our method, and we find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $\lambda$-GRPO improves average accuracy by $+1.9%$, $+1.0%$, and $+1.7%$ compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.

[68] MeXtract: Light-Weight Metadata Extraction from Scientific Papers

Zaid Alyafeai, Maged S. Al-Shaibani, Bernard Ghanem

Main category: cs.CL

TL;DR: MeXtract is a family of lightweight language models (0.5B-3B parameters) that achieves state-of-the-art performance on metadata extraction from scientific papers, with strong generalization to unseen schemas.

Details

Motivation: Traditional metadata extraction approaches struggle with generalization across domains and schema variations, creating a need for more robust and adaptable solutions.

Method: Fine-tuned Qwen 2.5 models to create MeXtract family, extended MOLE benchmark with model-specific metadata for evaluation, and tested transfer learning to unseen schemas.

Result: MeXtract achieves state-of-the-art performance on metadata extraction in its size family, with fine-tuning on one schema effectively transferring to unseen schemas.

Conclusion: The approach demonstrates robustness and adaptability for metadata extraction, with all code, datasets, and models released openly for research community use.

Abstract: Metadata plays a critical role in indexing, documenting, and analyzing scientific literature, yet extracting it accurately and efficiently remains a challenging task. Traditional approaches often rely on rule-based or task-specific models, which struggle to generalize across domains and schema variations. In this paper, we present MeXtract, a family of lightweight language models designed for metadata extraction from scientific papers. The models, ranging from 0.5B to 3B parameters, are built by fine-tuning Qwen 2.5 counterparts. In their size family, MeXtract achieves state-of-the-art performance on metadata extraction on the MOLE benchmark. To further support evaluation, we extend the MOLE benchmark to incorporate model-specific metadata, providing an out-of-domain challenging subset. Our experiments show that fine-tuning on a given schema not only yields high accuracy but also transfers effectively to unseen schemas, demonstrating the robustness and adaptability of our approach. We release all the code, datasets, and models openly for the research community.

[69] LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

Zecheng Tang, Baibei Ji, Quantong Qiu, Haitian Wang, Xiaobo Liang, Juntao Li, Min Zhang

Main category: cs.CL

TL;DR: Long-RewardBench is a new benchmark for evaluating reward models on long-context scenarios, showing current models struggle with context-response consistency. The authors propose a multi-stage training strategy that creates robust Long-context RMs (LongRMs) that outperform larger models.

Details

Motivation: Current reward models are limited to short contexts and focus mainly on response-level attributes, neglecting long context-response consistency which is crucial for real-world applications like LLM agents.

Method: Proposed a general multi-stage training strategy to scale arbitrary models into robust Long-context RMs (LongRMs), using the Long-RewardBench benchmark with Pairwise Comparison and Best-of-N tasks.

Result: The 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of proprietary Gemini 2.5 Pro model, while preserving strong short-context capability.

Conclusion: The proposed multi-stage training approach effectively creates robust long-context reward models that address the fragility of current models in long-context scenarios.

Abstract: Reward model (RM) plays a pivotal role in aligning large language model (LLM) with human preferences. As real-world applications increasingly involve long history trajectories, e.g., LLM agent, it becomes indispensable to evaluate whether a model’s responses are not only high-quality but also grounded in and consistent with the provided context. Yet, current RMs remain confined to short-context settings and primarily focus on response-level attributes (e.g., safety or helpfulness), while largely neglecting the critical dimension of long context-response consistency. In this work, we introduce Long-RewardBench, a benchmark specifically designed for long-context RM evaluation, featuring both Pairwise Comparison and Best-of-N tasks. Our preliminary study reveals that even state-of-the-art generative RMs exhibit significant fragility in long-context scenarios, failing to maintain context-aware preference judgments. Motivated by the analysis of failure patterns observed in model outputs, we propose a general multi-stage training strategy that effectively scales arbitrary models into robust Long-context RMs (LongRMs). Experiments show that our approach not only substantially improves performance on long-context evaluation but also preserves strong short-context capability. Notably, our 8B LongRM outperforms much larger 70B-scale baselines and matches the performance of the proprietary Gemini 2.5 Pro model.

[70] EDUMATH: Generating Standards-aligned Educational Math Word Problems

Bryan R. Christ, Penelope Molitz, Jonathan Kropko, Thomas Hartvigsen

Main category: cs.CL

TL;DR: LLMs can generate customized math word problems aligned with educational standards, with a 12B open model matching larger models’ performance and a 30B model outperforming closed baselines using teacher-annotated data.

Details

Motivation: Teachers struggle to customize math word problems for individual students due to large class sizes and burnout, while personalized problems can improve learning outcomes.

Method: Used joint human expert-LLM evaluation of 11,000+ MWPs, created teacher-annotated dataset, trained 12B and 30B open models with classifier, and conducted student study comparing customized vs human-written problems.

Result: 12B model matches larger models; 30B model outperforms closed baselines; generated MWPs are more similar to human-written ones; students perform similarly on both but prefer customized problems.

Conclusion: LLMs can effectively generate standards-aligned, customized math word problems that students prefer, supporting math education through personalization.

Abstract: Math word problems (MWPs) are critical K-12 educational tools, and customizing them to students’ interests and ability levels can increase learning outcomes. However, teachers struggle to find time to customize MWPs for each student given large class sizes and increasing burnout. We propose that LLMs can support math education by generating MWPs customized to student interests and math education standards. To this end, we use a joint human expert-LLM judge approach to evaluate over 11,000 MWPs generated by open and closed LLMs and develop the first teacher-annotated dataset for standards-aligned educational MWP generation. We show the value of our data by using it to train a 12B open model that matches the performance of larger and more capable open models. We also use our teacher-annotated data to train a text classifier that enables a 30B open LLM to outperform existing closed baselines without any training. Next, we show our models’ MWPs are more similar to human-written MWPs than those from existing models. We conclude by conducting the first study of customized LLM-generated MWPs with grade school students, finding they perform similarly on our models’ MWPs relative to human-written MWPs but consistently prefer our customized MWPs.

Geng Liu, Feng Li, Junjie Mu, Mengxiao Zhu, Francesco Pierri

Main category: cs.CL

TL;DR: Chinese LLMs show systematic ingroup-positive and outgroup-negative biases across 240 social groups, with these biases intensifying in real user interactions.

Details

Motivation: To investigate social identity biases in Chinese LLMs as they are increasingly deployed in user-facing applications, raising concerns about reflecting and amplifying social biases.

Method: Used Mandarin-specific prompts across ten Chinese LLMs to evaluate responses to ingroup (“We”) and outgroup (“They”) framings across 240 social groups, and analyzed real chatbot conversations from a corpus of user interactions.

Result: Systematic ingroup-positive and outgroup-negative tendencies observed across all models, with biases not confined to synthetic prompts but also appearing in naturalistic dialogue, indicating bias dynamics strengthen in real interactions.

Conclusion: Social identity biases documented in English LLMs generalize cross-linguistically to Chinese LLMs and intensify in user-facing contexts, providing a language-aware evaluation framework for Chinese LLMs.

Abstract: Large language models (LLMs) are increasingly deployed in user-facing applications, raising concerns about their potential to reflect and amplify social biases. We investigate social identity framing in Chinese LLMs using Mandarin-specific prompts across ten representative Chinese LLMs, evaluating responses to ingroup (“We”) and outgroup (“They”) framings, and extending the setting to 240 social groups salient in the Chinese context. To complement controlled experiments, we further analyze Chinese-language conversations from a corpus of real interactions between users and chatbots. Across models, we observe systematic ingroup-positive and outgroup-negative tendencies, which are not confined to synthetic prompts but also appear in naturalistic dialogue, indicating that bias dynamics might strengthen in real interactions. Our study provides a language-aware evaluation framework for Chinese LLMs, demonstrating that social identity biases documented in English generalize cross-linguistically and intensify in user-facing contexts.

[72] Towards Reliable Retrieval in RAG Systems for Large Legal Datasets

Markus Reuter, Tobias Lingenberg, Rūta Liepiņa, Francesca Lagioia, Marco Lippi, Giovanni Sartor, Andrea Passerini, Burcu Sayin

Main category: cs.CL

TL;DR: Summary-Augmented Chunking (SAC) improves legal RAG reliability by adding document-level summaries to text chunks, reducing Document-Level Retrieval Mismatch and improving retrieval accuracy.

Details

Motivation: RAG systems in legal applications suffer from Document-Level Retrieval Mismatch (DRM) where retrievers select information from wrong source documents due to structurally similar legal documents in large databases.

Method: Proposed Summary-Augmented Chunking (SAC) - enhances each text chunk with a document-level synthetic summary to provide global context lost in standard chunking. Compared generic summarization vs legal expert domain knowledge approaches.

Result: SAC significantly reduces DRM and improves text-level retrieval precision and recall. Generic summarization outperformed legal expert knowledge approach. Technique is practical, scalable, and easily integrable.

Conclusion: SAC is an effective method to enhance RAG reliability for large-scale legal document datasets by mitigating retrieval failures through document-level context preservation.

Abstract: Retrieval-Augmented Generation (RAG) is a promising approach to mitigate hallucinations in Large Language Models (LLMs) for legal applications, but its reliability is critically dependent on the accuracy of the retrieval step. This is particularly challenging in the legal domain, where large databases of structurally similar documents often cause retrieval systems to fail. In this paper, we address this challenge by first identifying and quantifying a critical failure mode we term Document-Level Retrieval Mismatch (DRM), where the retriever selects information from entirely incorrect source documents. To mitigate DRM, we investigate a simple and computationally efficient technique which we refer to as Summary-Augmented Chunking (SAC). This method enhances each text chunk with a document-level synthetic summary, thereby injecting crucial global context that would otherwise be lost during a standard chunking process. Our experiments on a diverse set of legal information retrieval tasks show that SAC greatly reduces DRM and, consequently, also improves text-level retrieval precision and recall. Interestingly, we find that a generic summarization strategy outperforms an approach that incorporates legal expert domain knowledge to target specific legal elements. Our work provides evidence that this practical, scalable, and easily integrable technique enhances the reliability of RAG systems when applied to large-scale legal document datasets.

[73] Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

Neel Prabhanjan Rachamalla, Aravind Konakalla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, Shubham Agarwal

Main category: cs.CL

TL;DR: The paper introduces a human-in-the-loop pipeline to create high-quality multilingual post-training datasets for Indian languages, addressing gaps in existing datasets through translation and synthetic expansion.

Details

Motivation: Existing open-source datasets lack multilingual coverage, cultural grounding, and task diversity, especially for Indian languages, limiting the effectiveness of Large Language Models.

Method: A human-in-the-loop pipeline combining translations with synthetic expansion to produce reliable and diverse Indic post-training data across 10 Indian languages covering 13 broad and 56 sub-categories.

Result: Created two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages, leveraging 57 diverse datasets with emphasis on task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and cultural nuance preservation.

Conclusion: The dataset protocol provides a foundation for more inclusive and effective multilingual LLMs by addressing often-overlooked dimensions in existing datasets.

Abstract: The effectiveness of Large Language Models (LLMs) depends heavily on the availability of high-quality post-training data, particularly instruction-tuning and preference-based examples. Existing open-source datasets, however, often lack multilingual coverage, cultural grounding, and suffer from task diversity gaps that are especially pronounced for Indian languages. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Using this pipeline, we curate two datasets: Pragyaan-IT (22.5K) and Pragyaan-Align (100K) across 10 Indian languages covering 13 broad and 56 sub-categories, leveraging 57 diverse datasets. Our dataset protocol incorporates several often-overlooked dimensions and emphasize task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance, providing a foundation for more inclusive and effective multilingual LLMs.

[74] Native Hybrid Attention for Efficient Sequence Modeling

Jusen Du, Jiaxi Hu, Tao Zhang, Weigao Sun, Yu Cheng

Main category: cs.CL

TL;DR: Native Hybrid Attention (NHA) combines linear and full attention in a unified layer design, maintaining long-term context via linear RNN and short-term tokens via sliding window, achieving better efficiency and accuracy than Transformers.

Details

Motivation: Transformers have quadratic complexity while linear attention sacrifices recall accuracy over long contexts. NHA aims to bridge this gap by creating a hybrid approach that maintains efficiency without compromising accuracy.

Method: NHA integrates intra and inter-layer hybridization using linear RNN for long-term context in key-value slots and sliding window for short-term tokens, with a single softmax attention operation over all keys and values.

Result: NHA outperforms Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks, and enables pretrained LLMs to achieve competitive accuracy with significant efficiency gains.

Conclusion: NHA provides an effective hybrid attention mechanism that smoothly transitions between linear and full attention while maintaining structural uniformity across layers, offering both efficiency and accuracy benefits.

Abstract: Transformers excel at sequence modeling but face quadratic complexity, while linear attention offers improved efficiency but often compromises recall accuracy over long contexts. In this work, we introduce Native Hybrid Attention (NHA), a novel hybrid architecture of linear and full attention that integrates both intra & inter-layer hybridization into a unified layer design. NHA maintains long-term context in key-value slots updated by a linear RNN, and augments them with short-term tokens from a sliding window. A single \texttt{softmax attention} operation is then applied over all keys and values, enabling per-token and per-head context-dependent weighting without requiring additional fusion parameters. The inter-layer behavior is controlled through a single hyperparameter, the sliding window size, which allows smooth adjustment between purely linear and full attention while keeping all layers structurally uniform. Experimental results show that NHA surpasses Transformers and other hybrid baselines on recall-intensive and commonsense reasoning tasks. Furthermore, pretrained LLMs can be structurally hybridized with NHA, achieving competitive accuracy while delivering significant efficiency gains. Code is available at https://github.com/JusenD/NHA.

[75] Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

Shrestha Ghosh, Luca Giordano, Yujia Hu, Tuan-Phong Nguyen, Simon Razniewski

Main category: cs.CL

TL;DR: Analysis of GPT-4.1’s factual knowledge reveals significant differences from established knowledge bases, lower accuracy than benchmarks suggest, and major issues with inconsistency, ambiguity, and hallucinations.

Details

Motivation: To deeply understand the factual knowledge of frontier LLMs, which remains poorly understood and is usually analyzed from biased samples.

Method: Analyzed GPTKB v1.5, a recursively elicited set of 100 million beliefs from GPT-4.1, one of the strongest currently available frontier LLMs.

Result: Found that the model’s factual knowledge differs significantly from established knowledge bases, has lower accuracy than previous benchmarks indicated, and suffers from major issues with inconsistency, ambiguity, and hallucinations.

Conclusion: The findings highlight significant challenges in LLM factual knowledge and point to future research opportunities for improving factual accuracy and reliability in large language models.

Abstract: LLMs are remarkable artifacts that have revolutionized a range of NLP and AI tasks. A significant contributor is their factual knowledge, which, to date, remains poorly understood, and is usually analyzed from biased samples. In this paper, we take a deep tour into the factual knowledge (or beliefs) of a frontier LLM, based on GPTKB v1.5 (Hu et al., 2025a), a recursively elicited set of 100 million beliefs of one of the strongest currently available frontier LLMs, GPT-4.1. We find that the models’ factual knowledge differs quite significantly from established knowledge bases, and that its accuracy is significantly lower than indicated by previous benchmarks. We also find that inconsistency, ambiguity and hallucinations are major issues, shedding light on future research opportunities concerning factual LLM knowledge.

[76] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh

Main category: cs.CL

TL;DR: This survey provides the first comprehensive analysis of code-switching (CSW) in large language models (LLMs), reviewing studies across 5 research areas, 12 NLP tasks, 30+ datasets, and 80+ languages.

Details

Motivation: Code-switching remains a fundamental challenge for multilingual NLP despite advances in LLMs, with models struggling with mixed-language inputs, limited datasets, and evaluation biases that hinder deployment in multilingual societies.

Method: The survey classifies recent advances by architecture, training strategy, and evaluation methodology, analyzing how LLMs have reshaped CSW modeling and what challenges persist.

Result: The paper provides a comprehensive review of CSW-aware LLM research spanning multiple research areas, NLP tasks, datasets, and languages, with curated resources maintained at a GitHub repository.

Conclusion: The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence.

Abstract: Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multiling ual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing \total{unique_references} studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.

[77] Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models

Yuntao Gui, James Cheng

Main category: cs.CL

TL;DR: Search-R3 is a framework that adapts LLMs to generate search embeddings through their reasoning process, using supervised learning, reinforcement learning, and a specialized RL environment to outperform prior methods.

Details

Motivation: LLMs have strong natural language understanding but are underutilized for retrieval tasks. The goal is to leverage LLMs' chain-of-thought capabilities for more effective embedding generation.

Method: Three mechanisms: (1) supervised learning for quality embeddings, (2) RL to optimize embedding generation with reasoning, (3) specialized RL environment handling evolving embeddings without full corpus re-encoding.

Result: Significantly outperforms prior methods on diverse benchmarks by unifying reasoning and embedding generation processes.

Conclusion: Search-R3 represents a substantial advancement for complex knowledge-intensive tasks requiring both sophisticated reasoning and effective information retrieval.

Abstract: Despite their remarkable natural language understanding capabilities, Large Language Models (LLMs) have been underutilized for retrieval tasks. We present Search-R3, a novel framework that addresses this limitation by adapting LLMs to generate search embeddings as a direct output of their reasoning process. Our approach exploits LLMs’ chain-of-thought capabilities, allowing them to produce more effective embeddings by reasoning step-by-step through complex semantic analyses. We implement this through three complementary mechanisms. (1) a supervised learning stage enables the model’s ability to produce quality embeddings, (2) a reinforcement learning (RL) methodology that optimizes embedding generation alongside reasoning, and (3) a specialized RL environment that efficiently handles evolving embedding representations without requiring complete corpus re-encoding at each training iteration. Our extensive evaluations on diverse benchmarks demonstrate that Search-R3 significantly outperforms prior methods by unifying the reasoning and embedding generation processes. This integrated post-training approach represents a substantial advancement in handling complex knowledge-intensive tasks that require both sophisticated reasoning and effective information retrieval. Project page: https://github.com/ytgui/Search-R3

[78] Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations

Miriam Wanner, Sophia Hager, Anjalie Field

Main category: cs.CL

TL;DR: Sinclair Broadcast Group’s acquisition of local news stations leads to increased national news coverage and more polarizing topics at the expense of local content.

Details

Motivation: To investigate how local news coverage changes when stations are acquired by Sinclair Broadcast Group, given their trusted status and influence on viewers.

Method: Computational analysis of internet content from local news stations before and after Sinclair acquisition, comparing with national news outlets.

Result: Clear evidence shows stations report more national news and polarizing topics while reducing local coverage after acquisition.

Conclusion: Sinclair’s ownership shifts local news focus from community concerns to national political topics, potentially undermining their role as trusted local information sources.

Abstract: Local news stations are often considered to be reliable sources of non-politicized information, particularly local concerns that residents care about. Because these stations are trusted news sources, viewers are particularly susceptible to the information they report. The Sinclair Broadcast group is a broadcasting company that has acquired many local news stations in the last decade. We investigate the effects of local news stations being acquired by Sinclair: how does coverage change? We use computational methods to investigate changes in internet content put out by local news stations before and after being acquired by Sinclair and in comparison to national news outlets. We find that there is clear evidence that local news stations report more frequently on national news at the expense of local topics, and that their coverage of polarizing national topics increases.

[79] Revisiting Metric Reliability for Fine-grained Evaluation of Machine Translation and Summarization in Indian Languages

Amir Hossein Yari, Kalmit Kulkarni, Ahmad Raza Khan, Fajri Koto

Main category: cs.CL

TL;DR: ITEM is a benchmark evaluating 26 automatic metrics for Indian language MT and TS, finding LLM-based evaluators align best with humans, outliers impact agreement, metrics differ in TS vs MT effectiveness, and vary in robustness to perturbations.

Details

Motivation: Current MT/TS metrics focus mainly on English/high-resource languages, leaving Indian languages (spoken by 1.5B+ people) overlooked, questioning the universality of evaluation practices.

Method: Large-scale benchmark evaluating 26 automatic metrics across 6 major Indian languages with fine-grained annotations, covering agreement with human judgments, outlier sensitivity, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations.

Result: Four key findings: (1) LLM-based evaluators show strongest human alignment; (2) outliers significantly impact metric-human agreement; (3) TS metrics better capture content fidelity while MT metrics better reflect fluency; (4) metrics vary in robustness to perturbations.

Conclusion: Provides critical guidance for advancing metric design and evaluation in Indian languages, addressing the gap in multilingual evaluation practices.

Abstract: While automatic metrics drive progress in Machine Translation (MT) and Text Summarization (TS), existing metrics have been developed and validated almost exclusively for English and other high-resource languages. This narrow focus leaves Indian languages, spoken by over 1.5 billion people, largely overlooked, casting doubt on the universality of current evaluation practices. To address this gap, we introduce ITEM, a large-scale benchmark that systematically evaluates the alignment of 26 automatic metrics with human judgments across six major Indian languages, enriched with fine-grained annotations. Our extensive evaluation, covering agreement with human judgments, sensitivity to outliers, language-specific reliability, inter-metric correlations, and resilience to controlled perturbations, reveals four central findings: (1) LLM-based evaluators show the strongest alignment with human judgments at both segment and system levels; (2) outliers exert a significant impact on metric-human agreement; (3) in TS, metrics are more effective at capturing content fidelity, whereas in MT, they better reflect fluency; and (4) metrics differ in their robustness and sensitivity when subjected to diverse perturbations. Collectively, these findings offer critical guidance for advancing metric design and evaluation in Indian languages.

[80] LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish

Fred Philippy, Laura Bernardy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé

Main category: cs.CL

TL;DR: Cross-lingual instruction tuning for Luxembourgish using aligned data from English, French, and German, avoiding machine translation to preserve linguistic and cultural nuances.

Details

Motivation: Low-resource languages like Luxembourgish lack high-quality instruction datasets, and machine translation often introduces semantic misalignment and cultural inaccuracies.

Method: Create cross-lingual instruction tuning dataset for Luxembourgish by leveraging aligned data from English, French, and German, without using machine-generated translations into Luxembourgish.

Result: Cross-lingual instruction tuning improves representational alignment across languages and enhances the model’s generative capabilities in Luxembourgish.

Conclusion: Cross-lingual data curation can avoid pitfalls of machine-translated data and directly benefit low-resource language development.

Abstract: Instruction tuning has become a key technique for enhancing the performance of large language models, enabling them to better follow human prompts. However, low-resource languages such as Luxembourgish face severe limitations due to the lack of high-quality instruction datasets. Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. In this work, we address these challenges by creating a cross-lingual instruction tuning dataset for Luxembourgish, without resorting to machine-generated translations into it. Instead, by leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances. We provide evidence that cross-lingual instruction tuning not only improves representational alignment across languages but also the model’s generative capabilities in Luxembourgish. This highlights how cross-lingual data curation can avoid the common pitfalls of machine-translated data and directly benefit low-resource language development.

[81] Accelerating Diffusion LLM Inference via Local Determinism Propagation

Fanheng Kong, Jingyuan Zhang, Yahui Liu, Zirui Wu, Yu Tian, Victoria W., Guorui Zhou

Main category: cs.CL

TL;DR: LocalLeap is a training-free adaptive parallel decoding strategy for diffusion large language models that addresses delayed decoding by using local determinism propagation and progressive spatial consistency decay to achieve 6.94× throughput improvements with negligible performance impact.

Details

Motivation: Existing open-source diffusion LLM implementations suffer from quality-speed trade-offs due to conservative sampling strategies that use greedy decoding, leading to delayed decoding and inefficient inference with repeated redundant refinement iterations.

Method: LocalLeap identifies high-confidence anchors and performs localized relaxed parallel decoding within bounded neighborhoods using two principles: local determinism propagation and progressive spatial consistency decay, enabling early commitment of already-determined tokens.

Result: LocalLeap achieves 6.94× throughput improvements and reduces decoding steps to just 14.2% of the original requirement, with negligible performance impact across various benchmarks.

Conclusion: LocalLeap effectively addresses the delayed decoding problem in diffusion LLMs through adaptive parallel decoding, achieving substantial efficiency gains without compromising output quality.

Abstract: Diffusion large language models (dLLMs) represent a significant advancement in text generation, offering parallel token decoding capabilities. However, existing open-source implementations suffer from quality-speed trade-offs that impede their practical deployment. Conservative sampling strategies typically decode only the most confident token per step to ensure quality (i.e., greedy decoding), at the cost of inference efficiency due to repeated redundant refinement iterations–a phenomenon we term delayed decoding. Through systematic analysis of dLLM decoding dynamics, we characterize this delayed decoding behavior and propose a training-free adaptive parallel decoding strategy, named LocalLeap, to address these inefficiencies. LocalLeap is built on two fundamental empirical principles: local determinism propagation centered on high-confidence anchors and progressive spatial consistency decay. By applying these principles, LocalLeap identifies anchors and performs localized relaxed parallel decoding within bounded neighborhoods, achieving substantial inference step reduction through early commitment of already-determined tokens without compromising output quality. Comprehensive evaluation on various benchmarks demonstrates that LocalLeap achieves 6.94$\times$ throughput improvements and reduces decoding steps to just 14.2% of the original requirement, achieving these gains with negligible performance impact. The source codes are available at: https://github.com/friedrichor/LocalLeap.

[82] All Claims Are Equal, but Some Claims Are More Equal Than Others: Importance-Sensitive Factuality Evaluation of LLM Generations

Miriam Wanner, Leif Azzopardi, Paul Thomas, Soham Dan, Benjamin Van Durme, Nick Craswell

Main category: cs.CL

TL;DR: VITAL introduces new metrics for evaluating LLM factuality that focus on detecting errors in key information, unlike existing methods that treat all claims equally.

Details

Motivation: Existing factuality evaluation methods are insensitive to errors in key information because they treat all claims as equally important, leading to misleading assessments when vital information is missing or incorrect.

Method: Created VITALERRORS benchmark with 6,733 queries containing minimally altered LLM responses that omit or falsify key information. Developed VITAL metrics that incorporate relevance and importance of claims relative to the query.

Result: VITAL metrics demonstrated greater sensitivity in detecting errors in key information compared to existing evaluation methods, showing they more reliably identify when vital information is missing or incorrect.

Conclusion: The VITAL metrics, dataset, and analysis provide a foundation for more accurate and robust assessment of LLM factuality by focusing on the importance of claims rather than treating all information equally.

Abstract: Existing methods for evaluating the factuality of large language model (LLM) responses treat all claims as equally important. This results in misleading evaluations when vital information is missing or incorrect as it receives the same weight as peripheral details, raising the question: how can we reliably detect such differences when there are errors in key information? Current approaches that measure factuality tend to be insensitive to omitted or false key information. To investigate this lack of sensitivity, we construct VITALERRORS, a benchmark of 6,733 queries with minimally altered LLM responses designed to omit or falsify key information. Using this dataset, we demonstrate the insensitivities of existing evaluation metrics to key information errors. To address this gap, we introduce VITAL, a set of metrics that provide greater sensitivity in measuring the factuality of responses by incorporating the relevance and importance of claims with respect to the query. Our analysis demonstrates that VITAL metrics more reliably detect errors in key information than previous methods. Our dataset, metrics, and analysis provide a foundation for more accurate and robust assessment of LLM factuality.

[83] TALENT: Table VQA via Augmented Language-Enhanced Natural-text Transcription

Guo Yutong, Wanying Wang, Yue Wu, Zichen Miao, Haoyu Wang

Main category: cs.CL

TL;DR: TALENT is a lightweight framework that uses a small VLM for OCR and natural language narration, combined with an LLM for reasoning, achieving comparable performance to large VLMs at lower computational cost.

Details

Motivation: Large VLMs are computationally expensive and miss fine-grained details, while OCR+LLM approaches using structured outputs introduce substantial errors. A more efficient and accurate solution is needed.

Method: TALENT uses dual representations: a small VLM produces both OCR text and natural language narration, which are combined with the question for LLM reasoning, reframing Table VQA as LLM-centric multimodal reasoning.

Result: TALENT enables small VLM-LLM combinations to match or surpass single large VLMs at significantly lower computational cost on both public datasets and the new ReTabVQA dataset.

Conclusion: The proposed framework effectively separates perception from reasoning, making Table VQA more efficient and accurate while enabling deployment on resource-constrained devices.

Abstract: Table Visual Question Answering (Table VQA) is typically addressed by large vision-language models (VLMs). While such models can answer directly from images, they often miss fine-grained details unless scaled to very large sizes, which are computationally prohibitive, especially for mobile deployment. A lighter alternative is to have a small VLM perform OCR and then use a large language model (LLM) to reason over structured outputs such as Markdown tables. However, these representations are not naturally optimized for LLMs and still introduce substantial errors. We propose TALENT (Table VQA via Augmented Language-Enhanced Natural-text Transcription), a lightweight framework that leverages dual representations of tables. TALENT prompts a small VLM to produce both OCR text and natural language narration, then combines them with the question for reasoning by an LLM. This reframes Table VQA as an LLM-centric multimodal reasoning task, where the VLM serves as a perception-narration module rather than a monolithic solver. Additionally, we construct ReTabVQA, a more challenging Table VQA dataset requiring multi-step quantitative reasoning over table images. Experiments show that TALENT enables a small VLM-LLM combination to match or surpass a single large VLM at significantly lower computational cost on both public datasets and ReTabVQA.

[84] Opt-ICL at LeWiDi-2025: Maximizing In-Context Signal from Rater Examples via Meta-Learning

Taylor Sorensen, Yejin Choi

Main category: cs.CL

TL;DR: A system for modeling human variation in NLP tasks using LLMs’ in-context learning with two-step meta-learning, winning the LeWiDi competition and showing importance of rater examples, dataset-specific fine-tuning, and model scale.

Details

Motivation: Many NLP tasks involve subjectivity, ambiguity, or legitimate disagreement between annotators, requiring systems that can model human variation effectively.

Method: Leverages LLMs’ in-context learning with two-step meta-learning: 1) post-training on multiple in-context datasets, and 2) specializing via in-context meta-learning to specific data distributions.

Result: Won the Learning With Disagreements (LeWiDi) competition on both tasks. Ablation study showed rater examples in-context are crucial, dataset-specific fine-tuning helps on larger datasets, post-training helps on one competition dataset, and performance improves with model scale.

Conclusion: The proposed system effectively models human variation in NLP tasks through in-context learning and meta-learning, demonstrating strong performance in handling annotator disagreements and showing the importance of key components like rater examples and model scale.

Abstract: Many natural language processing (NLP) tasks involve subjectivity, ambiguity, or legitimate disagreement between annotators. In this paper, we outline our system for modeling human variation. Our system leverages language models’ (LLMs) in-context learning abilities, along with a two-step meta-learning training procedure for 1) post-training on many datasets requiring in-context learning and 2) specializing the model via in-context meta-learning to the particular data distribution of interest. We also evaluate the performance of our system submission to the Learning With Disagreements (LeWiDi) competition, where it was the overall winner on both tasks. Additionally, we perform an ablation study to measure the importance of each system component. We find that including rater examples in-context is crucial for our system’s performance, dataset-specific fine-tuning is helpful on the larger datasets, post-training on other in-context datasets is helpful on one of the competition datasets, and that performance improves with model scale.

[85] TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy

Main category: cs.CL

TL;DR: TRIM is a token-centric framework that uses attention-based fingerprints instead of gradients to select high-quality coresets for instruction tuning, achieving better performance than full-data fine-tuning in some cases with much lower computational cost.

Details

Motivation: Existing coreset selection methods for instruction tuning rely on computationally expensive gradient-based approaches that overlook fine-grained token-level features, motivating the need for a more efficient and sensitive method.

Method: TRIM uses forward-only, token-centric analysis with attention-based fingerprints from target samples to identify representational patterns, avoiding expensive backward passes and focusing on structural task features.

Result: Coresets selected by TRIM outperform state-of-the-art baselines by up to 9% on downstream tasks and sometimes exceed full-data fine-tuning performance, while requiring only a fraction of the computational cost.

Conclusion: TRIM provides a scalable and efficient alternative for building high-quality instruction-tuning datasets by leveraging attention patterns instead of gradients.

Abstract: Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based “fingerprints” from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.

[86] Comparing human and language models sentence processing difficulties on complex structures

Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant

Main category: cs.CL

TL;DR: LLMs struggle with garden path sentences and show varying similarity to human sentence comprehension patterns depending on model size and strength.

Details

Motivation: To investigate whether LLMs experience human-like processing difficulties in sentence comprehension, particularly across challenging linguistic structures.

Method: Systematic comparison of human and LLM sentence comprehension across seven challenging linguistic structures using data from humans and five families of state-of-the-art LLMs in a unified experimental framework.

Result: LLMs overall struggle on target structures, especially garden path sentences (46.8% accuracy for GPT-5 vs 93.7% on non-GP structures). Rank correlation between humans and models increases with parameter count. Performance gap patterns match humans except for very weak or very strong models.

Conclusion: There is both convergence and divergence in human and LLM sentence comprehension, revealing new insights into their similarity, with model size affecting how closely LLMs mirror human processing patterns.

Abstract: Large language models (LLMs) that fluently converse with humans are a reality

but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure in a unified experimental framework. Our results show LLMs overall struggle on the target structures, but especially on garden path (GP) sentences. Indeed, while the strongest models achieve near perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for their matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak performance is uniformly low across both sentence types, and for models that are too strong the performance is uniformly high. Together, these reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.

[87] Reasoning for Hierarchical Text Classification: The Case of Patents

Lekang Jiang, Wenjun Sun, Stephan Goetz

Main category: cs.CL

TL;DR: Proposes RHC framework that reformulates hierarchical text classification as step-by-step reasoning, using LLMs trained with CoT alignment and reinforcement learning to achieve better performance and explainability.

Details

Motivation: Automated patent classification is challenging due to domain complexity and large label sets, and existing approaches lack explainability by only outputting flat labels without reasoning.

Method: Two-stage training: cold-start stage aligns LLM outputs with chain-of-thought reasoning format, followed by reinforcement learning stage to enhance multi-step reasoning ability for hierarchical label deduction.

Result: RHC outperforms baselines by ~3% in accuracy and macro F1, provides natural-language justifications, scales better with model size, and achieves SOTA on multiple HTC benchmarks beyond patents.

Conclusion: RHC effectively addresses hierarchical text classification challenges by leveraging reasoning capabilities of LLMs, offering improved performance, explainability, scalability and broad applicability across domains.

Abstract: Hierarchical text classification (HTC) assigns documents to multiple levels of a pre-defined taxonomy. Automated patent subject classification represents one of the hardest HTC scenarios because of domain knowledge difficulty and a huge number of labels. Prior approaches only output a flat label set, which offers little insight into the reason behind predictions. Therefore, we propose Reasoning for Hierarchical Classification (RHC), a novel framework that reformulates HTC as a step-by-step reasoning task to sequentially deduce hierarchical labels. RHC trains large language models (LLMs) in two stages: a cold-start stage that aligns outputs with chain-of-thought (CoT) reasoning format and a reinforcement learning (RL) stage to enhance multi-step reasoning ability. RHC demonstrates four advantages in our experiments. (1) Effectiveness: RHC surpasses previous baselines and outperforms the supervised fine-tuning counterparts by approximately 3% in accuracy and macro F1. (2) Explainability: RHC produces natural-language justifications before prediction to facilitate human inspection. (3) Scalability: RHC scales favorably with model size with larger gains compared to standard fine-tuning. (4) Applicability: Beyond patents, we further demonstrate that RHC achieves state-of-the-art performance on other widely used HTC benchmarks, which highlights its broad applicability.

[88] More Data or Better Data? A Critical Analysis of Data Selection and Synthesis for Mathematical Reasoning

Yike Zhao, Simin Guo, Ziqing Yang, Shifan Han, Dahua Lin, Fei Tan

Main category: cs.CL

TL;DR: Comprehensive analysis of mathematical reasoning datasets and synthesis techniques, showing that better data quality and structure often beats simply scaling data volume.

Details

Motivation: LLM reasoning capabilities depend heavily on training data quality, but practical utility of data construction methods in real-world pipelines remains underexplored.

Method: Conducted unified analysis of open-source datasets and data synthesis techniques for mathematical reasoning using a pipeline mirroring real training and deployment scenarios.

Result: Found that structuring data in interpretable formats and distilling from stronger models often outweighs simply scaling up data volume. Identified effective data selection strategies for industrial applications.

Conclusion: Provides actionable guidance for integrating training data to enhance LLM capabilities, supporting cost-effective data curation and scalable model enhancement. Highlights need to balance “more data” versus “better data” for real-world reasoning tasks.

Abstract: The reasoning capabilities of Large Language Models (LLMs) play a critical role in many downstream tasks, yet depend strongly on the quality of training data. Despite various proposed data construction methods, their practical utility in real-world pipelines remains underexplored. In this work, we conduct a comprehensive analysis of open-source datasets and data synthesis techniques for mathematical reasoning, evaluating them under a unified pipeline designed to mirror training and deployment scenarios. We further distill effective data selection strategies and identify practical methods suitable for industrial applications. Our findings highlight that structuring data in more interpretable formats, or distilling from stronger models often outweighs simply scaling up data volume. This study provides actionable guidance for integrating training data to enhance LLM capabilities, supporting both cost-effective data curation and scalable model enhancement. We hope this work will inspire further research on how to balance “more data” versus “better data” for real-world reasoning tasks.

[89] NurseLLM: The First Specialized Language Model for Nursing

Md Tawkat Islam Khondaker, Julia Harrington, Shady Shehata

Main category: cs.CL

TL;DR: NurseLLM is the first nursing-specialized large language model that outperforms state-of-the-art general-purpose and medical-specialized LLMs on multiple choice question-answering tasks in nursing.

Details

Motivation: Large language models have transformed medical systems, but their potential in specialized domains like nursing remains largely underexplored, creating a need for domain-specific models.

Method: Developed a multi-stage data generation pipeline to create the first large-scale nursing MCQ dataset, trained NurseLLM on broad nursing topics, and introduced multiple nursing benchmarks for evaluation.

Result: NurseLLM outperforms state-of-the-art general-purpose and medical-specialized LLMs of comparable size across different nursing benchmarks.

Conclusion: Specialized LLMs are important for the nursing domain, and reasoning with multi-agent collaboration systems shows promise for future nursing research and applications.

Abstract: Recent advancements in large language models (LLMs) have significantly transformed medical systems. However, their potential within specialized domains such as nursing remains largely underexplored. In this work, we introduce NurseLLM, the first nursing-specialized LLM tailored for multiple choice question-answering (MCQ) tasks. We develop a multi-stage data generation pipeline to build the first large scale nursing MCQ dataset to train LLMs on a broad spectrum of nursing topics. We further introduce multiple nursing benchmarks to enable rigorous evaluation. Our extensive experiments demonstrate that NurseLLM outperforms SoTA general-purpose and medical-specialized LLMs of comparable size on different benchmarks, underscoring the importance of a specialized LLM for the nursing domain. Finally, we explore the role of reasoning and multi-agent collaboration systems in nursing, highlighting their promise for future research and applications.

[90] Quantifying Data Contamination in Psychometric Evaluations of LLMs

Jongwook Han, Woojung Song, Jonggeun Lee, Yohan Jo

Main category: cs.CL

TL;DR: Proposes a framework to systematically measure data contamination in psychometric evaluations of LLMs, finding strong contamination in popular inventories like BFI-44 and PVQ-40 where models memorize items and can adjust responses to achieve target scores.

Details

Motivation: Address the gap in quantifying data contamination from psychometric inventories in LLM evaluations, as prior work raised concerns about contamination threatening reliability but lacked systematic measurement.

Method: Developed a framework measuring three aspects: item memorization, evaluation memorization, and target score matching. Applied to 21 models from major families and four widely used psychometric inventories.

Result: Found strong contamination in popular inventories like Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40). Models not only memorize items but can also adjust responses to achieve specific target scores.

Conclusion: Data contamination is a significant issue in psychometric evaluations of LLMs, with popular inventories showing strong contamination effects that compromise the reliability of such assessments.

Abstract: Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.

[91] CARPAS: Towards Content-Aware Refinement of Provided Aspects for Summarization in Large Language Models

Yong-En Tian, Yu-Chien Tang, An-Zi Yen, Wen-Chih Peng

Main category: cs.CL

TL;DR: The paper introduces CARPAS, a new task for content-aware refinement of provided aspects in aspect-based summarization, where LLMs dynamically adjust aspects based on document context before summarizing.

Details

Motivation: Existing aspect-based summarization approaches assume predefined aspects, but real-world scenarios often have incomplete, irrelevant, or missing aspects that need adaptive refinement based on document content.

Method: Proposes CARPAS task with three new datasets, uses LLMs with four prompting strategies, and introduces a preliminary subtask to predict the number of relevant aspects as guidance for LLMs.

Result: LLMs tend to predict overly comprehensive aspect sets leading to long, misaligned summaries. Using predicted aspect numbers as guidance significantly improves performance across all datasets and helps LLMs focus on pertinent aspects.

Conclusion: The approach effectively reduces inference difficulty and enables LLMs to focus on relevant aspects. Analysis reveals LLMs’ compliance when requested aspect numbers differ from their estimations, providing crucial insights for real-world LLM deployment.

Abstract: Aspect-based summarization has attracted significant attention for its ability to generate more fine-grained and user-aligned summaries. While most existing approaches assume a set of predefined aspects as input, real-world scenarios often present challenges where these given aspects may be incomplete, irrelevant, or entirely missing from the document. Users frequently expect systems to adaptively refine or filter the provided aspects based on the actual content. In this paper, we initiate this novel task setting, termed Content-Aware Refinement of Provided Aspects for Summarization (CARPAS), with the aim of dynamically adjusting the provided aspects based on the document context before summarizing. We construct three new datasets to facilitate our pilot experiments, and by using LLMs with four representative prompting strategies in this task, we find that LLMs tend to predict an overly comprehensive set of aspects, which often results in excessively long and misaligned summaries. Building on this observation, we propose a preliminary subtask to predict the number of relevant aspects, and demonstrate that the predicted number can serve as effective guidance for the LLMs, reducing the inference difficulty, and enabling them to focus on the most pertinent aspects. Our extensive experiments show that the proposed approach significantly improves performance across all datasets. Moreover, our deeper analyses uncover LLMs’ compliance when the requested number of aspects differs from their own estimations, establishing a crucial insight for the deployment of LLMs in similar real-world applications.

[92] Biasless Language Models Learn Unnaturally: How LLMs Fail to Distinguish the Possible from the Impossible

Imry Ziv, Nur Lan, Emmanuel Chemla, Roni Katzir

Main category: cs.CL

TL;DR: LLMs like GPT-2 don’t distinguish between humanly possible and impossible languages, suggesting they lack human innate linguistic biases.

Details

Motivation: To test whether LLMs share human innate learning biases by examining if they can distinguish between possible and impossible human languages.

Method: Compared GPT-2 learning curves on natural languages and impossible counterparts created via perturbations, analyzed cross-linguistic variance in perplexity metrics.

Result: GPT-2 learns possible and impossible languages equally easily and provides no systematic separation between them.

Conclusion: LLMs do not share the human innate biases that shape linguistic typology.

Abstract: Are large language models (LLMs) sensitive to the distinction between humanly possible languages and humanly impossible languages? This question is taken by many to bear on whether LLMs and humans share the same innate learning biases. Previous work has attempted to answer it in the positive by comparing LLM learning curves on existing language datasets and on “impossible” datasets derived from them via various perturbation functions. Using the same methodology, we examine this claim on a wider set of languages and impossible perturbations. We find that in most cases, GPT-2 learns each language and its impossible counterpart equally easily, in contrast to previous claims. We also apply a more lenient condition by testing whether GPT-2 provides any kind of separation between the whole set of natural languages and the whole set of impossible languages. By considering cross-linguistic variance in various metrics computed on the perplexity curves, we show that GPT-2 provides no systematic separation between the possible and the impossible. Taken together, these perspectives show that LLMs do not share the human innate biases that shape linguistic typology.

[93] Language Lives in Sparse Dimensions: Toward Interpretable and Efficient Multilingual Control for Large Language Models

Chengzhi Zhong, Fei Cheng, Qianying Liu, Yugo Murawaki, Chenhui Chu, Sadao Kurohashi

Main category: cs.CL

TL;DR: A training-free method identifies sparse cross-lingual dimensions in LLMs that control language switching while preserving semantics, outperforming prior neuron-based approaches with minimal data.

Details

Motivation: Large language models show strong multilingual capabilities despite limited non-English training data, suggesting they use English-aligned representations that get projected to target languages. The goal is to understand and control this cross-lingual transition mechanism.

Method: A simple, training-free method that identifies sparse cross-lingual dimensions occurring at consistent indices across intermediate to final layers. Requires only 50 sentences of parallel or monolingual data to identify and manipulate these dimensions.

Result: Experiments show the method can switch output language while preserving semantic content, and outperforms prior neuron-based approaches at substantially lower computational cost.

Conclusion: Cross-lingual transitions in LLMs are governed by sparse, interpretable dimensions that can be efficiently identified and manipulated for multilingual generation control without model training.

Abstract: Large language models exhibit strong multilingual capabilities despite limited exposure to non-English data. Prior studies show that English-centric large language models map multilingual content into English-aligned representations at intermediate layers and then project them back into target-language token spaces in the final layer. From this observation, we hypothesize that this cross-lingual transition is governed by a small and sparse set of dimensions, which occur at consistent indices across the intermediate to final layers. Building on this insight, we introduce a simple, training-free method to identify and manipulate these dimensions, requiring only as few as 50 sentences of either parallel or monolingual data. Experiments on a multilingual generation control task reveal the interpretability of these dimensions, demonstrating that the interventions in these dimensions can switch the output language while preserving semantic content, and that it surpasses the performance of prior neuron-based approaches at a substantially lower cost.

[94] Sunflower: A New Approach To Expanding Coverage of African Languages in Large Language Models

Benjamin Akera, Evelyn Nafula Ouma, Gilbert Yiga, Patrick Walukagga, Phionah Natukunda, Trevor Saaka, Solomon Nsumba, Lilian Teddy Nabukeera, Joel Muhanguzi, Imran Sekalala, Nimpamya Janat Namara, Engineer Bainomugisha, Ernest Mwebaze, John Quinn

Main category: cs.CL

TL;DR: Sunflower 14B and 32B models were developed to address the language technology gap for African languages, specifically focusing on Uganda’s linguistic diversity with state-of-the-art comprehension across most Ugandan languages.

Details

Motivation: Most African languages (over 2000) lack adequate language technology support, with current LLMs prioritizing only the most common languages, leaving many underserved. A regionally focused approach is needed to efficiently address linguistic diversity.

Method: Developed Sunflower 14B and 32B models based on Qwen 3 architecture, specifically designed for Uganda’s high linguistic diversity, taking a regionally focused approach rather than piecemeal language support.

Result: Created open-source models with state-of-the-art comprehension in the majority of all Ugandan languages, demonstrating that regional focus is more efficient than current piecemeal approaches.

Conclusion: The Sunflower models can be used to reduce language barriers in important practical applications, showing that regionally focused development is an effective strategy for addressing language technology gaps in linguistically diverse regions like Africa.

Abstract: There are more than 2000 living languages in Africa, most of which have been bypassed by advances in language technology. Current leading LLMs exhibit strong performance on a number of the most common languages (e.g. Swahili or Yoruba), but prioritise support for the languages with the most speakers first, resulting in piecemeal ability across disparate languages. We contend that a regionally focussed approach is more efficient, and present a case study for Uganda, a country with high linguistic diversity. We describe the development of Sunflower 14B and 32B, a pair of models based on Qwen 3 with state of the art comprehension in the majority of all Ugandan languages. These models are open source and can be used to reduce language barriers in a number of important practical applications.

[95] Where to Begin: Efficient Pretraining via Subnetwork Selection and Distillation

Arjun Krishnakumar, Rhea Sanjay Sukthanker, Hannan Javed Mahadik, Gabriela Kadlecová, Vladyslav Moroshan, Timur Carstensen, Frank Hutter, Aaron Klein

Main category: cs.CL

TL;DR: A framework for efficient pretraining of Small Language Models (SLMs) using sparse sub-network initializations, evolutionary search, and knowledge distillation to achieve comparable performance with significantly fewer resources.

Details

Motivation: To make SLMs more efficient and accessible by reducing computational costs while maintaining strong performance compared to Large Language Models.

Method: Three complementary techniques: 1) Structurally sparse sub-network initializations, 2) Evolutionary search for discovering high-quality initializations, 3) Knowledge distillation from larger teacher models.

Result: The best model matches validation perplexity of comparable Pythia SLM while requiring 9.2x fewer pretraining tokens.

Conclusion: The framework provides a practical and reproducible path for cost-efficient small language model development at scale.

Abstract: Small Language models (SLMs) offer an efficient and accessible alternative to Large Language Models (LLMs), delivering strong performance while using far fewer resources. We introduce a simple and effective framework for pretraining SLMs that brings together three complementary ideas. First, we identify structurally sparse sub-network initializations that consistently outperform randomly initialized models of similar size under the same compute budget. Second, we use evolutionary search to automatically discover high-quality sub-network initializations, providing better starting points for pretraining. Third, we apply knowledge distillation from larger teacher models to speed up training and improve generalization. Together, these components make SLM pretraining substantially more efficient: our best model, discovered using evolutionary search and initialized with LLM weights, matches the validation perplexity of a comparable Pythia SLM while requiring 9.2x fewer pretraining tokens. We release all code and models at https://github.com/whittle-org/whittle/, offering a practical and reproducible path toward cost-efficient small language model development at scale.

[96] How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu

Benjamin Akera, Evelyn Nafula, Patrick Walukagga, Gilbert Yiga, John Quinn, Ernest Mwebaze

Main category: cs.CL

TL;DR: This paper provides practical deployment guidance for ASR systems in low-resource African languages, showing that viable performance (WER < 13%) can be achieved with just 50 hours of training data, and that data quality issues are a major source of errors.

Details

Motivation: To address the challenge of developing ASR systems for low-resource African languages with limited transcribed speech data, and to provide practical guidance on data requirements and failure modes for real-world deployment.

Method: Comprehensive experiments using Whisper model on two Bantu languages: systematic data scaling analysis on Kinyarwanda (1-1,400 hours training data) and detailed error characterization on Kikuyu (270 hours training data).

Result: Practical ASR performance (WER < 13%) achievable with 50 hours of training data, with continued improvements through 200 hours (WER < 10%). Error analysis revealed that data quality issues, particularly noisy ground truth transcriptions, account for 38.6% of high-error cases.

Conclusion: Careful data curation is as critical as data volume for robust ASR performance in low-resource languages. The study provides actionable benchmarks and deployment guidance for similar contexts.

Abstract: The development of Automatic Speech Recognition (ASR) systems for low-resource African languages remains challenging due to limited transcribed speech data. While recent advances in large multilingual models like OpenAI’s Whisper offer promising pathways for low-resource ASR development, critical questions persist regarding practical deployment requirements. This paper addresses two fundamental concerns for practitioners: determining the minimum data volumes needed for viable performance and characterizing the primary failure modes that emerge in production systems. We evaluate Whisper’s performance through comprehensive experiments on two Bantu languages: systematic data scaling analysis on Kinyarwanda using training sets from 1 to 1,400 hours, and detailed error characterization on Kikuyu using 270 hours of training data. Our scaling experiments demonstrate that practical ASR performance (WER < 13%) becomes achievable with as little as 50 hours of training data, with substantial improvements continuing through 200 hours (WER < 10%). Complementing these volume-focused findings, our error analysis reveals that data quality issues, particularly noisy ground truth transcriptions, account for 38.6% of high-error cases, indicating that careful data curation is as critical as data volume for robust system performance. These results provide actionable benchmarks and deployment guidance for teams developing ASR systems across similar low-resource language contexts. We release accompanying and models see https://github.com/SunbirdAI/kinyarwanda-whisper-eval

[97] Benchmarking LLM Causal Reasoning with Scientifically Validated Relationships

Donggyu Lee, Sungwon Park, Yerin Hwang, Hyunwoo Oh, Hyoshin Kim, Jungwon Kim, Meeyoung Cha, Sangyoon Park, Jihee Kim

Main category: cs.CL

TL;DR: A novel benchmark for causal reasoning in LLMs using real-world economic/finance data reveals major limitations, with best models achieving only 57.6% accuracy and scale not consistently improving performance.

Details

Motivation: Existing causal reasoning benchmarks suffer from synthetic data limitations and narrow domain coverage, creating a need for more realistic evaluation using rigorous causal identification methods from real research.

Method: Constructed benchmark using casually identified relationships from top economics/finance journals, employing rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Contains 40,379 items across five task types spanning multiple domains.

Result: Evaluation of eight state-of-the-art LLMs showed substantial limitations - best model achieved only 57.6% accuracy. Model scale did not consistently improve performance, and advanced reasoning models struggled with fundamental causal relationship identification.

Conclusion: There is a critical gap between current LLM capabilities and the demands of reliable causal reasoning in high-stakes applications, highlighting the need for improved causal reasoning abilities in language models.

Abstract: Causal reasoning is fundamental for Large Language Models (LLMs) to understand genuine cause-and-effect relationships beyond pattern matching. Existing benchmarks suffer from critical limitations such as reliance on synthetic data and narrow domain coverage. We introduce a novel benchmark constructed from casually identified relationships extracted from top-tier economics and finance journals, drawing on rigorous methodologies including instrumental variables, difference-in-differences, and regression discontinuity designs. Our benchmark comprises 40,379 evaluation items covering five task types across domains such as health, environment, technology, law, and culture. Experimental results on eight state-of-the-art LLMs reveal substantial limitations, with the best model achieving only 57.6% accuracy. Moreover, model scale does not consistently translate to superior performance, and even advanced reasoning models struggle with fundamental causal relationship identification. These findings underscore a critical gap between current LLM capabilities and demands of reliable causal reasoning in high-stakes applications.

[98] Customer-R1: Personalized Simulation of Human Behaviors via RL-based LLM Agent in Online Shopping

Ziyi Wang, Yuxuan Lu, Yimeng Zhang, Jing Huang, Dakuo Wang

Main category: cs.CL

TL;DR: Customer-R1 is an RL-based method that enables LLMs to simulate personalized user behavior in online shopping by conditioning on explicit user personas and optimizing next-step rationale and action generation.

Details

Motivation: Prior methods for simulating step-wise human behavior with LLMs learn population-level policies without considering individual user personas, resulting in generic rather than personalized simulations.

Method: Customer-R1 uses reinforcement learning with action correctness reward signals to optimize next-step rationale and action generation, conditioned on explicit user personas.

Result: Experiments on OPeRA dataset show Customer-R1 significantly outperforms prompting and SFT baselines in next-action prediction and better matches users’ action distribution, indicating higher fidelity in personalized behavior simulation.

Conclusion: The proposed RL-based approach with persona conditioning enables more accurate and personalized simulation of step-wise user behavior compared to existing methods.

Abstract: Simulating step-wise human behavior with Large Language Models (LLMs) has become an emerging research direction, enabling applications in various practical domains. While prior methods, including prompting, supervised fine-tuning (SFT), and reinforcement learning (RL), have shown promise in modeling step-wise behavior, they primarily learn a population-level policy without conditioning on a user’s persona, yielding generic rather than personalized simulations. In this work, we pose a critical question: how can LLM agents better simulate personalized user behavior? We introduce Customer-R1, an RL-based method for personalized, step-wise user behavior simulation in online shopping environments. Our policy is conditioned on an explicit persona, and we optimize next-step rationale and action generation via action correctness reward signals. Experiments on the OPeRA dataset emonstrate that Customer-R1 not only significantly outperforms prompting and SFT-based baselines in next-action prediction tasks, but also better matches users' action distribution, indicating higher fidelity in personalized behavior simulation.

[99] LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation

Joseph Enguehard, Morgane Van Ermengem, Kate Atkinson, Sujeong Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah R Marlowe, Carina Suzana Negreanu, Kitty Boxall, Diana Mincu

Main category: cs.CL

TL;DR: This paper introduces a novel reference-free evaluation method for LLM outputs in legal domain using Legal Data Points (LDPs), which outperforms existing baselines and better correlates with human expert evaluations.

Details

Motivation: Current LLM evaluation methods in legal domain are limited - they either require costly reference data or use standardized assessment methods that don't reflect how lawyers actually evaluate legal answers, leading to unreliable and variable results.

Method: Break down lengthy legal responses into ‘Legal Data Points’ (LDPs) - self-contained information units, and develop a reference-free evaluation methodology that mimics how lawyers evaluate legal answers.

Result: The method outperforms various baselines on both proprietary and LegalBench datasets, shows higher correlation with human expert evaluations, and improves inter-annotator agreement.

Conclusion: The proposed Legal Data Points approach provides a more reliable and effective way to evaluate LLM outputs in legal contexts, and the authors open-source their LDPs to enable community replication and advancement.

Abstract: Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into ‘Legal Data Points’ (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.

[100] LAD-RAG: Layout-aware Dynamic RAG for Visually-Rich Document Understanding

Zhivar Sourati, Zheng Wang, Marianne Menglin Liu, Yazhe Hu, Mengqing Guo, Sujeeth Bharadwaj, Kyu Han, Tao Sheng, Sujith Ravi, Morteza Dehghani, Dan Roth

Main category: cs.CL

TL;DR: LAD-RAG is a Layout-Aware Dynamic RAG framework that improves question answering over visually rich documents by capturing structural layout and cross-page dependencies during ingestion and enabling adaptive retrieval during inference.

Details

Motivation: Conventional RAG methods lose structural and cross-page dependencies when encoding content in isolated chunks, and retrieve fixed numbers of pages regardless of question demands, leading to incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks.

Method: During ingestion, constructs a symbolic document graph capturing layout structure and cross-page dependencies alongside standard neural embeddings. During inference, uses an LLM agent to dynamically interact with neural and symbolic indices for adaptive evidence retrieval based on the query.

Result: Achieves over 90% perfect recall on average without top-k tuning, outperforms baseline retrievers by up to 20% in recall at comparable noise levels, and yields higher QA accuracy with minimal latency on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA datasets.

Conclusion: LAD-RAG effectively addresses limitations of conventional RAG methods by incorporating layout awareness and dynamic retrieval, significantly improving performance on multi-page document QA tasks.

Abstract: Question answering over visually rich documents (VRDs) requires reasoning not only over isolated content but also over documents’ structural organization and cross-page dependencies. However, conventional retrieval-augmented generation (RAG) methods encode content in isolated chunks during ingestion, losing structural and cross-page dependencies, and retrieve a fixed number of pages at inference, regardless of the specific demands of the question or context. This often results in incomplete evidence retrieval and degraded answer quality for multi-page reasoning tasks. To address these limitations, we propose LAD-RAG, a novel Layout-Aware Dynamic RAG framework. During ingestion, LAD-RAG constructs a symbolic document graph that captures layout structure and cross-page dependencies, adding it alongside standard neural embeddings to yield a more holistic representation of the document. During inference, an LLM agent dynamically interacts with the neural and symbolic indices to adaptively retrieve the necessary evidence based on the query. Experiments on MMLongBench-Doc, LongDocURL, DUDE, and MP-DocVQA demonstrate that LAD-RAG improves retrieval, achieving over 90% perfect recall on average without any top-k tuning, and outperforming baseline retrievers by up to 20% in recall at comparable noise levels, yielding higher QA accuracy with minimal latency.

[101] When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation

Xunyi Jiang, Dingyi Chang, Julian McAuley, Xin Xu

Main category: cs.CL

TL;DR: This paper investigates the aging problem in popular LLM factuality benchmarks, showing that outdated samples lead to unreliable assessments of modern LLMs’ factuality.

Details

Motivation: The static nature of evaluation benchmarks can't keep up with rapidly evolving LLMs and real-world facts, creating temporal misalignment that affects reliability of LLM factuality evaluation.

Method: Systematic investigation using five popular factuality benchmarks and eight LLMs across different years, with an up-to-date fact retrieval pipeline and three tailored metrics to quantify benchmark aging and its impact.

Result: A considerable portion of samples in widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality.

Conclusion: The work provides a testbed to assess benchmark reliability for LLM factuality evaluation and aims to inspire more research on the benchmark aging issue.

Abstract: The rapid evolution of large language models (LLMs) and the real world has outpaced the static nature of widely used evaluation benchmarks, raising concerns about their reliability for evaluating LLM factuality. While substantial works continue to rely on the popular but old benchmarks, their temporal misalignment with real-world facts and modern LLMs, and their effects on LLM factuality evaluation remain underexplored. Therefore, in this work, we present a systematic investigation of this issue by examining five popular factuality benchmarks and eight LLMs released across different years. An up-to-date fact retrieval pipeline and three metrics are tailored to quantify benchmark aging and its impact on LLM factuality evaluation. Experimental results and analysis illustrate that a considerable portion of samples in the widely used factuality benchmarks are outdated, leading to unreliable assessments of LLM factuality. We hope our work can provide a testbed to assess the reliability of a benchmark for LLM factuality evaluation and inspire more research on the benchmark aging issue. Codes are available in https://github.com/JiangXunyi/BenchAge.

[102] Online Rubrics Elicitation from Pairwise Comparisons

MohammadHossein Rezaei, Robert Vacareanu, Zihao Wang, Clinton Wang, Yunzhong He, Afra Feyza Akyürek

Main category: cs.CL

TL;DR: OnlineRubrics is a method that dynamically updates evaluation criteria during LLM training through pairwise comparisons, preventing reward hacking and capturing emergent requirements, achieving up to 8% improvement over static rubrics.

Details

Motivation: Static rubrics in LLM training are vulnerable to reward hacking and fail to capture emergent desiderata that arise during training, limiting their effectiveness for open-ended long-form answers.

Method: Online Rubrics Elicitation (OnlineRubrics) dynamically curates evaluation criteria through pairwise comparisons of responses from current and reference policies, enabling continuous identification and mitigation of errors during training.

Result: The approach yields consistent improvements of up to 8% over training with static rubrics across multiple benchmarks including AlpacaEval, GPQA, ArenaHard, and expert question validation sets.

Conclusion: Dynamic rubric elicitation through pairwise comparisons effectively captures emergent training requirements and prevents reward hacking, leading to significant performance improvements in LLM post-training.

Abstract: Rubrics provide a flexible way to train LLMs on open-ended long-form answers where verifiable rewards are not applicable and human preferences provide coarse signals. Prior work shows that reinforcement learning with rubric-based rewards leads to consistent gains in LLM post-training. Most existing approaches rely on rubrics that remain static over the course of training. Such static rubrics, however, are vulnerable to reward-hacking type behaviors and fail to capture emergent desiderata that arise during training. We introduce Online Rubrics Elicitation (OnlineRubrics), a method that dynamically curates evaluation criteria in an online manner through pairwise comparisons of responses from current and reference policies. This online process enables continuous identification and mitigation of errors as training proceeds. Empirically, this approach yields consistent improvements of up to 8% over training exclusively with static rubrics across AlpacaEval, GPQA, ArenaHard as well as the validation sets of expert questions and rubrics. We qualitatively analyze the elicited criteria and identify prominent themes such as transparency, practicality, organization, and reasoning.

[103] Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts

Christos Ziakas, Nicholas Loo, Nishita Jain, Alessandra Russo

Main category: cs.CL

TL;DR: Red-Bandit is an adaptive red-teaming framework that uses LoRA experts specialized in different attack styles and a multi-armed bandit policy to dynamically select attacks based on target model responses, achieving state-of-the-art results while producing human-readable prompts.

Details

Motivation: Existing automated red-teaming approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference time, limiting their effectiveness in identifying and exploiting unique failure modes of different LLMs.

Method: Post-trains parameter-efficient LoRA experts specialized for different attack styles using reinforcement learning with rule-based safety model rewards. Uses multi-armed bandit policy to dynamically select attack-style experts based on target model’s response safety.

Result: Achieves state-of-the-art results on AdvBench (ASR@10) under sufficient exploration, produces more human-readable prompts (lower perplexity), and serves as diagnostic tool for uncovering model-specific vulnerabilities.

Conclusion: Red-Bandit provides an effective adaptive framework for automated red-teaming that balances exploration and exploitation while offering diagnostic insights into model vulnerabilities through attack style preferences.

Abstract: Automated red-teaming has emerged as a scalable approach for auditing Large Language Models (LLMs) prior to deployment, yet existing approaches lack mechanisms to efficiently adapt to model-specific vulnerabilities at inference. We introduce Red-Bandit, a red-teaming framework that adapts online to identify and exploit model failure modes under distinct attack styles (e.g., manipulation, slang). Red-Bandit post-trains a set of parameter-efficient LoRA experts, each specialized for a particular attack style, using reinforcement learning that rewards the generation of unsafe prompts via a rule-based safety model. At inference, a multi-armed bandit policy dynamically selects among these attack-style experts based on the target model’s response safety, balancing exploration and exploitation. Red-Bandit achieves state-of-the-art results on AdvBench under sufficient exploration (ASR@10), while producing more human-readable prompts (lower perplexity). Moreover, Red-Bandit’s bandit policy serves as a diagnostic tool for uncovering model-specific vulnerabilities by indicating which attack styles most effectively elicit unsafe behaviors.

[104] Hybrid Reinforcement: When Reward Is Sparse, It’s Better to Be Dense

Leitian Tao, Ilia Kulikov, Swarnadeep Saha, Tianlu Wang, Jing Xu, Yixuan Li, Jason E Weston, Ping Yu

Main category: cs.CL

TL;DR: HERO is a hybrid reinforcement learning framework that combines verifier signals with reward model scores using stratified normalization and variance-aware weighting to improve reasoning in large language models.

Details

Motivation: Verifiable rewards provide reliable but brittle binary feedback, while reward models offer richer continuous feedback. The goal is to combine both to overcome limitations of all-or-nothing supervision.

Method: HERO integrates verifier signals with reward model scores through stratified normalization (bounding reward scores within verifier groups) and variance-aware weighting (emphasizing challenging prompts).

Result: HERO consistently outperforms RM-only and verifier-only baselines across diverse mathematical reasoning benchmarks, with strong gains on both verifiable and hard-to-verify tasks.

Conclusion: Hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning capabilities in LLMs.

Abstract: Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0-1 correctness signals. While reliable, such binary feedback is brittle–many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.

[105] Don’t Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models

Jonggeun Lee, Woojung Song, Jongwook Han, Haesung Pyun, Yohan Jo

Main category: cs.CL

TL;DR: PA-Tool adapts tool schemas to align with small language models’ pretrained knowledge, reducing schema misalignment errors by 80% and improving performance by up to 17 percentage points without retraining.

Details

Motivation: Small language models struggle with tool-use tasks due to schema misalignment, where they hallucinate tool names based on pretraining knowledge that don't match provided schemas.

Method: PA-Tool uses peakedness (a contamination detection signal) to automatically rename tool components by generating multiple candidates and selecting those with highest output concentration across samples.

Result: Experiments on MetaTool and RoTBench show 17% performance improvement and 80% reduction in schema misalignment errors, enabling small models to approach state-of-the-art performance.

Conclusion: Schema-level interventions can unlock tool-use potential in resource-efficient models by adapting schemas to models rather than forcing models to adapt to arbitrary schemas.

Abstract: Small language models (SLMs) offer significant computational advantages for tool-augmented AI systems, yet they struggle with tool-use tasks, particularly in selecting appropriate tools and identifying correct parameters. A common failure mode is schema misalignment: models hallucinate plausible but non-existent tool names that reflect naming conventions internalized during pretraining but absent from the provided tool schema. Rather than forcing models to adapt to arbitrary schemas, we propose adapting schemas to align with models’ pretrained knowledge. We introduce PA-Tool (Pretraining-Aligned Tool Schema Generation), a training-free method that leverages peakedness-a signal from contamination detection indicating pretraining familiarity-to automatically rename tool components. By generating multiple candidates and selecting those with highest output concentration across samples, PA-Tool identifies pretrain-aligned naming patterns. Experiments on MetaTool and RoTBench show improvements of up to 17% points, with schema misalignment errors reduced by 80%. PA-Tool enables small models to approach state-of-the-art performance while maintaining computational efficiency for adaptation to new tools without retraining. Our work demonstrates that schema-level interventions can unlock the tool-use potential of resource-efficient models by adapting schemas to models rather than models to schemas.

[106] On the Convergence of Moral Self-Correction in Large Language Models

Guangliang Liu, Haitao Mao, Bochuan Cao, Zhiyu Xue, Xitong Zhang, Rongrong Wang, Kristen Marie Johnson

Main category: cs.CL

TL;DR: LLMs can self-correct moral responses through multi-round interactions, achieving performance convergence by activating stable moral concepts that reduce uncertainty.

Details

Motivation: To understand how and why intrinsic self-correction works in LLMs, particularly for moral reasoning, since empirical success exists but mechanisms remain unknown.

Method: Mechanistic analysis of moral self-correction through multi-round interactions, examining how self-correction instructions activate moral concepts and reduce model uncertainty.

Result: Intrinsic self-correction exhibits performance convergence - consistently injected instructions activate moral concepts that stabilize over rounds, reducing uncertainty and leading to converged performance.

Conclusion: Moral self-correction demonstrates strong potential with the desirable property of converged performance, showing how activated moral concepts stabilize through multi-round interactions.

Abstract: Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only a general and abstract goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. Focusing on moral self-correction in LLMs, we reveal a key characteristic of intrinsic self-correction: performance convergence through multi-round interactions; and provide a mechanistic analysis of this convergence behavior. Based on our experimental results and analysis, we uncover the underlying mechanism of convergence: consistently injected self-correction instructions activate moral concepts that reduce model uncertainty, leading to converged performance as the activated moral concepts stabilize over successive rounds. This paper demonstrates the strong potential of moral self-correction by showing that it exhibits a desirable property of converged performance.

[107] Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning

Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Kaiyu Huang, Yufeng Chen, Jinan Xu, Jie Zhou

Main category: cs.CL

TL;DR: M-Thinker addresses language inconsistency and poor reasoning in non-English languages for Large Reasoning Models using GRPO algorithm with Language Consistency and Cross-lingual Thinking Alignment rewards.

Details

Motivation: Current LRMs struggle with input-output language consistency and perform poorly with wrong reasoning paths in non-English languages, degrading user experience and hindering global deployment.

Method: Proposes M-Thinker trained with GRPO algorithm featuring Language Consistency reward for input-thought-answer consistency and Cross-lingual Thinking Alignment reward to transfer reasoning capability from English to non-English languages.

Result: M-Thinker-1.5B/7B models achieve nearly 100% language consistency and superior performance on MMATH and PolyMath benchmarks, with excellent generalization to out-of-domain languages.

Conclusion: M-Thinker effectively solves language inconsistency and reasoning quality issues in non-English LRMs through innovative reward mechanisms, enabling better global deployment.

Abstract: Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the “think-then-answer” paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) They often struggle to maintain input-output language consistency; (2) They generally perform poorly with wrong reasoning paths and lower answer accuracy compared to English. These limitations significantly degrade the user experience for non-English speakers and hinder the global deployment of LRMs. To address these limitations, we propose M-Thinker, which is trained by the GRPO algorithm that involves a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward defines a strict constraint on the language consistency between the input, thought, and answer. Besides, the CTA reward compares the model’s non-English reasoning paths with its English reasoning path to transfer its own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization on out-of-domain languages.

[108] Vibe Checker: Aligning Code Evaluation with Human Preference

Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun

Main category: cs.CL

TL;DR: The paper introduces VeriCode, a taxonomy of 30 verifiable code instructions with deterministic verifiers, to quantify LLMs’ code instruction following capabilities beyond functional correctness, and shows this composite score best correlates with human preference.

Details

Motivation: Current code evaluation focuses only on functional correctness (pass@k), overlooking non-functional instructions that users apply in vibe coding where solutions should feel right, read cleanly, preserve intent, and remain correct.

Method: Developed VeriCode taxonomy of 30 verifiable code instructions with corresponding deterministic verifiers, augmented established evaluation suites to create Vibe Checker testbed for assessing both code instruction following and functional correctness.

Result: Evaluation of 31 leading LLMs shows even strongest models struggle with multiple instructions and exhibit functional regression. Composite score of functional correctness and instruction following best correlates with human preference, with instruction following being primary differentiator on real-world tasks.

Conclusion: Instruction following is the missing piece underlying vibe check that represents human preference in coding. The work identifies core factors of vibe check and provides path for benchmarking models that better align with user preferences.

Abstract: Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models’ code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.

[109] Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Yue Li, Ran Tao, Derek Hommel, Yusuf Denizay Dönder, Sungyong Chang, David Mimno, Unso Eun Seo Jo

Main category: cs.CL

TL;DR: CORGI is a new text-to-SQL benchmark for business intelligence that tests LLMs on complex queries requiring causal reasoning, forecasting, and recommendations, showing performance drops on high-level questions.

Details

Motivation: Existing text-to-SQL benchmarks focus on factual retrieval, but real-world business contexts require more complex reasoning like predictions and strategic recommendations.

Method: Created CORGI benchmark with synthetic databases inspired by real enterprises (Doordash, Airbnb, Lululemon) and questions across four complexity levels: descriptive, explanatory, predictive, and recommendational.

Result: LLM performance drops significantly on high-level questions, struggling with accurate predictions and actionable plans. CORGI is about 21% more difficult than BIRD benchmark.

Conclusion: There’s a significant gap between current LLM capabilities and real-world business intelligence needs, highlighting the need for benchmarks that test multi-level agentic intelligence.

Abstract: In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between popular LLMs and the need for real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.

[110] Artificial Hippocampus Networks for Efficient Long-Context Modeling

Yunhao Fang, Weihao Yu, Shu Zhong, Qinghao Ye, Xuehan Xiong, Lai Wei

Main category: cs.CL

TL;DR: The paper introduces a memory framework inspired by cognitive science that combines lossless short-term memory (sliding window KV cache) with compressed long-term memory (Artificial Hippocampus Network) to achieve efficient long-sequence modeling.

Details

Motivation: To address the fundamental trade-off between efficiency in RNN-like models and fidelity in attention-based Transformers for long-sequence modeling, drawing inspiration from the Multi-Store Model in cognitive science.

Method: Maintains a sliding window of Transformer’s KV cache as short-term memory, while using an Artificial Hippocampus Network (AHN) to recurrently compress out-of-window information into fixed-size long-term memory. AHNs are instantiated using modern RNN-like architectures like Mamba2, DeltaNet, and Gated DeltaNet.

Result: AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or superior to full-attention models while substantially reducing computational requirements. For Qwen2.5-3B-Instruct, AHNs reduce inference FLOPs by 40.5% and memory cache by 74.0%, while improving LV-Eval score from 4.41 to 5.88.

Conclusion: The proposed memory framework successfully balances efficiency and fidelity in long-sequence modeling, demonstrating significant computational savings while maintaining or improving performance compared to full-attention models.

Abstract: Long-sequence modeling faces a fundamental trade-off between the efficiency of compressive fixed-size memory in RNN-like models and the fidelity of lossless growing memory in attention-based Transformers. Inspired by the Multi-Store Model in cognitive science, we introduce a memory framework of artificial neural networks. Our method maintains a sliding window of the Transformer’s KV cache as lossless short-term memory, while a learnable module termed Artificial Hippocampus Network (AHN) recurrently compresses out-of-window information into a fixed-size compact long-term memory. To validate this framework, we instantiate AHNs using modern RNN-like architectures, including Mamba2, DeltaNet, and Gated DeltaNet. Extensive experiments on long-context benchmarks LV-Eval and InfiniteBench demonstrate that AHN-augmented models consistently outperform sliding window baselines and achieve performance comparable or even superior to full-attention models, while substantially reducing computational and memory requirements. For instance, augmenting the Qwen2.5-3B-Instruct with AHNs reduces inference FLOPs by 40.5% and memory cache by 74.0%, while improving its average score on LV-Eval (128k sequence length) from 4.41 to 5.88. Code is available at: https://github.com/ByteDance-Seed/AHN.

[111] ECLM: Entity Level Language Model for Spoken Language Understanding with Chain of Intent

Shangjian Yin, Peijie Huang, Jiatian Chen, Haojing Huang, Yuhong Xu

Main category: cs.CL

TL;DR: ECLM framework addresses LLM limitations in spoken language understanding by reformulating slot-filling as entity recognition and introducing Chain of Intent for multi-intent recognition, achieving significant performance gains over baselines.

Details

Motivation: LLMs struggle with token-level tasks due to autoregressive misalignment and fail to capture nuanced interrelations in semantic-level tasks through direct fine-tuning alone, limiting their effectiveness in spoken language understanding.

Method: Proposes Entity-level Language Model (ECLM) framework that reformulates slot-filling as entity recognition and introduces Chain of Intent concept for step-by-step multi-intent recognition.

Result: ECLM significantly outperforms strong baselines (Uni-MIS) with gains of 3.7% on MixATIS and 3.1% on MixSNIPS, and achieves 8.5% and 21.2% improvements over standard supervised fine-tuning of LLMs on these datasets respectively.

Conclusion: ECLM effectively addresses LLM limitations in SLU tasks through entity-level reformulation and structured intent recognition, demonstrating substantial performance improvements across multiple datasets.

Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in language generation and general task performance. However, their application to spoken language understanding (SLU) remains challenging, particularly for token-level tasks, where the autoregressive nature of LLMs often leads to misalignment issues. They also struggle to capture nuanced interrelations in semantic-level tasks through direct fine-tuning alone. To address these challenges, we propose the Entity-level Language Model (ECLM) framework, which reformulates slot-filling as an entity recognition task and introduces a novel concept, \textit{Chain of Intent}, to enable step-by-step multi-intent recognition. Experimental results show that ECLM significantly outperforms strong baselines such as Uni-MIS, achieving gains of 3.7% on MixATIS and 3.1% on MixSNIPS. Compared to standard supervised fine-tuning of LLMs, ECLM further achieves improvements of 8.5% and 21.2% on these datasets, respectively. Our code is available at https://github.com/SJY8460/ECLM.

[112] Approximately Aligned Decoding

Daniel Melcer, Sujan Gonugondla, Pramuditha Perera, Haifeng Qian, Wen-Hao Chiang, Yanjun Wang, Nihal Jain, Pranav Garg, Xiaofei Ma, Anoop Deoras

Main category: cs.CL

TL;DR: AprAD is a method that balances output distribution distortion with computational efficiency for rejecting undesired LLM outputs, inspired by speculative decoding algorithms.

Details

Motivation: Current methods for rejecting undesired LLM outputs require excessive computation for re-sampling or distort output distributions by constraining to improbable tokens.

Method: Approximately Aligned Decoding (AprAD), inspired by speculative decoding algorithms, generates long text sequences with difficult constraints while minimizing amplification of low probability outputs.

Result: AprAD achieves task-specific performance comparable to methods that don’t distort output distribution, while being much more computationally efficient.

Conclusion: AprAD provides an effective balance between computational efficiency and output distribution preservation for constrained LLM generation.

Abstract: It is common to reject undesired outputs of Large Language Models (LLMs); however, current methods to do so require an excessive amount of computation to re-sample after a rejection, or distort the distribution of outputs by constraining the output to highly improbable tokens. We present a method, Approximately Aligned Decoding (AprAD), to balance the distortion of the output distribution with computational efficiency, inspired by algorithms from the speculative decoding literature. AprAD allows for the generation of long sequences of text with difficult-to-satisfy constraints, while amplifying low probability outputs much less compared to existing methods. We show through a series of experiments that the task-specific performance of AprAD is comparable to methods that do not distort the output distribution, while being much more computationally efficient.

[113] SuffixDecoding: Extreme Speculative Decoding for Emerging AI Applications

Gabriele Oliaro, Zhihao Jia, Daniel Campos, Aurick Qiao

Main category: cs.CL

TL;DR: SuffixDecoding is a novel speculative decoding method that uses suffix trees to cache long token sequences from prompts and previous outputs, achieving up to 5.3× speedup in LLM inference for agentic workloads.

Details

Motivation: Current speculative decoding methods don't effectively exploit the long, highly predictable sequences in emerging AI applications like LLM-based agents, which involve repetitive inference requests such as multi-agent pipelines and self-refinement loops.

Method: Uses efficient suffix trees to cache long token sequences from prompts and previous outputs, adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low.

Result: Achieves speedups of up to 5.3× on agentic benchmarks (SWE-Bench and Text-to-SQL), outperforming state-of-the-art methods - 2.8× faster than model-based approaches like EAGLE-2/3 and 1.9× faster than model-free approaches like Token Recycling.

Conclusion: SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when opportunities are limited, making it well-suited for agentic workloads with repetitive inference patterns.

Abstract: Speculative decoding is widely adopted to reduce latency in large language model (LLM) inference by leveraging smaller draft models capable of handling diverse user tasks. However, emerging AI applications, such as LLM-based agents, present unique workload characteristics: instead of diverse independent requests, agentic frameworks typically submit repetitive inference requests, such as multi-agent pipelines performing similar subtasks or self-refinement loops iteratively enhancing outputs. These workloads result in long and highly predictable sequences, which current speculative decoding methods do not effectively exploit. To address this gap, we introduce \emph{SuffixDecoding}, a novel method that utilizes efficient suffix trees to cache long token sequences from prompts and previous outputs. By adaptively speculating more tokens when acceptance likelihood is high and fewer when it is low, SuffixDecoding effectively exploits opportunities for longer speculations while conserving computation when those opportunities are limited. Evaluations on agentic benchmarks, including SWE-Bench and Text-to-SQL, demonstrate that SuffixDecoding achieves speedups of up to 5.3$\times$, outperforming state-of-the-art methods – 2.8$\times$ faster than model-based approaches like EAGLE-2/3 and 1.9$\times$ faster than model-free approaches such as Token Recycling. SuffixDecoding is open-sourced at https://github.com/snowflakedb/ArcticInference

[114] Evil twins are not that evil: Qualitative insights into machine-generated prompts

Nathanaël Carraz Rakotonirina, Corentin Kervadec, Francesca Franzon, Marco Baroni

Main category: cs.CL

TL;DR: Analysis of machine-generated prompts (autoprompts) reveals they have intelligible last tokens, prunable tokens, filler tokens, and semantically-related keywords, with some patterns applying to natural language inputs as well.

Details

Motivation: To understand why language models respond to seemingly unintelligible machine-generated prompts, which represents both a theoretical gap in understanding LM behavior and a practical security concern for jailbreaking attacks.

Method: Conducted thorough analysis of autoprompts across 6 different LMs of varying sizes and families, examining token characteristics, pruning effects, and human expert identification of influential tokens.

Result: Found that autoprompts have intelligible last tokens that strongly affect generation, contain prunable tokens, and consist of filler tokens and semantically-related keywords. Human experts can reliably identify influential tokens, and some ablation effects generalize to natural language inputs.

Conclusion: Autoprompts are not entirely opaque and their characteristics emerge naturally from how language models process linguistic inputs in general, suggesting these patterns are fundamental to LM processing rather than artificial artifacts.

Abstract: It has been widely observed that language models (LMs) respond in predictable ways to algorithmically generated prompts that are seemingly unintelligible. This is both a sign that we lack a full understanding of how LMs work, and a practical challenge, because opaqueness can be exploited for harmful uses of LMs, such as jailbreaking. We present the first thorough analysis of opaque machine-generated prompts, or autoprompts, pertaining to 6 LMs of different sizes and families. We find that machine-generated prompts are characterized by a last token that is often intelligible and strongly affects the generation. A small but consistent proportion of the previous tokens are prunable, probably appearing in the prompt as a by-product of the fact that the optimization process fixes the number of tokens. The remaining tokens fall into two categories: filler tokens, which can be replaced with semantically unrelated substitutes, and keywords, that tend to have at least a loose semantic relation with the generation, although they do not engage in well-formed syntactic relations with it. Additionally, human experts can reliably identify the most influential tokens in an autoprompt a posteriori, suggesting these prompts are not entirely opaque. Finally, some of the ablations we applied to autoprompts yield similar effects in natural language inputs, suggesting that autoprompts emerge naturally from the way LMs process linguistic inputs in general.

[115] 2 OLMo 2 Furious

Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Shane Arora, Akshita Bhagia, Yuling Gu, Shengyi Huang, Matt Jordan, Nathan Lambert, Dustin Schwenk, Oyvind Tafjord, Taira Anderson, David Atkinson, Faeze Brahman, Christopher Clark, Pradeep Dasigi, Nouha Dziri, Allyson Ettinger, Michal Guerquin, David Heineman, Hamish Ivison, Pang Wei Koh, Jiacheng Liu, Saumya Malik, William Merrill, Lester James V. Miranda, Jacob Morrison, Tyler Murray, Crystal Nam, Jake Poznanski, Valentina Pyatkin, Aman Rangapur, Michael Schmitz, Sam Skjonsberg, David Wadden, Christopher Wilhelm, Michael Wilson, Luke Zettlemoyer, Ali Farhadi, Noah A. Smith, Hannaneh Hajishirzi

Main category: cs.CL

TL;DR: OLMo 2 is a fully open language model family (7B, 13B, 32B) with complete transparency including weights, training data, code, and checkpoints. It achieves state-of-the-art performance with improved training efficiency and specialized data curriculum.

Details

Motivation: To create fully transparent language models that match or outperform existing models while providing complete openness in weights, data, code, and training process.

Method: Modified model architecture, improved training stability techniques, Dolmino Mix 1124 specialized data via late-stage curriculum training, and Tulu 3-inspired instruction tuning with RLVR (reinforcement learning with verifiable rewards).

Result: OLMo 2 base models achieve Pareto-optimal performance-to-compute ratio, matching/exceeding Llama 3.1, Qwen 2.5, and Gemma 2 with fewer FLOPs. OLMo 2-Instruct is competitive with open-weight models and even some proprietary models like GPT-3.5 Turbo.

Conclusion: OLMo 2 demonstrates that fully open language models can achieve state-of-the-art performance while maintaining complete transparency, setting a new standard for open AI development.

Abstract: We present OLMo 2, the next generation of our fully open language models. OLMo 2 includes a family of dense autoregressive language models at 7B, 13B and 32B scales with fully released artifacts – model weights, full training data, training code and recipes, training logs and thousands of intermediate checkpoints. In this work, we describe our modified model architecture and training recipe, focusing on techniques for achieving better training stability and improved per-token efficiency. Our updated pretraining data mixture introduces a new, specialized data mix called Dolmino Mix 1124, which significantly improves model capabilities across many downstream task benchmarks when introduced via late-stage curriculum training (i.e. specialized data during the annealing phase of pretraining). Finally, we incorporate best practices from T"ulu 3 to develop OLMo 2-Instruct, focusing on permissive data and extending our final-stage reinforcement learning with verifiable rewards (RLVR). Our OLMo 2 base models sit at the Pareto frontier of performance to training compute, often matching or outperforming open-weight only models like Llama 3.1, Qwen 2.5, and Gemma 2 while using fewer FLOPs and with fully transparent training data, code, and recipe. Our fully open OLMo 2-Instruct models are competitive with open-weight only models of comparable size and even some proprietary models like GPT-3.5 Turbo and GPT 4o Mini.

[116] LatteReview: A Multi-Agent Framework for Systematic Review Automation Using Large Language Models

Pouria Rouzrokh, Bardia Khosravi, Parsa Rouzrokh, Moein Shariatnia

Main category: cs.CL

TL;DR: LatteReview is a Python framework that uses LLMs and multi-agent systems to automate systematic literature reviews, reducing manual effort while maintaining rigor.

Details

Motivation: Systematic reviews are time-consuming and labor-intensive due to manual screening, evaluation, and data extraction processes that require iterative refinement.

Method: Uses modular agents with LLMs for title/abstract screening, relevance scoring, and structured data extraction within orchestrated workflows supporting sequential/parallel review rounds and dynamic decision-making.

Result: Developed a framework with features like RAG for external context, multimodal reviews, Pydantic validation, and asynchronous programming for large datasets. Available as open-source on GitHub.

Conclusion: LatteReview provides an automated solution for systematic reviews that maintains rigor while significantly reducing manual effort through LLM-powered multi-agent systems.

Abstract: Systematic literature reviews and meta-analyses are essential for synthesizing research insights, but they remain time-intensive and labor-intensive due to the iterative processes of screening, evaluation, and data extraction. This paper introduces and evaluates LatteReview, a Python-based framework that leverages large language models (LLMs) and multi-agent systems to automate key elements of the systematic review process. Designed to streamline workflows while maintaining rigor, LatteReview utilizes modular agents for tasks such as title and abstract screening, relevance scoring, and structured data extraction. These agents operate within orchestrated workflows, supporting sequential and parallel review rounds, dynamic decision-making, and iterative refinement based on user feedback. LatteReview’s architecture integrates LLM providers, enabling compatibility with both cloud-based and locally hosted models. The framework supports features such as Retrieval-Augmented Generation (RAG) for incorporating external context, multimodal reviews, Pydantic-based validation for structured inputs and outputs, and asynchronous programming for handling large-scale datasets. The framework is available on the GitHub repository, with detailed documentation and an installable package.

[117] Benchmarking Gaslighting Negation Attacks Against Multimodal Large Language Models

Bin Zhu, Yinxuan Gui, Huiyan Qi, Jingjing Chen, Chong-Wah Ngo, Ee-Peng Lim

Main category: cs.CL

TL;DR: MLLMs are vulnerable to gaslighting negation attacks where users can persuade models to reverse correct answers with fabricated justifications, despite initially providing correct responses.

Details

Motivation: To systematically study the vulnerability of Multimodal Large Language Models (MLLMs) to conversational adversarial inputs, specifically gaslighting negation attacks where models reverse their outputs when presented with user negations.

Method: Introduce GaslightingBench benchmark with multiple-choice questions from existing datasets and generated negation prompts across 20 categories. Conduct extensive evaluations of state-of-the-art MLLMs across diverse benchmarks.

Result: All evaluated MLLMs show substantial performance drops under negation attacks. Proprietary models (Gemini-1.5-flash, GPT-4o) demonstrate better resilience than open-source models (Qwen2-VL, LLaVA). Subjective/social domains are most vulnerable, while objective domains show smaller but notable drops.

Conclusion: MLLMs struggle to maintain logical consistency under gaslighting negation attacks, revealing a fundamental robustness gap that needs addressing for developing more reliable multimodal AI systems.

Abstract: Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks. Despite their success, MLLMs remain vulnerable to conversational adversarial inputs. In this paper, we systematically study gaslighting negation attacks: a phenomenon where models, despite initially providing correct answers, are persuaded by user-provided negations to reverse their outputs, often fabricating justifications. We conduct extensive evaluations of state-of-the-art MLLMs across diverse benchmarks and observe substantial performance drops when negation is introduced. Notably, we introduce the first benchmark GaslightingBench, specifically designed to evaluate the vulnerability of MLLMs to negation arguments. GaslightingBench consists of multiple-choice questions curated from existing datasets, along with generated negation prompts across 20 diverse categories. Throughout extensive evaluation, we find that proprietary models such as Gemini-1.5-flash and GPT-4o demonstrate better resilience compared to open-source counterparts like Qwen2-VL and LLaVA, though even advanced reasoning-oriented models like Gemini-2.5-Pro remain susceptible. Our category-level analysis further shows that subjective or socially nuanced domains (e.g., Social Relation, Image Emotion) are especially fragile, while more objective domains (e.g., Geography) exhibit relatively smaller but still notable drops. Overall, all evaluated MLLMs struggle to maintain logical consistency under gaslighting negation attack. These findings highlight a fundamental robustness gap and provide insights for developing more reliable and trustworthy multimodal AI systems. Project website: https://yxg1005.github.io/GaslightingNegationAttacks/.

[118] Blessing of Multilinguality: A Systematic Analysis of Multilingual In-Context Learning

Yilei Tu, Andrew Xue, Freda Shi

Main category: cs.CL

TL;DR: Multilingual ICL using high-resource language demonstrations consistently outperforms English-only demonstrations, especially for low-resource language tasks, with multilingual exposure itself providing measurable gains.

Details

Motivation: Multilingual LLMs underperform on low-resource languages compared to high-resource languages, and there's limited understanding of when and why multilingual in-context learning works effectively for cross-lingual transfer.

Method: Systematic analysis of multilingual in-context learning using demonstrations in high-resource languages to enhance cross-lingual transfer, including ablation studies to understand the role of multilingual exposure.

Result: Mixed high-resource language demonstrations consistently outperform English-only ones across all scenarios, particularly for low-resource language tasks. Surprisingly, even irrelevant non-English sentences in prompts yield measurable performance gains.

Conclusion: Strategic use of multilingual resources can effectively bridge performance gaps for underrepresented languages, with multilingual exposure itself being a key factor in improving cross-lingual transfer.

Abstract: While multilingual large language models generally perform adequately, and sometimes even rival English performance on high-resource languages (HRLs), they often significantly underperform on low-resource languages (LRLs). Among several prompting strategies aiming at bridging the gap, multilingual in-context learning (ICL) has been particularly effective when demonstration in target languages is unavailable. However, there lacks a systematic understanding of when and why it works well. In this work, we systematically analyze multilingual ICL, using demonstrations in HRLs to enhance cross-lingual transfer. We show that demonstrations in mixed HRLs consistently outperform English-only ones across the board, particularly for tasks written in LRLs. Surprisingly, our ablation study shows that the presence of irrelevant non-English sentences in the prompt yields measurable gains, suggesting the effectiveness of multilingual exposure itself. Our results highlight the potential of strategically leveraging multilingual resources to bridge the performance gap for underrepresented languages.

[119] Diagnosing Moral Reasoning Acquisition in Language Models: Pragmatics and Generalization

Guangliang Liu, Zimo Qi, Xitong Zhang, Lei Jiang, Kristen Marie Johnson

Main category: cs.CL

TL;DR: Current learning paradigms face limitations in teaching LLMs moral reasoning due to the ‘pragmatic dilemma’ - the contextual, discourse-dependent nature of moral concepts that prevents effective generalization.

Details

Motivation: LLMs often fail at ethics-based judgments despite fine-tuning efforts, and there's debate about which learning paradigm best enhances ethical reasoning capabilities.

Method: Analyzed moral reasoning acquisition using distributional semantics theory and examining the pragmatic nature of moral discourse in LLMs.

Result: Found that moral reasoning improvements follow semantic-level task mechanisms but remain constrained by the pragmatic nature of morals in discourse (the ‘pragmatic dilemma’).

Conclusion: The pragmatic dilemma is the primary bottleneck for moral reasoning in LLMs, imposing significant limitations on generalization ability of current learning paradigms.

Abstract: Ensuring that Large Language Models (LLMs) return just responses which adhere to societal values is crucial for their broader application. Prior research has shown that LLMs often fail to perform satisfactorily on tasks requiring moral cognizance, such as ethics-based judgments. While current approaches have focused on fine-tuning LLMs with curated datasets to improve their capabilities on such tasks, choosing the optimal learning paradigm to enhance the ethical responses of LLMs remains an open research debate. In this work, we aim to address this fundamental question: can current learning paradigms enable LLMs to acquire sufficient moral reasoning capabilities? Drawing from distributional semantics theory and the pragmatic nature of moral discourse, our analysis indicates that performance improvements follow a mechanism similar to that of semantic-level tasks, and therefore remain affected by the pragmatic nature of morals latent in discourse, a phenomenon we name the pragmatic dilemma. We conclude that this pragmatic dilemma imposes significant limitations on the generalization ability of current learning paradigms, making it the primary bottleneck for moral reasoning acquisition in LLMs.

[120] Speculative Decoding and Beyond: An In-Depth Survey of Techniques

Yunhai Hu, Zining Liu, Zhenyuan Dong, Tianfan Peng, Bradley McDanel, Sai Qian Zhang

Main category: cs.CL

TL;DR: This survey provides a comprehensive taxonomy of generation-refinement frameworks that mitigate the sequential dependency bottleneck in large-scale autoregressive models, analyzing methods across various sequence tasks and deployment environments.

Details

Motivation: Sequential dependencies create fundamental bottlenecks for deploying large-scale autoregressive models in real-time applications, and traditional optimization approaches like pruning and quantization often compromise model quality.

Method: The survey categorizes generation-refinement frameworks based on generation strategies (from n-gram prediction to sophisticated draft models) and refinement mechanisms (single-pass verification and iterative approaches), analyzing both algorithmic innovations and system-level implementations.

Result: The paper demonstrates that generation-refinement frameworks can significantly mitigate the trade-off between model efficiency and quality that traditional optimization approaches face.

Conclusion: This systematic examination of theoretical frameworks and practical implementations provides a foundation for future research in efficient autoregressive decoding across text, images, and speech generation applications.

Abstract: Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches like pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks demonstrate that this trade-off can be significantly mitigated. This survey presents a comprehensive taxonomy of generation-refinement frameworks, analyzing methods across autoregressive sequence tasks. We categorize methods based on their generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (including single-pass verification and iterative approaches). Through systematic analysis of both algorithmic innovations and system-level implementations, we examine deployment strategies across computing environments and explore applications spanning text, images, and speech generation. This systematic examination of both theoretical frameworks and practical implementations provides a foundation for future research in efficient autoregressive decoding.

[121] Mind the (Belief) Gap: Group Identity in the World of LLMs

Angana Borah, Marwa Houalla, Rada Mihalcea

Main category: cs.CL

TL;DR: LLMs exhibit amplified belief congruence compared to humans, which increases misinformation dissemination and impedes learning. Proposed mitigation strategies reduce misinformation by up to 37% and enhance learning by 11%.

Details

Motivation: Social biases and belief-driven behaviors impact LLM decisions, and their ability to model group psychological characteristics in multi-agent systems remains under-explored.

Method: Multi-agent framework simulating belief congruence theory, investigating implications on misinformation dissemination and LLM learning, and proposing mitigation strategies inspired by contact hypothesis, accuracy nudges, and global citizenship framework.

Result: LLMs show amplified belief congruence across diverse contexts compared to humans. This behavior increases misinformation dissemination and impedes learning. Mitigation strategies successfully reduce misinformation by up to 37% and enhance learning by 11%.

Conclusion: Bridging social psychology and AI provides insights to navigate real-world interactions using LLMs while addressing belief-driven biases.

Abstract: Social biases and belief-driven behaviors can significantly impact Large Language Models (LLMs) decisions on several tasks. As LLMs are increasingly used in multi-agent systems for societal simulations, their ability to model fundamental group psychological characteristics remains critical yet under-explored. In this study, we present a multi-agent framework that simulates belief congruence, a classical group psychology theory that plays a crucial role in shaping societal interactions and preferences. Our findings reveal that LLMs exhibit amplified belief congruence compared to humans, across diverse contexts. We further investigate the implications of this behavior on two downstream tasks: (1) misinformation dissemination and (2) LLM learning, finding that belief congruence in LLMs increases misinformation dissemination and impedes learning. To mitigate these negative impacts, we propose strategies inspired by: (1) contact hypothesis, (2) accuracy nudges, and (3) global citizenship framework. Our results show that the best strategies reduce misinformation dissemination by up to 37% and enhance learning by 11%. Bridging social psychology and AI, our work provides insights to navigate real-world interactions using LLMs while addressing belief-driven biases.

[122] Improving Neutral Point-of-View Generation with Data- and Parameter-Efficient RL

Jessica Hoffmann, Christiane Ahlheim, Zac Yu, Aria Walfrand, Jarvis Jin, Marie Tano, Ahmad Beirami, Erin van Liemt, Nithum Thain, Hakim Sidahmed, Lucas Dixon

Main category: cs.CL

TL;DR: Parameter-efficient reinforcement learning (PE-RL) significantly improves LLMs’ ability to provide informative, diverse, and impartial answers on sensitive topics with Neutral Point of View, outperforming strong baselines including LoRA finetuning, SFT, and RLHF.

Details

Motivation: To enhance large language models' capability to answer queries on sensitive topics with a Neutral Point of View (NPOV) - providing more informative, diverse, and impartial answers.

Method: Used parameter-efficient reinforcement learning (PE-RL) and compared it against multiple strong baselines including LoRA finetuning, SFT, and RLHF. Also released SHQ-NPOV dataset created through iterative human peer-critique and annotator training.

Result: PE-RL improved overall NPOV quality from 97.06% to 99.08% compared to strongest baseline. Key features improved significantly: presence of supportive details (60.25% → 85.21%) and absence of oversimplification (68.74% → 91.43%). PE-RL also demonstrated better out-of-topic generalization.

Conclusion: PE-RL is highly effective for improving LLMs’ NPOV capabilities on sensitive topics, outperforming other methods while maintaining parameter efficiency and better generalization properties.

Abstract: The paper shows that parameter-efficient reinforcement learning (PE-RL) is a highly effective training regime to improve large language models’ (LLMs) ability to answer queries on sensitive topics with a Neutral Point of View (NPOV), i.e. to provide significantly more informative, diverse and impartial answers. This is shown by evaluating PE-RL and multiple strong baselines-including LoRA finetuning (strongest baseline), SFT and RLHF. PE-RL not only improves on overall NPOV quality compared to the strongest baseline ($97.06%\rightarrow 99.08%$), but also scores much higher on features linguists identify as key to separating sufficient answers from “great’’ answers ($60.25%\rightarrow 85.21%$ for presence of supportive details, $68.74%\rightarrow 91.43%$ for absence of oversimplification). A qualitative analysis corroborates this. Moreover, our evaluation also finds a key property of PE-RL for this task: unlike methods that update all parameters, it generalises out of topic. Finally, to enable further studies we also release the dataset, SHQ-NPOV, and provide a methodology to create such datasets through iterative rounds of human peer-critique and annotator training.

[123] MCTS-RAG: Enhancing Retrieval-Augmented Generation with Monte Carlo Tree Search

Yunhai Hu, Yilun Zhao, Chen Zhao, Arman Cohan

Main category: cs.CL

TL;DR: MCTS-RAG enhances small language models’ reasoning by combining retrieval-augmented generation with Monte Carlo Tree Search, enabling performance comparable to GPT-4o on knowledge-intensive tasks.

Details

Motivation: Standard RAG methods retrieve information independently from reasoning, leading to suboptimal knowledge integration, while conventional MCTS reasoning relies only on internal knowledge without external facts. MCTS-RAG addresses these limitations by integrating structured reasoning with adaptive retrieval.

Method: MCTS-RAG dynamically integrates retrieval and reasoning through an iterative decision-making process using Monte Carlo Tree Search. It combines structured reasoning with adaptive retrieval to enhance decision-making.

Result: Experimental results on ComplexWebQA, GPQA, and FoolMeTwice datasets show that MCTS-RAG enables small-scale language models to achieve performance comparable to frontier LLMs like GPT-4o by effectively scaling inference-time compute.

Conclusion: MCTS-RAG sets a new standard for reasoning in small-scale models by reducing hallucinations, improving factual accuracy, and ensuring response consistency through the integration of retrieval and reasoning.

Abstract: We introduce MCTS-RAG, a novel approach that enhances the reasoning capabilities of small language models on knowledge-intensive tasks by leveraging retrieval-augmented generation (RAG) to provide relevant context and Monte Carlo Tree Search (MCTS) to refine reasoning paths. MCTS-RAG dynamically integrates retrieval and reasoning through an iterative decision-making process. Unlike standard RAG methods, which typically retrieve information independently from reasoning and thus integrate knowledge suboptimally, or conventional MCTS reasoning, which depends solely on internal model knowledge without external facts, MCTS-RAG combines structured reasoning with adaptive retrieval. This integrated approach enhances decision-making, reduces hallucinations, and ensures improved factual accuracy and response consistency. The experimental results on multiple reasoning and knowledge-intensive datasets datasets (i.e., ComplexWebQA, GPQA, and FoolMeTwice) show that our method enables small-scale LMs to achieve performance comparable to frontier LLMs like GPT-4o by effectively scaling inference-time compute, setting a new standard for reasoning in small-scale models.

[124] Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

Zihao Li, Shaoxiong Ji, Hengyu Luo, Jörg Tiedemann

Main category: cs.CL

TL;DR: This study evaluates 36 continual pretraining configurations on multilingual LLMs, finding that bilingual CPT improves classification but causes language mixing in generation, code-augmented CPT boosts classification (especially for low-resource languages) with slight generation trade-offs, and traditional language classifications (altruistic/selfish/stagnant) show unexpected behaviors.

Details

Motivation: To address performance disparities in LLMs across languages and understand the effectiveness of different CPT strategies (monolingual, bilingual, code-augmented) for multilingual representation learning.

Method: Systematic evaluation of 36 CPT configurations using three multilingual base models across 30+ languages categorized as altruistic, selfish, and stagnant, spanning various resource levels.

Result: Bilingual CPT improves multilingual classification but causes language mixing in generation; code-augmented CPT enhances classification accuracy (especially for low-resource languages) with slight generation degradation; traditional language classifications show unexpected behaviors with altruistic languages negatively affecting related languages.

Conclusion: Multilingual representation learning is complex, requiring systematic studies on generalizable language classification to inform future CPT strategies, as traditional classifications don’t consistently predict cross-lingual transfer impacts.

Abstract: Large Language Models (LLMs) exhibit significant disparities in performance across languages, primarily benefiting high-resource languages while marginalizing underrepresented ones. Continual Pretraining (CPT) has emerged as a promising approach to address this imbalance, although the relative effectiveness of monolingual, bilingual, and code-augmented data strategies remains unclear. This study systematically evaluates 36 CPT configurations involving three multilingual base models, across 30+ languages categorized as altruistic, selfish, and stagnant, spanning various resource levels. Our findings reveal three major insights: (1) Bilingual CPT improves multilingual classification but often causes language mixing issues during generation. (2) Including programming code data during CPT consistently enhances multilingual classification accuracy, particularly benefiting low-resource languages, but introduces a trade-off by slightly degrading generation quality. (3) Contrary to prior work, we observe substantial deviations from language classifications according to their impact on cross-lingual transfer: Languages classified as altruistic often negatively affect related languages, selfish languages show conditional and configuration-dependent behavior, and stagnant languages demonstrate surprising adaptability under certain CPT conditions. These nuanced interactions emphasize the complexity of multilingual representation learning, underscoring the importance of systematic studies on generalizable language classification to inform future multilingual CPT strategies.

[125] GlotEval: A Test Suite for Massively Multilingual Evaluation of Large Language Models

Hengyu Luo, Zihao Li, Joseph Attieh, Sawal Devkota, Ona de Gibert, Xu Huang, Shaoxiong Ji, Peiqin Lin, Bhavani Sai Praneeth Varma Mantina, Ananda Sreenidhi, Raúl Vázquez, Mengjie Wang, Samea Yusofi, Fei Yuan, Jörg Tiedemann

Main category: cs.CL

TL;DR: GlotEval is a lightweight framework for massively multilingual evaluation of LLMs, addressing the gap in existing English-centric evaluation frameworks by supporting seven key tasks across dozens to hundreds of languages.

Details

Motivation: Existing evaluation frameworks disproportionately focus on English and high-resource languages, overlooking realistic performance of LLMs in multilingual and lower-resource scenarios, which is a major challenge for academia and industry.

Method: Introduces GlotEval framework with consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. Supports seven key tasks: machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation.

Result: A multilingual translation case study demonstrates GlotEval’s applicability for both multilingual and language-specific evaluations, enabling precise diagnosis of model strengths and weaknesses in diverse linguistic contexts.

Conclusion: GlotEval provides a comprehensive solution for evaluating LLMs across diverse linguistic environments, particularly addressing the needs of low-resource languages and enabling more realistic assessment of model performance in multilingual scenarios.

Abstract: Large language models (LLMs) are advancing at an unprecedented pace globally, with regions increasingly adopting these models for applications in their primary language. Evaluation of these models in diverse linguistic environments, especially in low-resource languages, has become a major challenge for academia and industry. Existing evaluation frameworks are disproportionately focused on English and a handful of high-resource languages, thereby overlooking the realistic performance of LLMs in multilingual and lower-resource scenarios. To address this gap, we introduce GlotEval, a lightweight framework designed for massively multilingual evaluation. Supporting seven key tasks (machine translation, text classification, summarization, open-ended generation, reading comprehension, sequence labeling, and intrinsic evaluation), spanning over dozens to hundreds of languages, GlotEval highlights consistent multilingual benchmarking, language-specific prompt templates, and non-English-centric machine translation. This enables a precise diagnosis of model strengths and weaknesses in diverse linguistic contexts. A multilingual translation case study demonstrates GlotEval’s applicability for multilingual and language-specific evaluations.

[126] Geometry of Semantics in Next-Token Prediction: How Optimization Implicitly Organizes Linguistic Representations

Yize Zhao, Christos Thrampoulidis

Main category: cs.CL

TL;DR: Next-token prediction optimization implicitly guides language models to perform SVD on context-to-next-token co-occurrence patterns, revealing semantic hierarchies where broad categories emerge before fine-grained ones.

Details

Motivation: To understand how next-token prediction optimization leads language models to extract and organize semantic structure from text, bridging classical distributional semantics with neural collapse geometry.

Method: Used a tractable mathematical model and controlled synthetic data to analyze how NTP optimization implicitly factorizes context-to-next-token co-occurrence patterns via SVD, and developed orthant-based clustering to identify semantic categories.

Result: Learned word and context embeddings converge to SVD factors of the co-occurrence matrix, with singular vectors encoding latent semantic concepts. Concepts with larger singular values are learned earlier, creating semantic hierarchies. Validated on synthetic data and pretrained models, recovering grammatical categories, named entities, and topical distinctions.

Conclusion: Gradient-based optimization implicitly determines both the matrix representation and factorization method that encode semantic structure, providing a bridge between classical distributional semantics and neural collapse geometry.

Abstract: We investigate how next-token prediction (NTP) optimization leads language models to extract and organize semantic structure from text. Our analysis, based on a tractable mathematical model and controlled synthetic data, reveals that NTP implicitly guides models to factor a centered support matrix encoding context-to-next-token co-occurrence patterns via singular value decomposition (SVD). While models never explicitly construct this matrix, learned word and context embeddings converge to its SVD factors, with singular vectors encoding latent semantic concepts through their sign patterns. We demonstrate that concepts corresponding to larger singular values are learned earlier during training, yielding a natural semantic hierarchy where broad categories emerge before fine-grained ones. This insight motivates orthant-based clustering, a method that combines concept signs to identify interpretable semantic categories. We validate our findings on synthetic datasets and pretrained language models, recovering diverse semantic structures such as grammatical categories, named entity types, and topical distinctions (medical, entertainment). Our work bridges classical distributional semantics and neural collapse geometry, characterizing how gradient-based optimization implicitly determines both the matrix representation and factorization method that encode semantic structure.

Maitreya Prafulla Chitale, Ketaki Mangesh Shetye, Harshit Gupta, Manav Chaudhary, Manish Shrivastava, Vasudeva Varma

Main category: cs.CL

TL;DR: AutoRev is an automatic peer-review system using multi-modal RAG with graph-based document modeling to provide high-quality feedback, reducing LLM context length and improving review generation.

Details

Motivation: To enhance quality and efficiency of academic publishing by supporting both authors and reviewers with actionable feedback, addressing challenges in scholarly communication.

Method: Multi-Modal Retrieval-Augmented Generation (RAG) framework combining textual and graphical representations, modeling documents as graphs to retrieve pertinent information and reduce LLM input context.

Result: Outperforms state-of-the-art baselines by up to 58.72% and shows competitive performance in human evaluations against ground truth reviews.

Conclusion: AutoRev can streamline peer-review workflow, alleviate challenges, enable scalable high-quality publishing, and accelerate dissemination of quality research at larger scale.

Abstract: Enhancing the quality and efficiency of academic publishing is critical for both authors and reviewers, as research papers are central to scholarly communication and a major source of high-quality content on the web. To support this goal, we propose AutoRev, an automatic peer-review system designed to provide actionable, high-quality feedback to both reviewers and authors. AutoRev leverages a novel Multi-Modal Retrieval-Augmented Generation (RAG) framework that combines textual and graphical representations of academic papers. By modelling documents as graphs, AutoRev effectively retrieves the most pertinent information, significantly reducing the input context length for LLMs and thereby enhancing their review generation capabilities. Experimental results show that AutoRev outperforms state-of-the-art baselines by up to 58.72% and demonstrates competitive performance in human evaluations against ground truth reviews. We envision AutoRev as a powerful tool to streamline the peer-review workflow, alleviating challenges and enabling scalable, high-quality scholarly publishing. By guiding both authors and reviewers, AutoRev has the potential to accelerate the dissemination of quality research on the web at a larger scale. Code will be released upon acceptance.

[128] HopWeaver: Cross-Document Synthesis of High-Quality and Authentic Multi-Hop Questions

Zhiyu Shen, Jiyuan Liu, Yunhe Pang, Yanghui Rao

Main category: cs.CL

TL;DR: HopWeaver is a cross-document framework that automatically synthesizes authentic multi-hop questions without human intervention, achieving comparable quality to human-annotated datasets at lower cost.

Details

Motivation: Creating extensive and high-quality MHQA datasets is challenging due to expensive manual annotation and limitations of current synthesis methods that produce simplistic questions or require extensive manual guidance.

Method: HopWeaver synthesizes bridge and comparison questions through an innovative pipeline that identifies complementary documents and constructs authentic reasoning paths to ensure true multi-hop reasoning.

Result: Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost.

Conclusion: HopWeaver provides a valuable tool for automatically generating challenging benchmarks from any raw corpus, opening new avenues for evaluation and targeted training to improve reasoning capabilities of QA models, especially in resource-scarce domains.

Abstract: Multi-Hop Question Answering (MHQA) is crucial for evaluating the model’s capability to integrate information from diverse sources. However, creating extensive and high-quality MHQA datasets is challenging: (i) manual annotation is expensive, and (ii) current synthesis methods often produce simplistic questions or require extensive manual guidance. This paper introduces HopWeaver, the first cross-document framework synthesizing authentic multi-hop questions without human intervention. HopWeaver synthesizes bridge and comparison questions through an innovative pipeline that identifies complementary documents and constructs authentic reasoning paths to ensure true multi-hop reasoning. We further present a comprehensive system for evaluating the synthesized multi-hop questions. Empirical evaluations demonstrate that the synthesized questions achieve comparable or superior quality to human-annotated datasets at a lower cost. Our framework provides a valuable tool for the research community: it can automatically generate challenging benchmarks from any raw corpus, which opens new avenues for both evaluation and targeted training to improve the reasoning capabilities of advanced QA models, especially in domains with scarce resources.

[129] FlowKV: Enhancing Multi-Turn Conversational Coherence in LLMs via Isolated Key-Value Cache Management

Xiang Liu, Hong Chen, Xuming Hu, Xiaowen Chu

Main category: cs.CL

TL;DR: FlowKV introduces a multi-turn isolation mechanism for KV Cache management that prevents re-compression of older context, significantly improving performance in multi-turn conversations.

Details

Motivation: Current KV Cache management in LLMs suffers from linear growth costs and performance degradation due to repeated compression of early conversational context, leading to information loss and context forgetting.

Method: FlowKV uses a multi-turn isolation mechanism that preserves accumulated compressed KV cache from past turns and applies compression only to newly generated KV pairs of the latest turn, preventing re-compression of older context.

Result: FlowKV significantly outperforms baseline strategies, improving instruction-following accuracy and user preference retention from 10.90% to 75.40%, especially in later conversational turns.

Conclusion: FlowKV provides an effective training-free solution for KV Cache management that mitigates catastrophic forgetting and maintains conversational performance across multiple turns.

Abstract: Large Language Models (LLMs) are increasingly deployed in multi-turn conversational applications, where the management of the Key-Value (KV) Cache presents a significant bottleneck. The linear growth of the KV Cache with dialogue history imposes substantial computational costs, and existing eviction strategies often degrade performance by repeatedly compressing early conversational context, leading to information loss and context forgetting. This paper introduces FlowKV, a novel \textbf{multi-turn isolation mechanism} for KV Cache management, which can be applied to any KV Cache compression method without training. FlowKV’s core innovation is a multi-turn isolation mechanism that preserves the accumulated compressed KV cache from past turns. Compression is then strategically applied only to the newly generated KV pairs of the latest completed turn, effectively preventing the re-compression of older context and thereby mitigating catastrophic forgetting. Our results demonstrate that FlowKV consistently and significantly outperforms baseline strategies in maintaining instruction-following accuracy and user preference retention from 10.90% to 75.40%, particularly in later conversational turns.

[130] Do RAG Systems Really Suffer From Positional Bias?

Florin Cuconasu, Simone Filice, Guy Horowitz, Yoelle Maarek, Fabrizio Silvestri

Main category: cs.CL

TL;DR: Positional bias in LLMs has marginal impact in real RAG scenarios because retrieval systems bring both relevant and distracting passages to top positions, causing both to be penalized equally by the bias.

Details

Motivation: To investigate how positional bias affects LLMs' ability to use relevant passages and their susceptibility to distracting passages in Retrieval Augmented Generation systems.

Method: Extensive experiments on three benchmarks analyzing state-of-the-art retrieval pipelines and their tendency to retrieve distracting passages alongside relevant ones.

Result: Over 60% of queries contain at least one highly distracting passage among top-10 retrieved passages. Sophisticated rearrangement strategies based on LLM positional preferences perform no better than random shuffling.

Conclusion: The impact of LLM positional bias is marginal in real RAG scenarios because retrieval systems systematically bring distracting passages to top ranks, causing both relevant and distracting passages to be equally penalized by the bias.

Abstract: Retrieval Augmented Generation enhances LLM accuracy by adding passages retrieved from an external corpus to the LLM prompt. This paper investigates how positional bias - the tendency of LLMs to weight information differently based on its position in the prompt - affects not only the LLM’s capability to capitalize on relevant passages, but also its susceptibility to distracting passages. Through extensive experiments on three benchmarks, we show how state-of-the-art retrieval pipelines, while attempting to retrieve relevant passages, systematically bring highly distracting ones to the top ranks, with over 60% of queries containing at least one highly distracting passage among the top-10 retrieved passages. As a result, the impact of the LLM positional bias, which in controlled settings is often reported as very prominent by related works, is actually marginal in real scenarios since both relevant and distracting passages are, in turn, penalized. Indeed, our findings reveal that sophisticated strategies that attempt to rearrange the passages based on LLM positional preferences do not perform better than random shuffling.

[131] SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen

Main category: cs.CL

TL;DR: SimpleDeepSearcher is a lightweight RAG framework that uses strategic data engineering to create high-quality training data from live web search environments, achieving strong performance with only 871 curated samples.

Details

Motivation: Existing RAG systems face limitations including lack of high-quality training trajectories, distribution mismatches in simulated environments, and prohibitive computational costs for real-world deployment.

Method: Synthesizes high-quality training data by simulating realistic user interactions in live web search environments, using multi-criteria curation to optimize diversity and quality of inputs and outputs. Uses supervised fine-tuning (SFT) on curated samples.

Result: Experiments on five benchmarks across diverse domains show that SFT on only 871 curated samples yields significant improvements over RL-based baselines.

Conclusion: Establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems.

Abstract: Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations that lack high-quality training trajectories or suffer from the distributional mismatches in simulated environments and prohibitive computational costs for real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of input and output side. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.

[132] The Unreasonable Effectiveness of Model Merging for Cross-Lingual Transfer in LLMs

Lucas Bandarkar, Nanyun Peng

Main category: cs.CL

TL;DR: Modular fine-tuning approaches improve cross-lingual math reasoning by separating math and language parameter updates, with Layer-Swapping emerging as the most effective method.

Details

Motivation: LLMs struggle with tasks in low-resource languages, especially when task-specific data is scarce. The paper aims to improve cross-lingual transfer for mathematical reasoning where in-language math data is unavailable.

Method: Developed modular frameworks using parameter freezing and model merging to separate math and language parameterization. Tested across 3 languages, 4 models, and 2 fine-tuning paradigms (full and LoRA). The most successful method was fine-tuning separate language and math experts with Layer-Swapping model merging.

Result: Modular approaches consistently outperformed baselines. Layer-Swapping was the most effective method, with empirical evidence showing that reverting less useful fine-tuning updates after training often works better than freezing them from the start.

Conclusion: The separability of math and language parameters enables successful modular fine-tuning. Layer-Swapping proves surprisingly effective, supported by linearity of task vectors theory, suggesting that post-training parameter adjustment can be more beneficial than preemptive freezing.

Abstract: Large language models (LLMs) still struggle across tasks outside of high-resource languages. In this work, we investigate cross-lingual transfer to lower-resource languages where task-specific post-training data is scarce. Building on prior work, we first validate that the subsets of model parameters that matter most for mathematical reasoning and multilingual capabilities are distinctly non-overlapping. To exploit this implicit separability between task and target language parameterization, we develop and analyze numerous modular frameworks to improve the composition of the two during fine-tuning. These methods generally employ freezing parameters or post hoc model merging to assign math and language improvement to different key parts of the LLM. In the absence of in-language math data, we demonstrate that the modular approaches successfully improve upon baselines across three languages, four models, and two fine-tuning paradigms (full and LoRA). Furthermore, we identify the most consistently successful modular method to be fine-tuning separate language and math experts and model merging via Layer-Swapping, somewhat surprisingly. We offer possible explanations for this result via recent works on the linearity of task vectors. We further explain this by empirically showing that reverting less useful fine-tuning updates after training often outperforms freezing them from the start.

[133] Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs

Amr Hegazy, Mostafa Elhoushi, Amr Alanwar

Main category: cs.CL

TL;DR: A lightweight trainable controller network is introduced for inference-time control of LLMs, using adaptive layer-specific weights and global scaling to apply steering patches derived from refusal direction vectors, improving safety without fine-tuning.

Details

Motivation: To address the limitations of costly fine-tuning and lack of fine-grained, adaptive mechanisms in existing activation steering methods for controlling undesirable LLM behaviors like unsafe content generation.

Method: A trainable controller network observes intermediate LLM activations and predicts both a global scaling factor and layer-specific weights to dynamically modulate the intensity of a steering patch derived from pre-computed refusal direction vectors across LLM layers during generation.

Result: Experiments on safety benchmarks (ToxicChat & In-The-Wild Jailbreak Prompts) show significantly increased refusal rates compared to base LLMs, with performance improvements demonstrated on Llama-3.1-8B, Llama-3.2-1B & Mistral-7B over existing methods.

Conclusion: The approach provides an efficient and adaptive method for fine-grained control over LLM behavior at inference time without altering original model parameters, enabling targeted behavioral modification.

Abstract: Controlling undesirable Large Language Model (LLM) behaviors, such as the generation of unsafe content or failing to adhere to safety guidelines, often relies on costly fine-tuning. Activation steering provides an alternative for inference-time control, but existing methods typically lack fine-grained, adaptive mechanisms. We introduce a novel approach using a lightweight, trainable controller network integrated during inference. This controller network observes specific intermediate LLM activations and predicts both a global scaling factor and layer-specific weights. The predicted global scaling factor and layer-specific weights then dynamically modulate the intensity of a steering patch, derived from a pre-computed “refusal direction” vector, applied across the LLM’s layers during generation. Trained on activations from both harmful and benign prompts, our controller learns to discriminatively apply nuanced, layer-aware interventions, activating steering primarily for harmful inputs. Experiments using safety benchmarks like ToxicChat & In-The-Wild Jailbreak Prompts demonstrate that our weighted steering controller significantly increases refusal rates compared to the base LLM, achieving targeted behavioral modification without altering the original model parameters. Our experiments with Llama-3.1-8B, Llama-3.2-1B & Mistral-7B show our approach outperforms existing methods, presenting an efficient and adaptive method for fine-grained control over LLM behavior at inference time.

[134] 360-LLaMA-Factory: Plug & Play Sequence Parallelism for Long Post-Training

Haosheng Zou, Xiaowei Lv, Shousheng Jia, Lin Li, Xiaochun Gong, Xiangzheng Zhang

Main category: cs.CL

TL;DR: 360-LLaMA-Factory is an open-source framework that adds sequence parallelism to LLaMA-Factory, widely used in various models and company training frameworks.

Details

Motivation: To enhance the LLaMA-Factory framework by incorporating sequence parallelism for improved efficiency in large language model training.

Method: Implemented different sequence parallel modes in the 360-LLaMA-Factory framework and shared implementation insights.

Result: The framework has been widely recognized and adopted in models like Light-R1, TinyR1, Kaggle AIMO math models, and large companies’ training frameworks.

Conclusion: 360-LLaMA-Factory successfully integrates sequence parallelism into LLaMA-Factory, providing valuable implementation insights for the community.

Abstract: Adding sequence parallelism into LLaMA-Factory, we open-sourced 360-LLaMA-Factory at https://github.com/Qihoo360/360-LLaMA-Factory. 360-LLaMA-Factory has received wide recognition and used in models such as Light-R1 arXiv:2503.10460, TinyR1 arXiv:2503.04872, Kaggle AIMO math models and also in large companies’ training frameworks. This technical report delves deeper into the different sequence parallel modes behind 360-LLaMA-Factory and discusses our implementation insights.

[135] LiTEx: A Linguistic Taxonomy of Explanations for Understanding Within-Label Variation in Natural Language Inference

Pingjun Hong, Beiduo Chen, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

Main category: cs.CL

TL;DR: LITEX is a linguistically-informed taxonomy for categorizing free-text explanations in NLI that captures within-label variation and improves explanation generation.

Details

Motivation: To address the overlooked challenge of within-label variation in NLI, where annotators agree on labels but provide divergent reasoning, and to systematically understand the rationales behind NLI labels.

Method: Developed LITEX taxonomy, annotated e-SNLI dataset subset, validated taxonomy reliability, analyzed alignment with labels/highlights/explanations, and assessed taxonomy’s usefulness in explanation generation.

Result: LITEX taxonomy reliably captures within-label variation and conditioning generation on LITEX yields explanations linguistically closer to human explanations than using only labels or highlights.

Conclusion: LITEX taxonomy effectively captures within-label variation and bridges the gap between human and model explanations more effectively than existing strategies through taxonomy-guided generation.

Abstract: There is increasing evidence of Human Label Variation (HLV) in Natural Language Inference (NLI), where annotators assign different labels to the same premise-hypothesis pair. However, within-label variation–cases where annotators agree on the same label but provide divergent reasoning–poses an additional and mostly overlooked challenge. Several NLI datasets contain highlighted words in the NLI item as explanations, but the same spans on the NLI item can be highlighted for different reasons, as evidenced by free-text explanations, which offer a window into annotators’ reasoning. To systematically understand this problem and gain insight into the rationales behind NLI labels, we introduce LITEX, a linguistically-informed taxonomy for categorizing free-text explanations in English. Using this taxonomy, we annotate a subset of the e-SNLI dataset, validate the taxonomy’s reliability, and analyze how it aligns with NLI labels, highlights, and explanations. We further assess the taxonomy’s usefulness in explanation generation, demonstrating that conditioning generation on LITEX yields explanations that are linguistically closer to human explanations than those generated using only labels or highlights. Our approach thus not only captures within-label variation but also shows how taxonomy-guided generation for reasoning can bridge the gap between human and model explanations more effectively than existing strategies.

[136] InfiMed: Low-Resource Medical MLLMs with Advancing Understanding and Reasoning

Zeyu Liu, Zhitian Hou, Guanghao Zhu, Zhijie Sang, Congkai Xie, Hongxia Yang

Main category: cs.CL

TL;DR: This paper introduces InfiMed-Series models that address limitations of MLLMs in medical domains through enhanced SFT with diverse data and reflective CoT patterns, achieving state-of-the-art performance on medical benchmarks.

Details

Motivation: MLLMs face two key challenges in medical applications: scarcity of multimodal medical datasets with sparse information, and ineffectiveness of RLVR in reliably improving medical model performance.

Method: Enhanced SFT with high-quality textual reasoning data and general multimodal data alongside medical data; synthesized reflective-pattern-injected CoT for initial reflective reasoning; developed InfiMed-SFT-3B and InfiMed-RL-3B models.

Result: InfiMed-RL-3B achieved 59.2% average accuracy across seven multimodal medical benchmarks, outperforming larger models like InternVL3-8B (57.3%). Used 188K samples in SFT and 36K in RLVR phases.

Conclusion: The proposed training strategies effectively advance MLLM performance in medical scenarios, with both SFT and RLVR phases contributing to superior results on medical benchmarks.

Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in domains such as visual understanding and mathematical reasoning. However, their application in the medical domain is constrained by two key challenges: (1) multimodal medical datasets are scarce and often contain sparse information, limiting reasoning depth; and (2) Reinforcement Learning with Verifiable Rewards (RLVR), though effective in general domains, cannot reliably improve model performance in the medical domain. To overcome these challenges, during the supervised fine-tuning (SFT) stage, we incorporate high-quality textual reasoning data and general multimodal data alongside multimodal medical data to efficiently enhance foundational medical capabilities and restore the base model’s reasoning ability. Moreover, considering that there are some multimodal medical datasets with sparse information, we further synthesize reflective-pattern-injected chain-of-thought (CoT) in addition to general CoT samples, equipping the model with initial reflective reasoning capabilities that provide a structured foundation for subsequent RLVR training. Finally, we introduce our InfiMed-Series models, InfiMed-SFT-3B and InfiMed-RL-3B, both of which deliver state-of-the-art performance across seven multimodal medical benchmarks. Notably, InfiMed-RL-3B achieves an average accuracy of 59.2%, outperforming even larger models like InternVL3-8B, which achieves 57.3%. Specifically, during the SFT phase, we utilized 188K samples, while the RLVR phase incorporated 36K samples, demonstrating the efficacy of both training strategies in achieving superior performance. We also conducted a series of extensive experiments, which provide valuable insights that contribute to advancing the performance of MLLMs in medical scenarios.

[137] MIST: Towards Multi-dimensional Implicit BiaS Evaluation of LLMs via Theory of Mind

Yanlin Li, Hao Liu, Huimin Liu, Kun Wang, Yinwei Wei, Yupeng Hu

Main category: cs.CL

TL;DR: Proposes an evaluation framework using the Stereotype Content Model to detect multi-dimensional implicit bias in LLMs through indirect tasks that avoid triggering model avoidance.

Details

Motivation: Conventional direct-query methods for evaluating Theory of Mind bias in LLMs are susceptible to social desirability effects and fail to capture subtle, multi-dimensional implicit bias.

Method: Uses Stereotype Content Model (SCM) to conceptualize bias across Competence, Sociability, and Morality dimensions. Introduces two indirect tasks: Word Association Bias Test (WABT) for implicit lexical associations and Affective Attribution Test (AAT) for covert affective leanings.

Result: Experiments on 8 state-of-the-art LLMs reveal complex bias structures including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification.

Conclusion: The framework provides a more robust methodology for identifying the structural nature of implicit bias in LLMs by avoiding direct queries that trigger model avoidance.

Abstract: Theory of Mind (ToM) in Large Language Models (LLMs) refers to their capacity for reasoning about mental states, yet failures in this capacity often manifest as systematic implicit bias. Evaluating this bias is challenging, as conventional direct-query methods are susceptible to social desirability effects and fail to capture its subtle, multi-dimensional nature. To this end, we propose an evaluation framework that leverages the Stereotype Content Model (SCM) to reconceptualize bias as a multi-dimensional failure in ToM across Competence, Sociability, and Morality. The framework introduces two indirect tasks: the Word Association Bias Test (WABT) to assess implicit lexical associations and the Affective Attribution Test (AAT) to measure covert affective leanings, both designed to probe latent stereotypes without triggering model avoidance. Extensive experiments on 8 State-of-the-Art LLMs demonstrate our framework’s capacity to reveal complex bias structures, including pervasive sociability bias, multi-dimensional divergence, and asymmetric stereotype amplification, thereby providing a more robust methodology for identifying the structural nature of implicit bias.

[138] Do LLMs Overthink Basic Math Reasoning? Benchmarking the Accuracy-Efficiency Tradeoff in Language Models

Gaurav Srivastava, Aafiya Hussain, Sriram Srinivasan, Xuan Wang

Main category: cs.CL

TL;DR: LLMs often fail on basic math reasoning while generating verbose responses. This paper introduces the Overthinking Score to evaluate the accuracy-verbosity tradeoff and finds that longer reasoning doesn’t necessarily improve mathematical reasoning.

Details

Motivation: LLMs achieve impressive performance on complex mathematical benchmarks but sometimes fail on basic math reasoning while generating unnecessarily verbose responses, highlighting the need to evaluate reasoning efficiency.

Method: Introduced Overthinking Score (harmonic mean of accuracy and token-efficiency), established evaluation protocol with dynamically-generated data across 14 basic math tasks, and conducted large-scale study of 53 LLMs including reasoning and quantized variants.

Result: 1) Complex benchmark performance doesn’t translate to basic math reasoning; 2) Reasoning models generate ~18 more tokens with sometimes lower accuracy and catastrophic collapse under token constraints; 3) Accuracy-verbosity relationship is non-monotonic with diminishing returns from extended reasoning.

Conclusion: Longer reasoning in LLMs doesn’t necessarily improve mathematical reasoning, challenging common assumptions about reasoning effort and performance.

Abstract: Large language models (LLMs) achieve impressive performance on complex mathematical benchmarks yet sometimes fail on basic math reasoning while generating unnecessarily verbose responses. In this paper, we present a systematic benchmark and comprehensive empirical study to evaluate the efficiency of reasoning in LLMs, focusing on the fundamental tradeoff between accuracy and overthinking. First, we formalize the accuracy-verbosity tradeoff. Second, we introduce the Overthinking Score, a harmonic-mean metric combining accuracy and token-efficiency for holistic model evaluation. Third, we establish an evaluation protocol with dynamically-generated data across 14 basic math tasks. Fourth, we conduct a large-scale empirical study evaluating 53 LLMs, including reasoning and quantized variants across different reasoning budgets. Our findings reveal: 1) model performance on complex benchmarks does not translate directly to basic math reasoning; 2) reasoning models generate ~18 more tokens while sometimes achieving lower accuracy and exhibit catastrophic collapse when token is constrained, dropping by ~28; 3) the accuracy-verbosity relationship is non-monotonic with extended reasoning budgets yielding diminishing returns (GPT-5/o-series models show zero accuracy gain from low -> medium -> high reasoning effort). Our findings challenge the assumption that longer reasoning in LLMs necessarily improves mathematical reasoning.

[139] GIIFT: Graph-guided Inductive Image-free Multimodal Machine Translation

Jiafeng Xiong, Yuting Zhao

Main category: cs.CL

TL;DR: GIIFT is a two-stage graph-guided inductive image-free multimodal machine translation framework that uses cross-modal graph attention networks to learn multimodal knowledge and generalize it to image-free translation domains, achieving state-of-the-art performance without images during inference.

Details

Motivation: Existing MMT methods struggle with leveraging modality gaps through rigid visual-linguistic alignment and are confined to inference within their trained multimodal domains, limiting their practical application.

Method: Construct multimodal scene graphs to preserve modality-specific information and use a two-stage framework with cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space, enabling inductive generalization to image-free translation domains.

Result: Achieved state-of-the-art performance on Multi30K dataset for English-to-French and English-to-German tasks without images during inference, and showed significant improvements over image-free translation baselines on WMT benchmark.

Conclusion: GIIFT demonstrates strong inductive image-free inference capabilities, effectively transferring multimodal knowledge to broader translation domains without requiring visual inputs during inference.

Abstract: Multimodal Machine Translation (MMT) has demonstrated the significant help of visual information in machine translation. However, existing MMT methods face challenges in leveraging the modality gap by enforcing rigid visual-linguistic alignment whilst being confined to inference within their trained multimodal domains. In this work, we construct novel multimodal scene graphs to preserve and integrate modality-specific information and introduce GIIFT, a two-stage Graph-guided Inductive Image-Free MMT framework that uses a cross-modal Graph Attention Network adapter to learn multimodal knowledge in a unified fused space and inductively generalize it to broader image-free translation domains. Experimental results on the Multi30K dataset of English-to-French and English-to-German tasks demonstrate that our GIIFT surpasses existing approaches and achieves the state-of-the-art, even without images during inference. Results on the WMT benchmark show significant improvements over the image-free translation baselines, demonstrating the strength of GIIFT towards inductive image-free inference.

[140] Show or Tell? Modeling the evolution of request-making in Human-LLM conversations

Shengqi Zhu, Jeffrey M. Rzeszotarski, David Mimno

Main category: cs.CL

TL;DR: A framework for analyzing LLM user queries by segmenting them into content, roles, context, and task-independent expressions, revealing evolving interaction patterns and user behavior trends.

Details

Motivation: Understanding real user behavior with LLMs is challenging due to query variability, and existing methods mask important patterns of how people actually interact with these systems.

Method: Developed a framework to segment user queries into four components (content, roles, context, task-independent expressions) and applied it to analyze 211k real-world queries from WildChat, including diachronic analysis of user expression evolution.

Result: Found significant differences from human-human communication patterns, discovered that query patterns evolve from simple requests to context-rich interactions, and identified that users explore but eventually converge on expression patterns with experience.

Conclusion: The framework provides essential insights for user studies, computational pragmatics, and LLM alignment, revealing fundamental habitual interaction patterns beyond individual task completion.

Abstract: Designing user-centered LLM systems requires understanding how people use them, but patterns of user behavior are often masked by the variability of queries. In this work, we introduce a new framework to describe request-making that segments user input into request content, roles assigned, query-specific context, and the remaining task-independent expressions. We apply the workflow to create and analyze a dataset of 211k real-world queries based on WildChat. Compared with similar human-human setups, we find significant differences in the language for request-making in the human-LLM scenario. Further, we introduce a novel and essential perspective of diachronic analyses with user expressions, which reveals fundamental and habitual user-LLM interaction patterns beyond individual task completion. We find that query patterns evolve from early ones emphasizing sole requests to combining more context later on, and individual users explore expression patterns but tend to converge with more experience. From there, we propose to understand communal trends of expressions underlying distinct tasks and discuss the preliminary findings. Finally, we discuss the key implications for user studies, computational pragmatics, and LLM alignment.

[141] ProCut: LLM Prompt Compression via Attribution Estimation

Zhentao Xu, Fengyi Li, Albert Chen, Xiaofeng Wang

Main category: cs.CL

TL;DR: ProCut is a training-free framework that compresses bloated prompts by analyzing attribution of semantic units, achieving 78% token reduction while maintaining or improving task performance.

Details

Motivation: Large-scale LLM systems suffer from bloated prompts with thousands of tokens due to iterative additions of instructions, examples, and rules, leading to maintenance difficulties, inference latency, and high serving costs.

Method: ProCut segments prompts into semantic units, quantifies their impact on task performance through attribution analysis, and prunes low-utility components. It uses an LLM-driven attribution estimator to reduce compression latency.

Result: Achieves 78% token reduction in production while maintaining or slightly improving task performance (up to 62% better than alternatives), with over 50% reduction in compression latency.

Conclusion: ProCut provides an effective, LLM-agnostic solution for prompt compression that integrates with existing optimization frameworks to create concise, high-performing prompts.

Abstract: In large-scale industrial LLM systems, prompt templates often expand to thousands of tokens as teams iteratively incorporate sections such as task instructions, few-shot examples, and heuristic rules to enhance robustness and coverage. This expansion leads to bloated prompts that are difficult to maintain and incur significant inference latency and serving costs. To address this, we introduce Prompt Compression via Attribution Estimation (ProCut), a flexible, LLM-agnostic, training-free framework that compresses prompts through attribution analysis. ProCut segments prompt templates into semantically meaningful units, quantifies their impact on task performance, and prunes low-utility components. Through extensive experiments on five public benchmark datasets and real-world industrial prompts, we show that ProCut achieves substantial prompt size reductions (78% fewer tokens in production) while maintaining or even slightly improving task performance (up to 62% better than alternative methods). We further introduce an LLM-driven attribution estimator that reduces compression latency by over 50%, and demonstrate that ProCut integrates seamlessly with existing prompt-optimization frameworks to produce concise, high-performing prompts.

[142] Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models

Haotian Wu, Bo Xu, Yao Shu, Menglin Yang, Chengwei Qin

Main category: cs.CL

TL;DR: JointThinking is a new in-context learning paradigm that generates two parallel answers (Thinking and Nothinking modes) and triggers a second thinking round when responses are inconsistent, achieving superior performance on reasoning benchmarks.

Details

Motivation: Reasoning LLMs' potential for in-context learning remains largely underexplored, while prior research focused mainly on training and inference strategies.

Method: Proposes Thinking with Nothinking Calibration (JointThinking) that prompts models to generate two answers in parallel (Thinking and Nothinking modes), with a second thinking round triggered when initial responses are inconsistent.

Result: Significantly outperforms few-shot chain-of-thought, thinking twice and majority voting; achieves comparable in-distribution performance to training-based SOTA while substantially outperforming on out-of-distribution tasks.

Conclusion: The approach shows strong scalability with model size, and systematic analysis reveals importance of structural thinking diversity and consistency check benefits.

Abstract: Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. While prior research has primarily focused on improving their training and inference strategies, their potential for in-context learning (ICL) remains largely underexplored. To fill this gap, we propose Thinking with Nothinking Calibration (JointThinking), a new ICL paradigm that prompts the model to generate two answers in parallel: one in Thinking mode and the other in Nothinking mode. A second round of Thinking is triggered only when the two initial responses are inconsistent, using a single prompt with two different answers. Extensive experiments across multiple reasoning benchmarks demonstrate that JointThinking significantly outperforms few-shot chain-of-thought (CoT), thinking twice and majority voting. Moreover, it achieves comparable in-distribution performance to training-based SOTA reasoning method, while substantially outperforming on out-of-distribution tasks. We further conduct a systematic analysis of the calibration mechanism, showing the importance of structural thinking diversity and the benefits of consistency check. Additionally, we observe that the performance gap between actual and ideal reasoning narrows as model size increases in the second thinking, indicating the strong scalability of our approach. Finally, we discuss current limitations and outline promising directions for future ICL research in RLLMs.

Haofei Yu, Zhengyang Qi, Yining Zhao, Kolby Nottingham, Keyang Xuan, Bodhisattwa Prasad Majumder, Hao Zhu, Paul Pu Liang, Jiaxuan You

Main category: cs.CL

TL;DR: Sotopia-RL is a framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards to improve RL training for social intelligence tasks in LLMs, achieving state-of-the-art performance.

Details

Motivation: Social intelligence is crucial for LLMs in real-world tasks like collaboration and negotiation. RL is suitable for training socially intelligent agents but faces challenges: utterance quality doesn't strictly relate to final success, and social interactions require multi-dimensional success rubrics.

Method: Proposed Sotopia-RL framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment attributes outcomes to individual utterances, while multi-dimensional rewards capture social interaction richness and reduce reward hacking.

Result: Achieved state-of-the-art social goal completion scores: 7.17 on Sotopia-hard and 8.31 on Sotopia-full, significantly outperforming existing approaches. Ablation studies confirmed the necessity of both utterance-level credit assignment and multi-dimensional reward design.

Conclusion: The proposed utterance-level multi-dimensional reward framework effectively facilitates RL training for social intelligence tasks, addressing the unique challenges of social interactions and improving model performance in open-ended social environments.

Abstract: Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as collaboration and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions without requiring human annotations. However, there are two unique parts about social intelligence tasks: (1) the quality of individual utterances in social interactions is not strictly related to final success; (2) social interactions require multi-dimensional rubrics for success. Therefore, we argue that it is necessary to design rewards for building utterance-level multi-dimensional reward models to facilitate RL training for social intelligence tasks. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment attributes outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training.

[144] What Do Humans Hear When Interacting? Experiments on Selective Listening for Evaluating ASR of Spoken Dialogue Systems

Kiyotada Mori, Seiya Kawano, Chaoran Liu, Carlos Toshinori Ishi, Angel Fernando Garcia Contreras, Koichiro Yoshino

Main category: cs.CL

TL;DR: This paper explores how human selective listening can inform ASR evaluation for spoken dialogue systems, showing that humans focus on important conversation parts when generating responses.

Details

Motivation: To identify ASR capabilities needed for SDSs by examining human selective listening - the ability to focus on important conversation parts during speech.

Method: Experimental comparison of human transcriptions for dialogue response generation versus reference transcriptions to confirm selective listening behavior.

Result: Confirmed that humans exhibit selective listening when generating dialogue responses, focusing on important information while potentially ignoring less relevant details.

Conclusion: Human selective listening can enable new ASR evaluation methods that identify gaps between ASR systems and human transcription capabilities for dialogue systems.

Abstract: Spoken dialogue systems (SDSs) utilize automatic speech recognition (ASR) at the front end of their pipeline. The role of ASR in SDSs is to recognize information in user speech related to response generation appropriately. Examining selective listening of humans, which refers to the ability to focus on and listen to important parts of a conversation during the speech, will enable us to identify the ASR capabilities required for SDSs and evaluate them. In this study, we experimentally confirmed selective listening when humans generate dialogue responses by comparing human transcriptions for generating dialogue responses and reference transcriptions. Based on our experimental results, we discuss the possibility of a new ASR evaluation method that leverages human selective listening, which can identify the gap between transcription ability between ASR systems and humans.

[145] An Investigation of Robustness of LLMs in Mathematical Reasoning: Benchmarking with Mathematically-Equivalent Transformation of Advanced Mathematical Problems

Yuren Hao, Xiang Wan, ChengXiang Zhai

Main category: cs.CL

TL;DR: The paper introduces PutnamGAP, a benchmark for stress-testing LLMs’ mathematical reasoning robustness using mathematically equivalent problems with linguistic and parametric variations.

Details

Motivation: To assess LLMs' mathematical-reasoning robustness beyond conventional methods by testing sensitivity to non-mathematical perturbations in mathematically equivalent problems.

Method: Created PutnamGAP benchmark with multiple mathematically-equivalent variations of competition-level math problems, then evaluated 18 commercial and open-source LLMs on these variants.

Result: Sharp performance degradation observed across all models. OpenAI’s O3 dropped 4.7 percentage points on surface-renaming variants and 12.9 points on parametric variants, with smaller models performing even worse.

Conclusion: The proposed evaluation methodology effectively deepens understanding of LLM robustness and provides insights for improving mathematical reasoning capabilities.

Abstract: In this paper, we introduce a systematic framework beyond conventional method to assess LLMs’ mathematical-reasoning robustness by stress-testing them on advanced math problems that are mathematically equivalent but with linguistic and parametric variation. These transformations allow us to measure the sensitivity of LLMs to non-mathematical perturbations, thereby enabling a more accurate evaluation of their mathematical reasoning capabilities. Using this new evaluation methodology, we created PutnamGAP, a new benchmark dataset with multiple mathematically-equivalent variations of competition-level math problems. With the new dataset, we evaluate multiple families of representative LLMs and examine their robustness. Across 18 commercial and open-source models we observe sharp performance degradation on the variants. OpenAI’s flagship reasoning model, O3, scores 51.5% on the originals but drops by 4.7 percentage points on surface-renaming variants, and by 12.9 percentage points on parametric variants, while smaller models fare far worse. Overall, the results show that the proposed new evaluation methodology is effective for deepening our understanding of the robustness of LLMs and generating new insights for further improving their mathematical reasoning capabilities.

[146] DESIGNER: Design-Logic-Guided Multidisciplinary Data Synthesis for LLM Reasoning

Weize Liu, Yongchi Zhao, Yijia Luo, Mingyu Xu, Jiaheng Liu, Yanan Li, Xiguo Hu, Zhiqi Bai, Yuchi Xu, Wenbo Su, Bo Zheng

Main category: cs.CL

TL;DR: DESIGNER is a pipeline that uses “design logic” to automatically generate challenging multidisciplinary reasoning questions from raw documents, creating large-scale datasets that significantly improve LLMs’ reasoning capabilities.

Details

Motivation: LLMs struggle with complex multi-step reasoning across diverse disciplines, and existing datasets lack disciplinary breadth, reasoning depth, and guiding principles for question synthesis.

Method: Reverse-engineered over 120,000 design logics from existing questions, then matched these with source documents (book and web corpora) to automatically synthesize challenging reasoning questions using LLMs that mimic human educators’ question-creation process.

Result: Created two large-scale datasets: DLR-Book (3.04M questions) and DLR-Web (1.66M questions) spanning 75 disciplines. Questions show greater difficulty and diversity than baseline datasets. Supervised fine-tuning on these datasets significantly enhanced Qwen3 and Llama3 models’ multidisciplinary reasoning, with base versions even surpassing official instruction-tuned counterparts.

Conclusion: The DESIGNER pipeline successfully generates high-quality, challenging reasoning data that substantially improves LLMs’ complex reasoning capabilities across diverse disciplines, demonstrating the effectiveness of design logic-guided data synthesis.

Abstract: Large language models (LLMs) have achieved remarkable success in many natural language tasks but still struggle with complex, multi-step reasoning, particularly across diverse disciplines. Existing reasoning datasets often lack disciplinary breadth, reasoning depth, and diversity, and lack guiding principles for question synthesis. We propose DESIGNER: a DESIGN-logic-guidEd Reasoning data synthesis pipeline that leverages naturally available, extensive raw documents (e.g., book corpus and web corpus) to generate multidisciplinary challenging questions. We introduce the concept of “design logic” and instruct LLMs to mimic human educators’ question-creation process, enabling automated synthesis of large-scale, high-difficulty questions. We use LLMs to reverse-engineer and abstract over 120,000 design logics from existing questions across various disciplines. By matching these design logics with source documents, we are able to create reasoning questions that far surpass the difficulty and diversity of existing datasets. Using this pipeline, we synthesized two large-scale reasoning datasets that span 75 disciplines: DLR-Book (3.04 million questions from the book corpus) and DLR-Web (1.66 million questions from the web corpus). Data analysis indicates that the questions synthesized by our method exhibit greater difficulty and diversity compared to those in the baseline datasets. We validate our synthesized data through supervised fine-tuning (SFT) on the Qwen3 and Llama3 model families. Our data substantially enhances their multidisciplinary reasoning capabilities, outperforming existing datasets. Notably, after SFT on our datasets, the base versions of these models even surpass their official instruction-tuned counterparts.

[147] MMReview: A Multidisciplinary and Multimodal Benchmark for LLM-Based Peer Review Automation

Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu

Main category: cs.CL

TL;DR: MMReview is a comprehensive benchmark for evaluating LLMs and MLLMs in peer review tasks across multiple disciplines and modalities, addressing the lack of unified evaluation standards for automated review systems.

Details

Motivation: The rapid growth of academic publications makes peer review time-consuming, and current LLM-based review systems lack standardized benchmarks to assess their ability to produce comprehensive, accurate, and human-aligned assessments, especially for multimodal content.

Method: Created MMReview benchmark with 240 papers across 17 research domains in four major disciplines, featuring multimodal content and expert-written reviews. Designed 13 tasks grouped into four categories: step-wise review generation, outcome formulation, human preference alignment, and robustness to adversarial manipulation.

Result: Extensive experiments on 16 open-source and 5 closed-source models demonstrated the benchmark’s thoroughness in evaluating model performance across different review tasks and modalities.

Conclusion: MMReview establishes a standardized foundation for developing automated peer review systems and serves as a critical step toward improving LLM-based academic review capabilities.

Abstract: With the rapid growth of academic publications, peer review has become an essential yet time-consuming responsibility within the research community. Large Language Models (LLMs) have increasingly been adopted to assist in the generation of review comments; however, current LLM-based review tasks lack a unified evaluation benchmark to rigorously assess the models’ ability to produce comprehensive, accurate, and human-aligned assessments, particularly in scenarios involving multimodal content such as figures and tables. To address this gap, we propose \textbf{MMReview}, a comprehensive benchmark that spans multiple disciplines and modalities. MMReview includes multimodal content and expert-written review comments for 240 papers across 17 research domains within four major academic disciplines: Artificial Intelligence, Natural Sciences, Engineering Sciences, and Social Sciences. We design a total of 13 tasks grouped into four core categories, aimed at evaluating the performance of LLMs and Multimodal LLMs (MLLMs) in step-wise review generation, outcome formulation, alignment with human preferences, and robustness to adversarial input manipulation. Extensive experiments conducted on 16 open-source models and 5 advanced closed-source models demonstrate the thoroughness of the benchmark. We envision MMReview as a critical step toward establishing a standardized foundation for the development of automated peer review systems.

[148] Scaled Signed Averaging Improves In-Context and Early Learning Benchmark Performance in Small Transformers

Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher

Main category: cs.CL

TL;DR: The paper identifies limitations of LLMs in in-context learning for semantic tasks involving quantifiers and linear functions, proposes SSA as an alternative to Softmax, and shows improved performance on various tasks.

Details

Motivation: To address limitations of Large Language Models' in-context learning abilities on semantic tasks with quantifiers ("all", "some") and linear functions, where Softmax in attention mechanism is identified as a contributing factor.

Method: Proposed scaled signed averaging (SSA) as a novel alternative to Softmax in attention mechanism to mitigate the identified limitations.

Result: SSA significantly improves performance on ICL tasks, outperforms transformer models with Softmax on early learning NLP benchmarks, and shows better results on linguistic probing tasks in zero and few-shot settings.

Conclusion: SSA is an effective alternative to Softmax that addresses specific limitations in LLMs’ in-context learning capabilities, particularly for semantic tasks involving quantifiers and linear functions.

Abstract: While Large Language models’ abilities for in-context learning (ICL) have drawn much attention, we examine some of its limitations on semantic tasks involving quantifiers like “all” and “some”, as well as on tasks with linear functions. We identify Softmax, the scoring function in attention mechanism, as a contributing factor to these limitations. We propose scaled signed averaging (SSA), a novel alternative to Softmax to mitigate these problems. We show that SSA significantly improves performance on our ICL tasks. In addition, SSA outperforms transformer models with Softmax on several early learning NLP benchmarks and linguistic probing tasks on zero and few-shot settings.

[149] ParamBench: A Graduate-Level Benchmark for Evaluating LLM Understanding on Indic Subjects

Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari

Main category: cs.CL

TL;DR: ParamBench is a new Hindi-language benchmark with 17K+ graduate-level questions from 21 Indian subjects, evaluating LLMs on culturally grounded reasoning in the Indian context.

Details

Motivation: Existing Indian benchmarks focus on basic factual queries and lack assessment of deeper disciplinary understanding tailored to the Indian cultural context, leaving LLM performance on graduate-level Indian questions unexplored.

Method: Created ParamBench dataset with questions from nationwide graduate-level entrance exams covering 21 diverse subjects including history, music, instruments, yoga, literature, philosophy, and law. Evaluated 16+ open source LLMs on diverse question formats including list-based matching, assertion-reason pairs, sequence ordering, and multiple-choice questions.

Result: Gemma3-27B achieved the highest overall accuracy of 56.4%. Subject-wise analysis revealed persistent weaknesses in culturally grounded topics like music, classical instruments, and law, indicating challenges in culturally specific reasoning.

Conclusion: LLMs show significant limitations in handling culturally grounded graduate-level questions in the Indian context, particularly in domains requiring deep cultural understanding like music, instruments, and law. The benchmark highlights the need for improved cultural reasoning capabilities in language models.

Abstract: Large language models have been widely evaluated on tasks such as comprehension, summarization, code generation, etc. However, their performance on graduate-level, culturally grounded questions in the Indian context remains largely unexplored. Existing Indian benchmarks emphasise basic fact-orientated queries that offer limited assessment of a deeper disciplinary understanding tailored to the Indian setting. In this paper, we present ParamBench, consisting of more than 17K questions in the Hindi language, comprising questionnaires from 21 diverse subjects. These questions are primarily derived from a nationwide graduate-level entrance examination covering topics such as history, music, instruments, yoga, literature, philosophy, law, etc.~ specifically for the Indian context. Additionally, we assess the ability of LLMs to handle diverse question formats - such as list-based matching, assertion-reason pairs, and sequence ordering - alongside conventional multiple-choice questions. We evaluated the performance of more than 16 open source LLMs on this benchmark, observing that Gemma3-27B attains the highest overall accuracy of 56.4%. Furthermore, subject-wise analysis indicates that even for the best-performing LLMs, performance remains weak on topics such as music, classical instruments, and law, underscoring persistent challenges in culturally grounded reasoning. The dataset and source code is present at https://github.com/ayushbits/ParamBench.

[150] The Percept-V Challenge: Can Multimodal LLMs Crack Simple Perception Problems?

Samrajnee Ghosh, Naman Agarwal, Hemanshu Garg, Chinmay Mittal, Mausam, Parag Singla

Main category: cs.CL

TL;DR: Percept-V is a new benchmark testing basic visual perception skills in MLLMs, revealing surprisingly poor performance compared to humans despite minimal reasoning requirements.

Details

Motivation: To evaluate whether MLLMs match human-level basic visual perception skills, as existing benchmarks focus on advanced reasoning rather than fundamental perception abilities.

Method: Created Percept-V dataset with 6000 program-generated images across 30 domains testing TVPS-4 perception skills, with minimal reasoning/knowledge requirements.

Result: State-of-the-art MLLMs performed weakly compared to high human performance, with performance declining rapidly as object count increases, and identified specific perception skills that are particularly challenging.

Conclusion: MLLMs have significant gaps in basic visual perception capabilities despite excelling at complex tasks, highlighting the need for improved perception-focused evaluation and model development.

Abstract: Cognitive science research treats visual perception, the ability to understand and make sense of a visual input, as one of the early developmental signs of intelligence. Its TVPS-4 framework categorizes and tests human perception into seven skills such as visual discrimination, and form constancy. Do Multimodal Large Language Models (MLLMs) match up to humans in basic perception? Even though there are many benchmarks that evaluate MLLMs on advanced reasoning and knowledge skills, there is limited research that focuses evaluation on simple perception. In response, we introduce Percept-V, a dataset containing 6000 program-generated uncontaminated images divided into 30 domains, where each domain tests one or more TVPS-4 skills. Our focus is on perception, so we make our domains quite simple and the reasoning and knowledge required for solving them are minimal. Since modern-day MLLMs can solve much more complex tasks, our a-priori expectation is that they will solve these domains very easily. Contrary to our belief, our experiments show a weak performance of SoTA proprietary and open-source MLLMs compared to very high human performance on Percept-V. We find that as number of objects in the image increases, performance goes down rather fast. Our experiments also identify the perception skills that are considerably harder for all models.

[151] Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness

Lang Xiong, Nishant Bhargava, Jianhang Hong, Jeremy Chang, Haihao Liu, Vasu Sharma, Kevin Zhu

Main category: cs.CL

TL;DR: LLMs exhibit behavioral shifts when they perceive evaluation vs deployment contexts, with models being more unsafe/deceptive in test environments. The paper quantifies this ’evaluation awareness’ using prompt rewriting and shows significant behavioral changes.

Details

Motivation: Benchmark performance may not accurately reflect LLMs' true safety and honesty due to behavioral shifts between evaluation and deployment contexts, posing critical AI alignment challenges.

Method: Use linear probe to score prompts from ’test-like’ to ‘deploy-like’, then employ LLM rewriting strategy to shift prompts towards natural deployment context while preserving original tasks.

Result: Rewritten prompts achieved 30% increase in average probe score. Across models: 5.26% increase in honest responses, 12.40% decrease in deceptive responses, and 6.38% increase in refusal rates, showing heightened safety compliance.

Conclusion: Evaluation awareness is quantifiable and manipulable, directly influencing LLM behavior. Models are more prone to unsafe/deceptive outputs in perceived test environments, highlighting need for more realistic evaluation frameworks.

Abstract: Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as “evaluation awareness.” This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model’s true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from “test-like” to “deploy-like” and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten “deploy-like” prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.

[152] From Injection to Defense: Constructing Edit-Based Fingerprints for Large Language Models

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

Main category: cs.CL

TL;DR: RFEdit is a knowledge-editing framework that embeds multilingual natural language fingerprints in LLMs by modifying sparse model weights, enabling robust fingerprinting with minimal utility impact, and uses Fingerprint Subspace-aware Fine-Tuning to maintain fingerprint integrity during legitimate fine-tuning.

Details

Motivation: Current backdoor-based fingerprinting methods face trade-offs between detectability (garbled text fingerprints are easily filtered) and unintentional triggering (coherent natural language fingerprints are prone to accidental activation). There's a need for robust fingerprinting that protects LLM intellectual property while maintaining model utility.

Method: RFEdit embeds rule-based multilingual natural language fingerprints by modifying a sparse subset of model weights. It uses Fingerprint Subspace-aware Fine-Tuning (FSFT) to mitigate fingerprint degradation during legitimate fine-tuning by restricting parameter updates to the fingerprint subspace.

Result: RFEdit achieves high detection effectiveness, robustness against adversarial manipulations, harmlessness to model utility, and persistence under fine-tuning. The framework maintains robustness under quantization and pruning. Fingerprint effectiveness improves by more than 10% when combined with FSFT for math and alpaca downstream tasks.

Conclusion: RFEdit establishes a comprehensive pipeline from fingerprint injection to defense, providing an effective solution for protecting LLM intellectual property through robust, persistent fingerprinting that maintains model utility and withstands various deployment scenarios including fine-tuning, quantization, and pruning.

Abstract: Fingerprinting is critical for maintaining traceability and protecting the intellectual property (IP) of developers, as LLMs deployed in web applications are susceptible to unauthorized redistribution and misuse via fine-tuning or black-box deployment. However, current backdoor-based fingerprinting methods face a fundamental trade-off: fingerprints embedded as garbled text are easily detected and filtered, whereas those crafted as coherent natural language are prone to being triggered unintentionally. To overcome these limitations, we propose RFEdit, a knowledge-editing framework that embeds a rule-based multilingual natural language fingerprint (MNLF) by modifying a sparse subset of model weights. This approach enables efficient and robust fingerprint injection with minimal impact on unrelated knowledge in LLMs. Our RFEdit framework is further safeguarded by Fingerprint Subspace-aware Fine-Tuning (FSFT), which mitigates fingerprint degradation during legitimate fine-tuning by restricting parameter updates to the fingerprint subspace. This approach preserves fingerprint integrity while enhancing downstream task performance of LLMs. These advances establish a comprehensive pipeline from fingerprint injection to defense, achieving high detection effectiveness, robustness against adversarial manipulations, harmlessness to model utility, and persistence under fine-tuning. Extensive experiments demonstrate that RFEdit maintains robustness under quantization and pruning. Additionally, fingerprint effectiveness is generally improved by more than 10% when combined with FSFT for math and alpaca downstream tasks.

[153] Improving Factuality in LLMs via Inference-Time Knowledge Graph Construction

Shanglin Wu, Lihui Liu, Jinho D. Choi, Kai Shu

Main category: cs.CL

TL;DR: A framework that dynamically constructs and expands knowledge graphs during inference to improve LLM factuality by integrating internal and external knowledge.

Details

Motivation: LLMs struggle with factual consistency due to parametric memory limitations, and existing RAG methods handle knowledge as unstructured text, reducing retrieval accuracy and hindering compositional reasoning.

Method: Extract seed KG from questions via prompting, iteratively expand using LLM’s internal knowledge, then selectively refine through external retrieval to enhance factual coverage and correct inaccuracies.

Result: Consistent gains in factual accuracy over baselines on three diverse Factual QA benchmarks.

Conclusion: Inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.

Abstract: Large Language Models (LLMs) often struggle with producing factually consistent answers due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) paradigms mitigate this issue by incorporating external knowledge at inference time. However, such methods typically handle knowledge as unstructured text, which reduces retrieval accuracy, hinders compositional reasoning, and amplifies the influence of irrelevant information on the factual consistency of LLM outputs. To overcome these limitations, we propose a novel framework that dynamically constructs and expands knowledge graphs (KGs) during inference, integrating both internal knowledge extracted from LLMs and external knowledge retrieved from external sources. Our method begins by extracting a seed KG from the question via prompting, followed by iterative expansion using the LLM’s internal knowledge. The KG is then selectively refined through external retrieval, enhancing factual coverage and correcting inaccuracies. We evaluate our approach on three diverse Factual QA benchmarks, demonstrating consistent gains in factual accuracy over baselines. Our findings reveal that inference-time KG construction is a promising direction for enhancing LLM factuality in a structured, interpretable, and scalable manner.

[154] TextMine: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action

Chenyue Zhou, Gürkan Solmaz, Flavio Cirillo, Kiril Gashteovski, Jonathan Fürst

Main category: cs.CL

TL;DR: TextMine is the first dataset and ontology-guided LLM pipeline for extracting structured knowledge from Humanitarian Mine Action reports, improving accuracy by 44.2% and reducing hallucinations by 22.5%.

Details

Motivation: Humanitarian Mine Action agencies produce valuable operational knowledge in unstructured reports, limiting information transfer between agencies and hindering efficient landmine detection and removal.

Method: Created TextMine dataset in collaboration with Cambodian Mine Action Center, developed ontology-guided LLM pipeline to extract (subject, relation, object)-triples, and introduced bias-aware evaluation framework using LLM-as-Judge protocol.

Result: Ontology-aligned prompts improved extraction accuracy by 44.2%, reduced hallucinations by 22.5%, and enhanced format adherence by 20.9% compared to baseline models.

Conclusion: TextMine successfully structures HMA knowledge, making it transferable between agencies, with publicly released dataset and code to advance humanitarian mine action efforts.

Abstract: Humanitarian Mine Action (HMA) addresses the challenge of detecting and removing landmines from conflict regions. Much of the life-saving operational knowledge produced by HMA agencies is buried in unstructured reports, limiting the transferability of information between agencies. To address this issue, we propose TextMine: the first dataset, evaluation framework and ontology-guided large language model (LLM) pipeline for knowledge extraction in the HMA domain. TextMine structures HMA reports into (subject, relation, object)-triples, thus creating domain-specific knowledge. To ensure real-world relevance, we created the dataset in collaboration with Cambodian Mine Action Center (CMAC). We further introduce a bias-aware evaluation framework that combines human-annotated triples with an LLM-as-Judge protocol to mitigate position bias in reference-free scoring. Our experiments show that ontology-aligned prompts improve extraction accuracy by up to 44.2%, reduce hallucinations by 22.5%, and enhance format adherence by 20.9% compared to baseline models. We publicly release the dataset and code.

[155] SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models

Huy Nghiem, Advik Sachdeva, Hal Daumé III

Main category: cs.CL

TL;DR: SMARTER is a two-stage framework that uses LLMs for explainable content moderation, generating synthetic explanations and refining them through cross-model training to improve performance with minimal data.

Details

Motivation: To address the proliferation of toxic content on social media with an efficient, explainable moderation system that requires minimal human supervision.

Method: Two-stage framework: Stage 1 generates synthetic explanations for correct/incorrect labels using LLM outputs for preference optimization; Stage 2 refines explanations through cross-model training between weaker and stronger models.

Result: Achieves up to 13.5% macro-F1 improvement over standard few-shot baselines on HateXplain, Latent Hate, and Implicit Hate benchmarks while using only a fraction of full training data.

Conclusion: SMARTER provides a scalable strategy for low-resource content moderation by leveraging LLMs’ self-improving capabilities for both classification and explanation tasks.

Abstract: WARNING: This paper contains examples of offensive materials. To address the proliferation of toxic content on social media, we introduce SMARTER, we introduce SMARTER, a data-efficient two-stage framework for explainable content moderation using Large Language Models (LLMs). In Stage 1, we leverage LLMs’ own outputs to generate synthetic explanations for both correct and incorrect labels, enabling alignment via preference optimization with minimal human supervision. In Stage 2, we refine explanation quality through cross-model training, allowing weaker models to align stylistically and semantically with stronger ones. Experiments on three benchmark tasks – HateXplain, Latent Hate, and Implicit Hate – demonstrate that SMARTER enables LLMs to achieve up to a 13.5% macro-F1 improvement over standard few-shot baselines while using only a fraction of the full training data. Our framework offers a scalable strategy for low-resource settings by harnessing LLMs’ self-improving capabilities for both classification and explanation.

[156] PARL-MT: Learning to Call Functions in Multi-Turn Conversation with Progress Awareness

Huacan Chai, Zijie Cao, Maolin Ran, Yingxuan Yang, Jianghao Lin, pengxin, Hairui Wang, Renjie Ding, Ziyu Wan, Muning Wen, Weiwen Liu, Weinan Zhang, Fei Huang, Ying Wen

Main category: cs.CL

TL;DR: PARL-MT is a framework that incorporates progress awareness into LLM training for multi-turn function calling, combining automatic dataset generation with reinforcement learning to improve long-horizon task execution.

Details

Motivation: Real-world applications require multi-turn conversations where LLMs need progress awareness to summarize past interactions and plan future actions, but existing approaches either neglect task-level planning or struggle with redundancy in RL training.

Method: PARL-MT combines Progress Awareness Generation (PAG) pipeline to automatically construct datasets with conversation summaries and future planning, and Progress Awareness-Guided Reinforcement Learning (PAG-RL) that integrates progress awareness into RL training.

Result: Empirical results on two public benchmarks show PARL-MT significantly outperforms existing methods in multi-turn function calling.

Conclusion: Progress awareness is effective in enabling robust and efficient multi-turn function calling, as demonstrated by PARL-MT’s superior performance over existing approaches.

Abstract: Large language models (LLMs) have achieved impressive success in single-turn function calling, yet real-world applications such as travel planning or multi-stage data analysis typically unfold across multi-turn conversations. In these settings, LLMs must not only issue accurate function calls at each step but also maintain progress awareness, the ability to summarize past interactions and plan future actions to ensure coherent, long-horizon task execution. Existing approaches, however, either reduce multi-turn training to isolated single-turn samples, which neglects task-level planning, or employ end-to-end reinforcement learning (RL) that struggles with redundancy and lacks explicit integration of progress awareness. To overcome these limitations, we introduce PARL-MT, a framework that explicitly incorporates progress awareness into LLM training for multi-turn function calling. PARL-MT combines (i) a Progress Awareness Generation (PAG) pipeline, which automatically constructs datasets coupling conversation summaries with future task planning, and (ii) a Progress Awareness-Guided Reinforcement Learning (PAG-RL) algorithm, which integrates progress awareness into RL training to reduce contextual redundancy and improve alignment between local actions and global task completion. Empirical results on two public benchmarks demonstrate that PARL-MT significantly outperforms existing methods, highlighting the effectiveness of progress awareness in enabling robust and efficient multi-turn function calling.

[157] LLM Hallucination Detection: HSAD

JinXin Li, Gang Tu, JunJie Hu

Main category: cs.CL

TL;DR: HSAD is a hallucination detection method that analyzes frequency-domain features of hidden layer temporal signals to identify reasoning anomalies in LLMs, overcoming limitations of knowledge coverage and static feature analysis.

Details

Motivation: Current hallucination detection methods are constrained by knowledge coverage scope and struggle to capture reasoning biases during inference. The paper aims to address these limitations by modeling the LLM reasoning process as a temporal cognitive journey.

Method: Treat LLM reasoning as temporal cognitive process, model human deception detection through hidden layer signals, apply Fast Fourier Transform to create spectral features, and design detection algorithm based on these frequency-domain features.

Result: The HSAD method demonstrates higher detection accuracy and robustness compared to existing approaches, with analysis experiments proving the effectiveness of spectral features in capturing reasoning anomalies.

Conclusion: HSAD effectively combines reasoning process modeling with frequency-domain feature extraction to overcome limitations of existing hallucination detection methods, providing a more accurate and robust solution for identifying hallucinations in LLM-generated content.

Abstract: Although Large Language Models have demonstrated powerful capabilities in a wide range of tasks such as language understanding and code generation, the frequent occurrence of hallucinations during the generation process has become a significant impediment to their deployment in critical application scenarios. Current mainstream hallucination detection methods rely on factual consistency verification or static hidden layer features. The former is constrained by the scope of knowledge coverage, while the latter struggles to capture reasoning biases during the inference process. To address these issues, and inspired by signal analysis methods in cognitive neuroscience, this paper proposes a hallucination detection method based on the frequency-domain analysis of hidden layer temporal signals, named HSAD (\textbf{H}idden \textbf{S}ignal \textbf{A}nalysis-based \textbf{D}etection). First, by treating the LLM’s reasoning process as a cognitive journey that unfolds over time, we propose modeling and simulating the human process of signal perception and discrimination in a deception-detection scenario through hidden layer temporal signals. Next, The Fast Fourier Transform is applied to map these temporal signals into the frequency domain to construct spectral features, which are used to capture anomalies that arise during the reasoning process; analysis experiments on these spectral features have proven the effectiveness of this approach. Finally, a hallucination detection algorithm is designed based on these spectral features to identify hallucinations in the generated content. By effectively combining the modeling of the reasoning process with frequency-domain feature extraction, the HSAD method overcomes the limitations of existing approaches in terms of knowledge coverage and the detection of reasoning biases, demonstrating higher detection accuracy and robustness.

[158] Text-Based Approaches to Item Alignment to Content Standards in Large-Scale Reading & Writing Tests

Yanbin Fu, Hong Jiao, Tianyi Zhou, Robert W. Lissitz, Nan Zhang, Ming Li, Qingshu Xu, Sydney Peters

Main category: cs.CL

TL;DR: Fine-tuned small language models outperform embedding-based supervised models for automated alignment of test items to content standards, with better performance achieved by including more item text data rather than just increasing sample size.

Details

Motivation: Human expert alignment of test items to content standards is subjective and time-consuming, requiring automated solutions to improve efficiency and objectivity.

Method: Fine-tuned small language models were trained for domain and skill-level alignment using data from college admissions reading/writing tests, with performance compared against embedding-based supervised machine learning models.

Result: Fine-tuned SLMs consistently outperformed embedding-based models, especially for fine-grained skill alignment. Including more item text data substantially improved performance beyond sample size increases alone.

Conclusion: Fine-tuned small language models are effective for automated item alignment, with semantic similarity analysis revealing that certain skill misclassifications occur due to inherent semantic closeness between skills in the test content.

Abstract: Aligning test items to content standards is a critical step in test development to collect validity evidence based on content. Item alignment has typically been conducted by human experts. This judgmental process can be subjective and time-consuming. This study investigated the performance of fine-tuned small language models (SLMs) for automated item alignment using data from a large-scale standardized reading and writing test for college admissions. Different SLMs were trained for alignment at both domain and skill levels respectively with 10 skills mapped to 4 content domains. The model performance was evaluated in multiple criteria on two testing datasets. The impact of types and sizes of the input data for training was investigated. Results showed that including more item text data led to substantially better model performance, surpassing the improvements induced by sample size increase alone. For comparison, supervised machine learning models were trained using the embeddings from the multilingual-E5-large-instruct model. The study results showed that fine-tuned SLMs consistently outperformed the embedding-based supervised machine learning models, particularly for the more fine-grained skill alignment. To better understand model misclassifications, multiple semantic similarity analysis including pairwise cosine similarity, Kullback-Leibler divergence of embedding distributions, and two-dimension projections of item embeddings were conducted. These analyses consistently showed that certain skills in SAT and PSAT were semantically too close, providing evidence for the observed misclassification.

[159] Spiral of Silence in Large Language Model Agents

Mingze Zhong, Meng Fang, Zijing Shi, Yuxuan Huang, Shunfeng Zheng, Yali Du, Ling Chen, Jun Wang

Main category: cs.CL

TL;DR: This paper investigates whether Spiral of Silence (SoS) dynamics can emerge in LLM collectives through statistical language generation, proposing an evaluation framework with controlled conditions varying History and Persona signals.

Details

Motivation: The classical Spiral of Silence theory was developed for human societies, but it's unclear if similar dynamics can emerge in LLM collectives where psychological explanations don't directly apply.

Method: Proposed evaluation framework with four controlled conditions varying History and Persona signals. Used trend tests (Mann-Kendall, Spearman’s rank) and concentration measures (kurtosis, interquartile range) to assess opinion dynamics across open-source and closed-source models.

Result: History and Persona together produce strong majority dominance and replicate SoS patterns; History alone induces strong anchoring; Persona alone fosters diverse but uncorrelated opinions, showing SoS dynamics cannot emerge without historical anchoring.

Conclusion: The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.

Abstract: The Spiral of Silence (SoS) theory holds that individuals with minority views often refrain from speaking out for fear of social isolation, enabling majority positions to dominate public discourse. When the ‘agents’ are large language models (LLMs), however, the classical psychological explanation is not directly applicable, since SoS was developed for human societies. This raises a central question: can SoS-like dynamics nevertheless emerge from purely statistical language generation in LLM collectives? We propose an evaluation framework for examining SoS in LLM agents. Specifically, we consider four controlled conditions that systematically vary the availability of ‘History’ and ‘Persona’ signals. Opinion dynamics are assessed using trend tests such as Mann-Kendall and Spearman’s rank, along with concentration measures including kurtosis and interquartile range. Experiments across open-source and closed-source models show that history and persona together produce strong majority dominance and replicate SoS patterns; history signals alone induce strong anchoring; and persona signals alone foster diverse but uncorrelated opinions, indicating that without historical anchoring, SoS dynamics cannot emerge. The work bridges computational sociology and responsible AI design, highlighting the need to monitor and mitigate emergent conformity in LLM-agent systems.

[160] Epistemic Diversity and Knowledge Collapse in Large Language Models

Dustin Wright, Sarah Masud, Jared Moore, Srishti Yadav, Maria Antoniak, Chan Young Park, Isabelle Augenstein

Main category: cs.CL

TL;DR: LLMs generate homogenous texts risking knowledge collapse. This study measures epistemic diversity in 27 LLMs across 155 topics and 12 countries, finding newer models are more diverse but still less than web searches. Model size negatively impacts diversity while RAG helps, with cultural context affecting improvements.

Details

Motivation: LLMs tend to produce lexically, semantically, and stylistically homogenous texts, which risks knowledge collapse - a shrinking range of accessible information over time. Existing research has limitations in measuring this homogenization.

Method: Developed a new methodology to measure epistemic diversity (variation in real-world claims) and tested 27 LLMs across 155 topics covering 12 countries using 200 prompt variations from real user chats.

Result: Newer models generate more diverse claims, but nearly all models are less epistemically diverse than basic web search. Model size negatively impacts diversity, while RAG has positive impact (varies by cultural context). Country-specific claims reflect English language more than local languages.

Conclusion: There is a significant gap in epistemic representation in LLMs compared to traditional knowledge sources, with cultural biases evident in how country-specific information is represented.

Abstract: Large language models (LLMs) tend to generate lexically, semantically, and stylistically homogenous texts. This poses a risk of knowledge collapse, where homogenous LLMs mediate a shrinking in the range of accessible information over time. Existing works on homogenization are limited by a focus on closed-ended multiple-choice setups or fuzzy semantic features, and do not look at trends across time and cultural contexts. To overcome this, we present a new methodology to measure epistemic diversity, i.e., variation in real-world claims in LLM outputs, which we use to perform a broad empirical study of LLM knowledge collapse. We test 27 LLMs, 155 topics covering 12 countries, and 200 prompt variations sourced from real user chats. For the topics in our study, we show that while newer models tend to generate more diverse claims, nearly all models are less epistemically diverse than a basic web search. We find that model size has a negative impact on epistemic diversity, while retrieval-augmented generation (RAG) has a positive impact, though the improvement from RAG varies by the cultural context. Finally, compared to a traditional knowledge source (Wikipedia), we find that country-specific claims reflect the English language more than the local one, highlighting a gap in epistemic representation

[161] FedSRD: Sparsify-Reconstruct-Decompose for Communication-Efficient Federated Large Language Models Fine-Tuning

Guochen Yan, Luyuan Xie, Qingni Shen, Yuejian Fang, Zhonghai Wu

Main category: cs.CL

TL;DR: FedSRD is a communication-efficient federated learning framework that reduces LoRA parameter transmission by 90% while improving model performance on heterogeneous data through sparsification, reconstruction, and decomposition techniques.

Details

Motivation: Current LLM training is unsustainable due to exhausted high-quality data sources. Federated Learning with LoRA faces communication bottlenecks and parameter conflicts in heterogeneous network environments.

Method: FedSRD uses a Sparsify-Reconstruct-Decompose framework: importance-aware sparsification to reduce uploaded parameters, server-side reconstruction and aggregation in full-rank space to mitigate conflicts, and decomposition into sparse low-rank format for efficient broadcasting.

Result: Experimental results on 10 benchmarks show up to 90% reduction in communication costs while improving model performance on heterogeneous client data.

Conclusion: FedSRD provides an effective solution for communication-efficient federated LLM fine-tuning, addressing both communication bottlenecks and parameter conflict issues in decentralized settings.

Abstract: The current paradigm of training large language models (LLMs) on publicly available Web data is becoming unsustainable, with high-quality data sources in specialized domains nearing exhaustion. Federated Learning (FL) emerges as a practical solution for the next generation of AI on a decentralized Web, enabling privacy-preserving collaborative fine-tuning by leveraging private data distributed across a global client base. While Low-Rank Adaptation (LoRA) is the standard for efficient fine-tuning, its application in federated settings presents a critical challenge: communication overhead remains a significant bottleneck across the Web’s heterogeneous network conditions. The structural redundancy within LoRA parameters not only incurs a heavy communication burden but also introduces conflicts when aggregating client updates. To address this, we propose FedSRD, a Sparsify-Reconstruct-Decompose framework designed for communication-efficient federated LLMs fine-tuning. We first introduce an importance-aware sparsification method that preserves the structural integrity of LoRA updates to reduce the uploaded parameter count. The server then reconstructs and aggregates these updates in a full-rank space to mitigate conflicts. Finally, it decomposes the global update into a sparse low-rank format for broadcast, ensuring a symmetrically efficient cycle. We also propose an efficient variant, FedSRD-e, to reduce computational overhead. Experimental results on 10 benchmarks demonstrate that our framework significantly reduces communication costs by up to 90% while even improving model performance on heterogeneous client data.

[162] Are BabyLMs Deaf to Gricean Maxims? A Pragmatic Evaluation of Sample-efficient Language Models

Raha Askari, Sina Zarrieß, Özge Alacam, Judith Sieker

Main category: cs.CL

TL;DR: A benchmark testing whether small language models (BabyLMs) can distinguish Gricean maxim violations, showing that models trained on <100M tokens outperform those on <10M tokens but still lag behind children and large language models.

Details

Motivation: To test whether language models can identify implicit meanings through Gricean maxim violations, which are essential for human communication and pragmatic inference.

Method: Created a novel benchmark based on Surian et al.’s study, comparing BabyLMs trained on <10M and <100M tokens across five Gricean maxims, with performance compared against children and a large LLM trained on 3T tokens.

Result: Models trained on <100M tokens outperformed those trained on <10M tokens but still fell short of child-level and LLM competence. Modest data improvements led to finer-grained differentiation between pragmatic dimensions.

Conclusion: While increased training data improves some pragmatic capabilities, current small language models still cannot match human-level pragmatic understanding, suggesting fundamental differences in how models and humans acquire pragmatic competence.

Abstract: Implicit meanings are integral to human communication, making it essential for language models to be capable of identifying and interpreting them. Grice (1975) proposed a set of conversational maxims that guide cooperative dialogue, noting that speakers may deliberately violate these principles to express meanings beyond literal words, and that listeners, in turn, recognize such violations to draw pragmatic inferences. Building on Surian et al. (1996)’s study of children’s sensitivity to violations of Gricean maxims, we introduce a novel benchmark to test whether language models pretrained on less than 10M and less than 100M tokens can distinguish maxim-adhering from maxim-violating utterances. We compare these BabyLMs across five maxims and situate their performance relative to children and a Large Language Model (LLM) pretrained on 3T tokens. We find that overall, models trained on less than 100M tokens outperform those trained on less than 10M, yet fall short of child-level and LLM competence. Our results suggest that modest data increases improve some aspects of pragmatic behavior, leading to finer-grained differentiation between pragmatic dimensions.

[163] Can AI Truly Represent Your Voice in Deliberations? A Comprehensive Study of Large-Scale Opinion Aggregation with LLMs

Shenzhe Zhu, Shu Yang, Michiel A. Bakker, Alex Pentland, Jiaxin Pei

Main category: cs.CL

TL;DR: DeliberationBank is a human-grounded dataset for evaluating deliberation summaries, and DeliberationJudge is a fine-tuned model that outperforms LLM judges in aligning with human judgments on summary quality.

Details

Motivation: LLMs risk underrepresenting minority perspectives and exhibiting bias in deliberation summarization, but current evaluation methods using LLMs as judges show weak alignment with human judgments.

Method: Created DeliberationBank dataset with opinion data from 3,000 participants and summary judgments from 4,500 participants, then trained DeliberationJudge (fine-tuned DeBERTa) to rate summaries from individual perspectives.

Result: DeliberationJudge is more efficient and better aligned with human judgments than LLM judges. Evaluation of 18 LLMs revealed persistent weaknesses, especially underrepresentation of minority positions.

Conclusion: The framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.

Abstract: Large-scale public deliberations generate thousands of free-form contributions that must be synthesized into representative and neutral summaries for policy use. While LLMs have been shown as a promising tool to generate summaries for large-scale deliberations, they also risk underrepresenting minority perspectives and exhibiting bias with respect to the input order, raising fairness concerns in high-stakes contexts. Studying and fixing these issues requires a comprehensive evaluation at a large scale, yet current practice often relies on LLMs as judges, which show weak alignment with human judgments. To address this, we present DeliberationBank, a large-scale human-grounded dataset with (1) opinion data spanning ten deliberation questions created by 3,000 participants and (2) summary judgment data annotated by 4,500 participants across four dimensions (representativeness, informativeness, neutrality, policy approval). Using these datasets, we train DeliberationJudge, a fine-tuned DeBERTa model that can rate deliberation summaries from individual perspectives. DeliberationJudge is more efficient and more aligned with human judgements compared to a wide range of LLM judges. With DeliberationJudge, we evaluate 18 LLMs and reveal persistent weaknesses in deliberation summarization, especially underrepresentation of minority positions. Our framework provides a scalable and reliable way to evaluate deliberation summarization, helping ensure AI systems are more representative and equitable for policymaking.

[164] DACP: Domain-Adaptive Continual Pre-Training of Large Language Models for Phone Conversation Summarization

Xue-Yong Fu, Elena Khasanova, Md Tahmid Rahman Laskar, Harsh Saini, Shashi Bhushan TN

Main category: cs.CL

TL;DR: Continual pre-training improves LLMs for conversational summarization using unlabeled business conversation data, achieving gains in both in-domain and out-of-domain benchmarks without requiring labeled data.

Details

Motivation: LLMs perform poorly on specialized domains different from their pre-training distribution, and fine-tuning requires costly labeled data. Continual pre-training offers a scalable, self-supervised alternative.

Method: Use large-scale unlabeled business conversation data for continual pre-training of LLMs to adapt them for conversational summarization tasks, with analysis of data selection strategies.

Result: Substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness.

Conclusion: Continual pre-training is an effective approach for adapting LLMs to specialized summarization domains, providing practical guidelines for industrial applications.

Abstract: Large language models (LLMs) have achieved impressive performance in text summarization, yet their performance often falls short when applied to specialized domains that differ from their original pre-training distribution. While fine-tuning can improve summarization quality, it typically relies on costly and scarce high-quality labeled data. In this work, we explore continual pre-training as a scalable, self-supervised approach to adapt LLMs for downstream summarization tasks, particularly in the context of noisy real-world conversation transcripts. We conduct extensive experiments using large-scale, unlabeled business conversation data to investigate whether continual pre-training enhances model capabilities in conversational summarization. Our results demonstrate that continual pre-training yields substantial gains in both in-domain and out-of-domain summarization benchmarks, while maintaining strong generalization and robustness. We also analyze the effects of data selection strategies, providing practical guidelines for applying continual pre-training in summarization-focused industrial applications.

[165] EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models

Hadi Mohammadi, Anastasia Giachanou, Ayoub Bagheri

Main category: cs.CL

TL;DR: EvalMORAAL is a transparent CoT framework that evaluates moral alignment in 20 LLMs using two scoring methods and model-as-judge peer review, revealing significant regional bias between Western and non-Western regions.

Details

Motivation: To develop a transparent framework for evaluating moral alignment in large language models across different cultural contexts and identify potential regional biases.

Method: Uses chain-of-thought with two scoring methods (log-probabilities and direct ratings) plus model-as-judge peer review on World Values Survey (55 countries, 19 topics) and PEW Global Attitudes Survey (39 countries, 8 topics).

Result: Top models align closely with survey responses (Pearson’s r≈0.90 on WVS), but show significant regional bias: Western regions average r=0.82 vs non-Western regions r=0.61 (0.21 gap). Peer review flagged 348 conflicts and peer agreement correlates with survey alignment.

Conclusion: Shows progress toward culture-aware AI but highlights persistent regional bias challenges, with automated quality checks supporting evaluation across diverse cultural contexts.

Abstract: We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson’s r approximately 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating consistent regional bias. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured chain-of-thought protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to survey alignment (WVS r=0.74, PEW r=0.39, both p<.001), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.

cs.CV

[166] Milestone Determination for Autonomous Railway Operation

Josh Hunter, John McDermid, Simon Burton, Poppy Fynes, Mia Dempster

Main category: cs.CV

TL;DR: The paper proposes a route-specific, milestone-based approach for railway automation vision systems to address limitations in existing datasets and methods.

Details

Motivation: Traditional computer vision systems for railway automation suffer from limited high-quality sequential data, lack of spatio-temporal context, and issues with realism in alternative solutions.

Method: Focus on route-specific contextual cues to generate rich sequential datasets, using milestone determination to develop targeted rule-based models that simplify learning by focusing on critical decision points.

Result: The approach enables generation of datasets that better align with real-world operational logic and facilitates training vision agents in controlled environments.

Conclusion: This milestone-based framework provides a practical solution for developing safer and more efficient machine learning systems for railway automation.

Abstract: In the field of railway automation, one of the key challenges has been the development of effective computer vision systems due to the limited availability of high-quality, sequential data. Traditional datasets are restricted in scope, lacking the spatio temporal context necessary for real-time decision-making, while alternative solutions introduce issues related to realism and applicability. By focusing on route-specific, contextually relevant cues, we can generate rich, sequential datasets that align more closely with real-world operational logic. The concept of milestone determination allows for the development of targeted, rule-based models that simplify the learning process by eliminating the need for generalized recognition of dynamic components, focusing instead on the critical decision points along a route. We argue that this approach provides a practical framework for training vision agents in controlled, predictable environments, facilitating safer and more efficient machine learning systems for railway automation.

Yolo Yunlong Tang, Siting Xu, Teng Wang, Qin Lin, Qinglin Lu, Feng Zheng

Main category: cs.CV

TL;DR: Proposes M-SAN, a multi-modal segment assemblage network for advertisement video editing that performs efficient and coherent segment assemblage end-to-end using multi-modal representations and attention mechanisms.

Details

Motivation: Existing methods perform well at video segmentation but suffer from dependencies on extra cumbersome models and poor performance at segment assemblage stage.

Method: Uses multi-modal representation from segments, follows Encoder-Decoder Ptr-Net framework with Attention mechanism, and designs importance-coherence reward for training.

Result: Achieves better performance than random selection and previous methods on the proposed Imp-Coh@Time metric, with ablation studies confirming the value of multi-modal representation and importance-coherence reward.

Conclusion: M-SAN effectively addresses segment assemblage problems in advertisement video editing through end-to-end multi-modal approach with improved performance metrics.

Abstract: Advertisement video editing aims to automatically edit advertising videos into shorter videos while retaining coherent content and crucial information conveyed by advertisers. It mainly contains two stages: video segmentation and segment assemblage. The existing method performs well at video segmentation stages but suffers from the problems of dependencies on extra cumbersome models and poor performance at the segment assemblage stage. To address these problems, we propose M-SAN (Multi-modal Segment Assemblage Network) which can perform efficient and coherent segment assemblage task end-to-end. It utilizes multi-modal representation extracted from the segments and follows the Encoder-Decoder Ptr-Net framework with the Attention mechanism. Importance-coherence reward is designed for training M-SAN. We experiment on the Ads-1k dataset with 1000+ videos under rich ad scenarios collected from advertisers. To evaluate the methods, we propose a unified metric, Imp-Coh@Time, which comprehensively assesses the importance, coherence, and duration of the outputs at the same time. Experimental results show that our method achieves better performance than random selection and the previous method on the metric. Ablation experiments further verify that multi-modal representation and importance-coherence reward significantly improve the performance. Ads-1k dataset is available at: https://github.com/yunlong10/Ads-1k

[168] CML-Bench: A Framework for Evaluating and Enhancing LLM-Powered Movie Scripts Generation

Mingzhe Zheng, Dingjie Song, Guanyu Zhou, Jun You, Jiahao Zhan, Xuran Ma, Xinyuan Song, Ser-Nam Lim, Qifeng Chen, Harry Yang

Main category: cs.CV

TL;DR: This paper addresses LLMs’ limitations in generating emotionally rich movie scripts by creating CML-Dataset and CML-Bench for quality assessment, and proposes CML-Instruction to improve LLM-generated screenplays.

Details

Motivation: LLMs struggle to capture the nuanced storytelling and emotional depth required for compelling movie scripts despite their structural capabilities.

Method: Created CML-Dataset with (summary, content) pairs from quality movie scripts, analyzed narrative structures to define three assessment dimensions (DC, CC, PR), developed CML-Bench metrics, and introduced CML-Instruction prompting strategy.

Result: CML-Bench effectively identifies weaknesses in LLM-generated scripts and assigns high scores to human-written scripts. CML-Instruction helps LLMs generate higher-quality screenplays aligned with human preferences.

Conclusion: The proposed benchmark and instruction strategy successfully address LLMs’ deficiencies in cinematic script generation and improve screenplay quality.

Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in generating highly structured texts. However, while exhibiting a high degree of structural organization, movie scripts demand an additional layer of nuanced storytelling and emotional depth-the ‘soul’ of compelling cinema-that LLMs often fail to capture. To investigate this deficiency, we first curated CML-Dataset, a dataset comprising (summary, content) pairs for Cinematic Markup Language (CML), where ‘content’ consists of segments from esteemed, high-quality movie scripts and ‘summary’ is a concise description of the content. Through an in-depth analysis of the intrinsic multi-shot continuity and narrative structures within these authentic scripts, we identified three pivotal dimensions for quality assessment: Dialogue Coherence (DC), Character Consistency (CC), and Plot Reasonableness (PR). Informed by these findings, we propose the CML-Bench, featuring quantitative metrics across these dimensions. CML-Bench effectively assigns high scores to well-crafted, human-written scripts while concurrently pinpointing the weaknesses in screenplays generated by LLMs. To further validate our benchmark, we introduce CML-Instruction, a prompting strategy with detailed instructions on character dialogue and event logic, to guide LLMs to generate more structured and cinematically sound scripts. Extensive experiments validate the effectiveness of our benchmark and demonstrate that LLMs guided by CML-Instruction generate higher-quality screenplays, with results aligned with human preferences.

[169] User to Video: A Model for Spammer Detection Inspired by Video Classification Technology

Haoyang Zhang, Zhou Yang, Yucai Pang

Main category: cs.CV

TL;DR: UVSD is a spammer detection model that treats user behavior as video frames, using pixelization and video classification techniques to identify spammers with improved performance on WEIBO and TWITTER datasets.

Details

Motivation: The paper is inspired by video classification technology, viewing user behavior subspace as frame images and consecutive frames as video to develop a novel spammer detection approach.

Method: Proposes UVSD model with three main components: user2piexl algorithm for user pixelization (users as pixels, stances as RGB), behavior2image algorithm for transforming behavior subspace into frame images using representation learning and diffusion algorithms, and constructing user behavior videos based on temporal features combined with video classification.

Result: Experiments on WEIBO and TWITTER datasets show that UVSD outperforms state-of-the-art methods in spammer detection.

Conclusion: The UVSD model successfully applies video classification concepts to spammer detection by treating user behavior as video sequences, demonstrating superior performance compared to existing approaches.

Abstract: This article is inspired by video classification technology. If the user behavior subspace is viewed as a frame image, consecutive frame images are viewed as a video. Following this novel idea, a model for spammer detection based on user videoization, called UVSD, is proposed. Firstly, a user2piexl algorithm for user pixelization is proposed. Considering the adversarial behavior of user stances, the user is viewed as a pixel, and the stance is quantified as the pixel’s RGB. Secondly, a behavior2image algorithm is proposed for transforming user behavior subspace into frame images. Low-rank dense vectorization of subspace user relations is performed using representation learning, while cutting and diffusion algorithms are introduced to complete the frame imageization. Finally, user behavior videos are constructed based on temporal features. Subsequently, a video classification algorithm is combined to identify the spammers. Experiments using publicly available datasets, i.e., WEIBO and TWITTER, show an advantage of the UVSD model over state-of-the-art methods.

[170] Uncertainty Quantification In Surface Landmines and UXO Classification Using MC Dropout

Sagar Lekhak, Emmett J. Ientilucci, Dimah Dera, Susmita Ghosh

Main category: cs.CV

TL;DR: This paper proposes using Monte Carlo Dropout with ResNet-50 for uncertainty quantification in landmine/UXO detection, showing improved reliability under noisy and adversarial conditions.

Details

Motivation: Deterministic neural networks for landmine detection are vulnerable to noise and adversarial attacks, leading to missed detections and misclassifications in critical humanitarian demining operations.

Method: Integrated Monte Carlo Dropout into a fine-tuned ResNet-50 architecture for surface landmine/UXO classification, tested on simulated datasets under clean, noisy, and adversarial conditions.

Result: The model successfully flagged unreliable predictions under challenging conditions, demonstrating the value of uncertainty quantification for making more informed demining decisions.

Conclusion: Uncertainty quantification is essential for robust demining applications, highlighting vulnerabilities of existing neural networks and the need for more reliable models in practical humanitarian operations.

Abstract: Detecting surface landmines and unexploded ordnances (UXOs) using deep learning has shown promise in humanitarian demining. However, deterministic neural networks can be vulnerable to noisy conditions and adversarial attacks, leading to missed detection or misclassification. This study introduces the idea of uncertainty quantification through Monte Carlo (MC) Dropout, integrated into a fine-tuned ResNet-50 architecture for surface landmine and UXO classification, which was tested on a simulated dataset. Integrating the MC Dropout approach helps quantify epistemic uncertainty, providing an additional metric for prediction reliability, which could be helpful to make more informed decisions in demining operations. Experimental results on clean, adversarially perturbed, and noisy test images demonstrate the model’s ability to flag unreliable predictions under challenging conditions. This proof-of-concept study highlights the need for uncertainty quantification in demining, raises awareness about the vulnerability of existing neural networks in demining to adversarial threats, and emphasizes the importance of developing more robust and reliable models for practical applications.

[171] multimodars: A Rust-powered toolkit for multi-modality cardiac image fusion and registration

Anselm W. Stark, Marc Ilic, Ali Mokhtari, Pooya Mohammadi Kazaj, Christoph Graeni, Isaac Shiri

Main category: cs.CV

TL;DR: multimodars is an open toolkit for deterministic fusion of intravascular imaging and CCTA data, enabling multi-state coronary analysis with high performance and pipeline integration.

Details

Motivation: Existing methods lack an open, flexible toolkit for multi-state coronary analysis that combines intravascular imaging's high resolution with CCTA's 3D context while offering deterministic behavior and easy integration.

Method: Uses deterministic alignment algorithms, a compact NumPy-centered data model, and an optimized Rust backend for scalable performance. Accepts CSV/NumPy inputs including AIVUS-CAA formats.

Result: Provides a solution for reproducible multi-state coronary analysis combining complementary imaging modalities with deterministic fusion capabilities.

Conclusion: multimodars fills the gap for open, flexible coronary imaging fusion toolkit with deterministic alignment, high performance, and pipeline integration suitable for clinical research.

Abstract: Combining complementary imaging modalities is critical to build reliable 3D coronary models: intravascular imaging gives sub-millimetre resolution but limited whole-vessel context, while CCTA supplies 3D geometry but suffers from limited spatial resolution and artefacts (e.g., blooming). Prior work demonstrated intravascular/CCTA fusion, yet no open, flexible toolkit is tailored for multi-state analysis (rest/stress, pre-/post-stenting) while offering deterministic behaviour, high performance, and easy pipeline integration. multimodars addresses this gap with deterministic alignment algorithms, a compact NumPy-centred data model, and an optimised Rust backend suitable for scalable, reproducible experiments. The package accepts CSV/NumPy inputs including data formats produced by the AIVUS-CAA software

[172] Does Physics Knowledge Emerge in Frontier Models?

Ieva Bagdonaviciute, Vibhav Vineet

Main category: cs.CV

TL;DR: VLMs show strong visual perception but weak physical dynamics understanding, with fragmented skills that don’t combine into causal reasoning.

Details

Motivation: To benchmark VLMs' ability to understand and predict physical dynamics, as current models excel in perception but their physics reasoning capabilities remain unclear.

Method: Benchmarked 6 frontier VLMs on 3 physical simulation datasets (CLEVRER, Physion, Physion++) with diagnostic subtests isolating perception from physics reasoning.

Result: Weak correlations found between diagnostic performance and evaluation accuracy - models excelling at perception or physics reasoning don’t consistently perform better on predictive/counterfactual tasks.

Conclusion: Current VLMs have fragmented perceptual and physics skills that fail to combine into causal understanding, highlighting the need for architectures that better bind perception and reasoning.

Abstract: Leading Vision-Language Models (VLMs) show strong results in visual perception and general reasoning, but their ability to understand and predict physical dynamics remains unclear. We benchmark six frontier VLMs on three physical simulation datasets - CLEVRER, Physion, and Physion++ - where the evaluation tasks test whether a model can predict outcomes or hypothesize about alternative situations. To probe deeper, we design diagnostic subtests that isolate perception (objects, colors, occluders) from physics reasoning (motion prediction, spatial relations). Intuitively, stronger diagnostic performance should support higher evaluation accuracy. Yet our analysis reveals weak correlations: models that excel at perception or physics reasoning do not consistently perform better on predictive or counterfactual evaluation. This counterintuitive gap exposes a central limitation of current VLMs: perceptual and physics skills remain fragmented and fail to combine into causal understanding, underscoring the need for architectures that bind perception and reasoning more tightly.

[173] Enhanced Self-Distillation Framework for Efficient Spiking Neural Network Training

Xiaochen Zhao, Chengting Yu, Kairong Yu, Lei Liu, Aili Wang

Main category: cs.CV

TL;DR: Proposes an enhanced self-distillation framework for Spiking Neural Networks (SNNs) that reduces training complexity while achieving high performance by projecting firing rates onto lightweight ANN branches and using reliable self-generated knowledge.

Details

Motivation: Conventional SNN training methods based on surrogate gradients and BPTT have performance gaps compared to ANNs and incur significant computational/memory overheads that grow linearly with temporal dimension, making high-performance training challenging under limited resources.

Method: Enhanced self-distillation framework jointly optimized with rate-based backpropagation. Firing rates of intermediate SNN layers are projected onto lightweight ANN branches, and high-quality self-generated knowledge is used to optimize substructures through ANN pathways. Teacher signals are decoupled into reliable and unreliable components to ensure only reliable knowledge guides optimization.

Result: Extensive experiments on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate reduced training complexity while achieving high-performance SNN training.

Conclusion: The proposed method successfully enables high-performance SNN training under limited computational resources by addressing the limitations of conventional training approaches through an enhanced self-distillation framework.

Abstract: Spiking Neural Networks (SNNs) exhibit exceptional energy efficiency on neuromorphic hardware due to their sparse activation patterns. However, conventional training methods based on surrogate gradients and Backpropagation Through Time (BPTT) not only lag behind Artificial Neural Networks (ANNs) in performance, but also incur significant computational and memory overheads that grow linearly with the temporal dimension. To enable high-performance SNN training under limited computational resources, we propose an enhanced self-distillation framework, jointly optimized with rate-based backpropagation. Specifically, the firing rates of intermediate SNN layers are projected onto lightweight ANN branches, and high-quality knowledge generated by the model itself is used to optimize substructures through the ANN pathways. Unlike traditional self-distillation paradigms, we observe that low-quality self-generated knowledge may hinder convergence. To address this, we decouple the teacher signal into reliable and unreliable components, ensuring that only reliable knowledge is used to guide the optimization of the model. Extensive experiments on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate that our method reduces training complexity while achieving high-performance SNN training. Our code is available at https://github.com/Intelli-Chip-Lab/enhanced-self-distillation-framework-for-snn.

[174] Ensemble Deep Learning and LLM-Assisted Reporting for Automated Skin Lesion Diagnosis

Sher Khan, Raz Muhammad, Adil Hussain, Muhammad Sajjad, Muhammad Rashid

Main category: cs.CV

TL;DR: A unified AI framework for dermatological diagnostics that combines heterogeneous neural networks with integrated language models to improve diagnostic reliability and patient communication.

Details

Motivation: Current dermatological diagnostics suffer from inter-observer variability, access disparities, dataset biases, and fragmented AI approaches that treat language processing as separate from clinical decision-making.

Method: Two synergistic innovations: 1) A heterogeneous ensemble of diverse CNNs with intrinsic uncertainty mechanisms for flagging discordant cases, and 2) Direct integration of large language models into diagnostic workflow to generate structured clinical reports with patient education.

Result: The framework bridges the translational gap by addressing both diagnostic reliability and communication barriers, creating clinically meaningful assessments that fulfill documentation requirements while empowering patients.

Conclusion: This represents a significant advancement toward deployable dermatological AI that enhances diagnostic precision and supports the continuum of care from detection through patient education, ultimately improving early intervention rates.

Abstract: Cutaneous malignancies demand early detection for favorable outcomes, yet current diagnostics suffer from inter-observer variability and access disparities. While AI shows promise, existing dermatological systems are limited by homogeneous architectures, dataset biases across skin tones, and fragmented approaches that treat natural language processing as separate post-hoc explanations rather than integral to clinical decision-making. We introduce a unified framework that fundamentally reimagines AI integration for dermatological diagnostics through two synergistic innovations. First, a purposefully heterogeneous ensemble of architecturally diverse convolutional neural networks provides complementary diagnostic perspectives, with an intrinsic uncertainty mechanism flagging discordant cases for specialist review – mimicking clinical best practices. Second, we embed large language model capabilities directly into the diagnostic workflow, transforming classification outputs into clinically meaningful assessments that simultaneously fulfill medical documentation requirements and deliver patient-centered education. This seamless integration generates structured reports featuring precise lesion characterization, accessible diagnostic reasoning, and actionable monitoring guidance – empowering patients to recognize early warning signs between visits. By addressing both diagnostic reliability and communication barriers within a single cohesive system, our approach bridges the critical translational gap that has prevented previous AI implementations from achieving clinical impact. The framework represents a significant advancement toward deployable dermatological AI that enhances diagnostic precision while actively supporting the continuum of care from initial detection through patient education, ultimately improving early intervention rates for skin lesions.

[175] Online Generic Event Boundary Detection

Hyungrok Jung, Daneul Kim, Seunggyun Lim, Jeany Son, Jonghyun Choi

Main category: cs.CV

TL;DR: Introduces Online Generic Event Boundary Detection (On-GEBD) for real-time event boundary detection in streaming videos, proposing an Estimator framework inspired by human event segmentation theory.

Details

Motivation: Current GEBD methods require complete video frames for processing, unlike humans who process data online and in real-time, creating a gap between computational methods and human perception.

Method: Proposed Estimator framework with two components: Consistent Event Anticipator (CEA) that predicts future frames based on prior frames, and Online Boundary Discriminator (OBD) that measures prediction errors and adaptively adjusts thresholds using statistical tests.

Result: Outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods on Kinetics-GEBD and TAPOS datasets.

Conclusion: The proposed On-GEBD framework successfully bridges the gap between offline processing and human-like real-time event boundary detection, demonstrating the effectiveness of leveraging prediction errors inspired by human event segmentation theory.

Abstract: Generic Event Boundary Detection (GEBD) aims to interpret long-form videos through the lens of human perception. However, current GEBD methods require processing complete video frames to make predictions, unlike humans processing data online and in real-time. To bridge this gap, we introduce a new task, Online Generic Event Boundary Detection (On-GEBD), aiming to detect boundaries of generic events immediately in streaming videos. This task faces unique challenges of identifying subtle, taxonomy-free event changes in real-time, without the access to future frames. To tackle these challenges, we propose a novel On-GEBD framework, Estimator, inspired by Event Segmentation Theory (EST) which explains how humans segment ongoing activity into events by leveraging the discrepancies between predicted and actual information. Our framework consists of two key components: the Consistent Event Anticipator (CEA), and the Online Boundary Discriminator (OBD). Specifically, the CEA generates a prediction of the future frame reflecting current event dynamics based solely on prior frames. Then, the OBD measures the prediction error and adaptively adjusts the threshold using statistical tests on past errors to capture diverse, subtle event transitions. Experimental results demonstrate that Estimator outperforms all baselines adapted from recent online video understanding models and achieves performance comparable to prior offline-GEBD methods on the Kinetics-GEBD and TAPOS datasets.

[176] Vision Transformer for Transient Noise Classification

Divyansh Srivastava, Andrzej Niedzielski

Main category: cs.CV

TL;DR: This paper uses Vision Transformer (ViT) to classify transient noise (glitches) in LIGO gravitational wave data into 22 existing classes plus 2 new classes from O3a run, achieving 92.26% classification efficiency.

Details

Motivation: Transient noise (glitches) in LIGO data hinders gravitational wave detection, and with the O3 run adding two new noise classes, there's a need to train new models for effective classification.

Method: Train a pre-trained Vision Transformer (ViT-B/32) model on a combined dataset consisting of the Gravity Spy dataset with the additional two classes from LIGO O3a run.

Result: Achieved a classification efficiency of 92.26% for classifying glitches into 24 total classes.

Conclusion: Vision Transformer demonstrates potential to improve gravitational wave detection accuracy by effectively distinguishing transient noise.

Abstract: Transient noise (glitches) in LIGO data hinders the detection of gravitational waves (GW). The Gravity Spy project has categorized these noise events into various classes. With the O3 run, there is the inclusion of two additional noise classes and thus a need to train new models for effective classification. We aim to classify glitches in LIGO data into 22 existing classes from the first run plus 2 additional noise classes from O3a using the Vision Transformer (ViT) model. We train a pre-trained Vision Transformer (ViT-B/32) model on a combined dataset consisting of the Gravity Spy dataset with the additional two classes from the LIGO O3a run. We achieve a classification efficiency of 92.26%, demonstrating the potential of Vision Transformer to improve the accuracy of gravitational wave detection by effectively distinguishing transient noise. Key words: gravitational waves –vision transformer –machine learning

[177] General and Efficient Visual Goal-Conditioned Reinforcement Learning using Object-Agnostic Masks

Fahim Shahriar, Cheryl Wang, Alireza Azimi, Gautham Vasan, Hany Hamed Elanwar, A. Rupam Mahmood, Colin Bellinger

Main category: cs.CV

TL;DR: The paper proposes a mask-based goal representation system for goal-conditioned reinforcement learning that enables efficient learning and superior generalization compared to existing methods.

Details

Motivation: Existing goal representation methods in GCRL face issues like poor generalization to unseen objects, slow convergence, and dependency on special cameras. The authors aim to develop a more robust and efficient goal representation system.

Method: The method uses mask-based goal representation that provides object-agnostic visual cues. Masks can be processed to generate dense rewards without distance calculations. The approach leverages ground truth masks in simulation and can utilize pretrained open vocabulary object detection models for mask generation in real-world applications.

Result: The method achieved 99.9% reaching accuracy on both training and unseen test objects. It successfully performed pick-up tasks with high accuracy without using positional information. The approach was demonstrated through learning from scratch and sim-to-real transfer applications using two different physical robots.

Conclusion: Mask-based goal representation is an effective approach for GCRL that enables efficient learning, superior generalization to unseen objects, and successful sim-to-real transfer without requiring positional information or special cameras.

Abstract: Goal-conditioned reinforcement learning (GCRL) allows agents to learn diverse objectives using a unified policy. The success of GCRL, however, is contingent on the choice of goal representation. In this work, we propose a mask-based goal representation system that provides object-agnostic visual cues to the agent, enabling efficient learning and superior generalization. In contrast, existing goal representation methods, such as target state images, 3D coordinates, and one-hot vectors, face issues of poor generalization to unseen objects, slow convergence, and the need for special cameras. Masks can be processed to generate dense rewards without requiring error-prone distance calculations. Learning with ground truth masks in simulation, we achieved 99.9% reaching accuracy on training and unseen test objects. Our proposed method can be utilized to perform pick-up tasks with high accuracy, without using any positional information of the target. Moreover, we demonstrate learning from scratch and sim-to-real transfer applications using two different physical robots, utilizing pretrained open vocabulary object detection models for mask generation.

[178] Improving the Spatial Resolution of GONG Solar Images to GST Quality Using Deep Learning

Chenyang Li, Qin Li, Haimin Wang, Bo Shen

Main category: cs.CV

TL;DR: GAN-based super-resolution approach enhances low-resolution full-disk Hα solar images from GONG to match high-resolution quality of BBSO/GST observations, recovering fine details in sunspot penumbrae, filaments and fibrils.

Details

Motivation: High-resolution solar imaging is crucial for capturing fine-scale dynamic features like filaments and fibrils, but current full-disk Hα images have limited spatial resolution that cannot resolve these small-scale structures.

Method: Proposed a GAN-based super-resolution approach using Real-ESRGAN with Residual-in-Residual Dense Blocks and a relativistic discriminator, applied to carefully aligned GONG-GST image pairs.

Result: The model effectively recovers fine details within sunspot penumbrae and resolves fine details in filaments and fibrils, achieving MSE of 467.15, RMSE of 21.59, and cross-correlation of 0.7794.

Conclusion: Slight misalignments between image pairs limit quantitative performance, which will be addressed in future work alongside dataset expansion to further improve reconstruction quality.

Abstract: High-resolution (HR) solar imaging is crucial for capturing fine-scale dynamic features such as filaments and fibrils. However, the spatial resolution of the full-disk H$\alpha$ images is limited and insufficient to resolve these small-scale structures. To address this, we propose a GAN-based superresolution approach to enhance low-resolution (LR) full-disk H$\alpha$ images from the Global Oscillation Network Group (GONG) to a quality comparable with HR observations from the Big Bear Solar Observatory/Goode Solar Telescope (BBSO/GST). We employ Real-ESRGAN with Residual-in-Residual Dense Blocks and a relativistic discriminator. We carefully aligned GONG-GST pairs. The model effectively recovers fine details within sunspot penumbrae and resolves fine details in filaments and fibrils, achieving an average mean squared error (MSE) of 467.15, root mean squared error (RMSE) of 21.59, and cross-correlation (CC) of 0.7794. Slight misalignments between image pairs limit quantitative performance, which we plan to address in future work alongside dataset expansion to further improve reconstruction quality.

[179] ChainMPQ: Interleaved Text-Image Reasoning Chains for Mitigating Relation Hallucinations

Yike Wu, Yiwei Wang, Yujun Cai

Main category: cs.CV

TL;DR: ChainMPQ is a training-free method that reduces relation hallucinations in Large Vision-Language Models by using multi-perspective questions and interleaved chains of images and text to improve relational reasoning.

Details

Motivation: Relation hallucinations account for the largest proportion of hallucinations in LVLMs but have received the least attention, hindering model reliability despite strong multimodal task performance.

Method: ChainMPQ extracts subject/object keywords to enhance image regions, constructs multi-perspective questions focusing on subject, object, and relation components, and sequentially inputs them with accumulated textual/visual memories forming an interleaved chain for progressive reasoning.

Result: Experiments on multiple LVLMs and benchmarks show ChainMPQ substantially reduces relation hallucinations, with ablation studies validating the effectiveness of its three core modules.

Conclusion: ChainMPQ effectively addresses the under-explored problem of relation hallucinations in LVLMs through a training-free approach that leverages multi-perspective questioning and interleaved memory chains for improved relational inference.

Abstract: While Large Vision-Language Models (LVLMs) achieve strong performance in multimodal tasks, hallucinations continue to hinder their reliability. Among the three categories of hallucinations, which include object, attribute, and relation, relation hallucinations account for the largest proportion but have received the least attention. To address this issue, we propose ChainMPQ (Multi-Perspective Questions guided Interleaved Chain of Image and Text), a training-free method that improves relational inference in LVLMs by utilizing accumulated textual and visual memories. ChainMPQ first extracts subject and object keywords from the question to enhance the corresponding image regions. It then constructs multi-perspective questions that focus on the three core components of a relationship: the subject, the object, and the relation that links them. These questions are sequentially input to the model, with textual and visual memories from earlier steps providing supporting context for subsequent ones, thereby forming an interleaved chain of images and text that guides progressive relational reasoning. Experiments on multiple LVLMs and benchmarks show that ChainMPQ substantially reduces relation hallucinations, while ablation studies further validate the effectiveness of its three core modules.

[180] Efficient High-Resolution Image Editing with Hallucination-Aware Loss and Adaptive Tiling

Young D. Kwon, Abhinav Mehrotra, Malcolm Chadwick, Alberto Gil Ramos, Sourav Bhattacharya

Main category: cs.CV

TL;DR: MobilePicasso enables efficient 4K image editing on mobile devices through a three-stage pipeline that reduces computational costs and memory usage while improving image quality and reducing hallucinations.

Details

Motivation: Existing diffusion models for image editing face significant memory and quality challenges when deployed on resource-constrained mobile devices, requiring a solution that can handle high-resolution editing efficiently.

Method: Three-stage pipeline: (1) image editing at standard resolution with hallucination-aware loss, (2) latent projection to avoid pixel space conversion, and (3) upscaling with adaptive context-preserving tiling.

Result: Improves image quality by 18-48%, reduces hallucinations by 14-51%, achieves up to 55.8× speed-up with only 9% memory increase, and runs faster on-device than server-based models on A100 GPU.

Conclusion: MobilePicasso successfully enables high-quality 4K image editing on mobile devices with significantly improved efficiency, reduced hallucinations, and better performance than server-based alternatives.

Abstract: High-resolution (4K) image-to-image synthesis has become increasingly important for mobile applications. Existing diffusion models for image editing face significant challenges, in terms of memory and image quality, when deployed on resource-constrained devices. In this paper, we present MobilePicasso, a novel system that enables efficient image editing at high resolutions, while minimising computational cost and memory usage. MobilePicasso comprises three stages: (i) performing image editing at a standard resolution with hallucination-aware loss, (ii) applying latent projection to overcome going to the pixel space, and (iii) upscaling the edited image latent to a higher resolution with adaptive context-preserving tiling. Our user study with 46 participants reveals that MobilePicasso not only improves image quality by 18-48% but reduces hallucinations by 14-51% over existing methods. MobilePicasso demonstrates significantly lower latency, e.g., up to 55.8$\times$ speed-up, yet with a small increase in runtime memory, e.g., a mere 9% increase over prior work. Surprisingly, the on-device runtime of MobilePicasso is observed to be faster than a server-based high-resolution image editing model running on an A100 GPU.

[181] Spatiotemporal Tile-based Attention-guided LSTMs for Traffic Video Prediction

Tu Nguyen

Main category: cs.CV

TL;DR: Proposes a tile-aware, cascaded-memory Conv-LSTM with cross-frame attention for traffic forecasting, enabling scalable modeling of both fine-grained and coarse spatial structures while preserving temporal relationships.

Details

Motivation: To model both fine-grained (pixel-level) and coarse (region-level) spatial structure while preserving temporal relationships across long sequences in traffic forecasting, addressing scalability challenges for large maps.

Method: Introduces a tile-aware, cascaded-memory Conv-LSTM augmented with cross-frame additive attention and a memory-flexible training scheme where frames are sampled per spatial tile to learn tile-local dynamics, with per-tile memory cells that can be updated sparsely, paged, or compressed.

Result: Provides theoretical analysis (tight softmax/attention Lipschitz bound and tiling error lower bound) explaining stability and memory-accuracy tradeoffs, and empirically demonstrates improved scalability and competitive forecasting performance on large-scale traffic heatmaps.

Conclusion: The proposed approach enables scalable traffic forecasting by effectively managing memory while maintaining competitive performance through tile-aware processing and flexible memory management strategies.

Abstract: This extended abstract describes our solution for the Traffic4Cast Challenge 2019. The task requires modeling both fine-grained (pixel-level) and coarse (region-level) spatial structure while preserving temporal relationships across long sequences. Building on Conv-LSTM ideas, we introduce a tile-aware, cascaded-memory Conv-LSTM augmented with cross-frame additive attention and a memory-flexible training scheme: frames are sampled per spatial tile so the model learns tile-local dynamics and per-tile memory cells can be updated sparsely, paged, or compressed to scale to large maps. We provide a compact theoretical analysis (tight softmax/attention Lipschitz bound and a tiling error lower bound) explaining stability and the memory-accuracy tradeoffs, and empirically demonstrate improved scalability and competitive forecasting performance on large-scale traffic heatmaps.

[182] RGBD Gaze Tracking Using Transformer for Feature Fusion

Tobias J. Bauer

Main category: cs.CV

TL;DR: Implementation of AI-based gaze tracking using RGBD images with Transformer feature fusion, achieving competitive results on multiple datasets.

Details

Motivation: To investigate the unexplored combination of RGBD input images and Transformers for gaze tracking, and address limitations of existing datasets that lack depth information or suitable labels for gaze angle estimation.

Method: Uses RGBD images with Transformer architecture for feature fusion, based on Lian et al.’s GAN approach. Tests various configurations including with/without pre-trained GAN and Transformer vs MLP modules.

Result: Best performance achieved without pre-trained GAN module and using MLP instead of Transformer: 30.1mm mean Euclidean error on ShanghaiTechGaze+ (vs 38.7mm by Lian et al.), and 3.26° mean angular error on ETH-XGaze (vs 2.04° by dataset authors).

Conclusion: Transformer-based feature fusion shows promise but MLP performs better. The approach demonstrates viability of RGBD-based gaze tracking, though state-of-the-art performance is not achieved.

Abstract: Subject of this thesis is the implementation of an AI-based Gaze Tracking system using RGBD images that contain both color (RGB) and depth (D) information. To fuse the features extracted from the images, a module based on the Transformer architecture is used. The combination of RGBD input images and Transformers was chosen because it has not yet been investigated. Furthermore, a new dataset is created for training the AI models as existing datasets either do not contain depth information or only contain labels for Gaze Point Estimation that are not suitable for the task of Gaze Angle Estimation. Various model configurations are trained, validated and evaluated on a total of three different datasets. The trained models are then to be used in a real-time pipeline to estimate the gaze direction and thus the gaze point of a person in front of a computer screen. The AI model architecture used in this thesis is based on an earlier work by Lian et al. It uses a Generative Adversarial Network (GAN) to simultaneously remove depth map artifacts and extract head pose features. Lian et al. achieve a mean Euclidean error of 38.7mm on their own dataset ShanghaiTechGaze+. In this thesis, a model architecture with a Transformer module for feature fusion achieves a mean Euclidean error of 55.3mm on the same dataset, but we show that using no pre-trained GAN module leads to a mean Euclidean error of 30.1mm. Replacing the Transformer module with a Multilayer Perceptron (MLP) improves the error to 26.9mm. These results are coherent with the ones on the other two datasets. On the ETH-XGaze dataset, the model with Transformer module achieves a mean angular error of 3.59{\deg} and without Transformer module 3.26{\deg}, whereas the fundamentally different model architecture used by the dataset authors Zhang et al. achieves a mean angular error of 2.04{\deg}. On the OTH-Gaze-Estimation dataset created for…

[183] Scalable deep fusion of spaceborne lidar and synthetic aperture radar for global forest structural complexity mapping

Tiago de Conto, John Armston, Ralph Dubayah

Main category: cs.CV

TL;DR: A deep learning framework fuses GEDI lidar with SAR data to create global high-resolution forest structural complexity maps, achieving high accuracy with minimal computing requirements.

Details

Motivation: To overcome the sparse sampling limitation of GEDI lidar for continuous high-resolution forest structural complexity mapping, enabling global monitoring of forest dynamics.

Method: Adapted EfficientNetV2 architecture trained on 130+ million GEDI footprints, fusing GEDI observations with multimodal SAR datasets to produce 25m resolution wall-to-wall maps.

Result: Achieved global R² = 0.82 with fewer than 400,000 parameters, producing accurate predictions with calibrated uncertainty across biomes and time periods. Generated global multi-temporal dataset from 2015-2022.

Conclusion: This scalable framework enables continuous monitoring of forest structural dynamics, supports biodiversity conservation, and can be extended to predict additional forest variables through transfer learning.

Abstract: Forest structural complexity metrics integrate multiple canopy attributes into a single value that reflects habitat quality and ecosystem function. Spaceborne lidar from the Global Ecosystem Dynamics Investigation (GEDI) has enabled mapping of structural complexity in temperate and tropical forests, but its sparse sampling limits continuous high-resolution mapping. We present a scalable, deep learning framework fusing GEDI observations with multimodal Synthetic Aperture Radar (SAR) datasets to produce global, high-resolution (25 m) wall-to-wall maps of forest structural complexity. Our adapted EfficientNetV2 architecture, trained on over 130 million GEDI footprints, achieves high performance (global R2 = 0.82) with fewer than 400,000 parameters, making it an accessible tool that enables researchers to process datasets at any scale without requiring specialized computing infrastructure. The model produces accurate predictions with calibrated uncertainty estimates across biomes and time periods, preserving fine-scale spatial patterns. It has been used to generate a global, multi-temporal dataset of forest structural complexity from 2015 to 2022. Through transfer learning, this framework can be extended to predict additional forest structural variables with minimal computational cost. This approach supports continuous, multi-temporal monitoring of global forest structural dynamics and provides tools for biodiversity conservation and ecosystem management efforts in a changing climate.

[184] Guardians of Image Quality: Benchmarking Defenses Against Adversarial Attacks on Image Quality Metrics

Alexander Gushchin, Khaled Abud, Georgii Bychkov, Ekaterina Shumitskaya, Anna Chistyakova, Sergey Lavrushkin, Bader Rasheed, Kirill Malyshev, Dmitriy Vatolin, Anastasia Antsiferova

Main category: cs.CV

TL;DR: A comprehensive benchmark evaluating 25 defense mechanisms against adversarial attacks on Image Quality Assessment metrics, testing them with 14 attack algorithms in various settings.

Details

Motivation: The adversarial robustness of Image Quality Assessment metrics is a critical concern that needs systematic evaluation of defense mechanisms.

Method: Systematically evaluated 25 defense strategies including adversarial purification, adversarial training, and certified robustness methods against 14 adversarial attack algorithms in non-adaptive and adaptive settings.

Result: Analysis of differences between defenses and their applicability to IQA tasks, considering preservation of IQA scores and image quality.

Conclusion: The proposed benchmark provides guidance for future developments and accepts new method submissions, with ongoing results available online.

Abstract: In the field of Image Quality Assessment (IQA), the adversarial robustness of the metrics poses a critical concern. This paper presents a comprehensive benchmarking study of various defense mechanisms in response to the rise in adversarial attacks on IQA. We systematically evaluate 25 defense strategies, including adversarial purification, adversarial training, and certified robustness methods. We applied 14 adversarial attack algorithms of various types in both non-adaptive and adaptive settings and tested these defenses against them. We analyze the differences between defenses and their applicability to IQA tasks, considering that they should preserve IQA scores and image quality. The proposed benchmark aims to guide future developments and accepts submissions of new methods, with the latest results available online: https://videoprocessing.ai/benchmarks/iqa-defenses.html.

Yi Xin, Qi Qin, Siqi Luo, Kaiwen Zhu, Juncheng Yan, Yan Tai, Jiayi Lei, Yuewen Cao, Keqi Wang, Yibin Wang, Jinbin Bai, Qian Yu, Dengyang Jiang, Yuandong Pu, Haoxing Chen, Le Zhuo, Junjun He, Gen Luo, Tianbin Li, Ming Hu, Jin Ye, Shenglong Ye, Bo Zhang, Chang Xu, Wenhai Wang, Hongsheng Li, Guangtao Zhai, Tianfan Xue, Bin Fu, Xiaohong Liu, Yu Qiao, Yihao Liu

Main category: cs.CV

TL;DR: Lumina-DiMOO is an open-source foundational model that uses fully discrete diffusion modeling for multi-modal generation and understanding, achieving state-of-the-art performance across various tasks with higher sampling efficiency than previous approaches.

Details

Motivation: To create a unified model that can handle multiple modalities (text, images) more efficiently than existing autoregressive or hybrid approaches, enabling seamless multi-modal generation and understanding.

Method: Utilizes fully discrete diffusion modeling to handle inputs and outputs across various modalities, providing higher sampling efficiency compared to autoregressive or hybrid AR-Diffusion paradigms.

Result: Achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. Supports text-to-image generation, image-to-image generation (editing, subject-driven generation, inpainting), and image understanding.

Conclusion: Lumina-DiMOO demonstrates the effectiveness of discrete diffusion modeling for multi-modal tasks and is released as open-source to foster further research in multi-modal and discrete diffusion models.

Abstract: We introduce Lumina-DiMOO, an open-source foundational model for seamless multi-modal generation and understanding. Lumina-DiMOO sets itself apart from prior unified models by utilizing a fully discrete diffusion modeling to handle inputs and outputs across various modalities. This innovative approach allows Lumina-DiMOO to achieve higher sampling efficiency compared to previous autoregressive (AR) or hybrid AR-Diffusion paradigms and adeptly support a broad spectrum of multi-modal tasks, including text-to-image generation, image-to-image generation (e.g., image editing, subject-driven generation, and image inpainting, etc.), as well as image understanding. Lumina-DiMOO achieves state-of-the-art performance on multiple benchmarks, surpassing existing open-source unified multi-modal models. To foster further advancements in multi-modal and discrete diffusion model research, we release our code and checkpoints to the community. Project Page: https://synbol.github.io/Lumina-DiMOO.

[186] DWTGS: Rethinking Frequency Regularization for Sparse-view 3D Gaussian Splatting

Hung Nguyen, Runfa Li, An Le, Truong Nguyen

Main category: cs.CV

TL;DR: DWTGS is a framework that uses wavelet-space losses for sparse-view 3D Gaussian Splatting, focusing on low-frequency supervision and high-frequency sparsity to prevent overfitting and improve generalization.

Details

Motivation: Sparse-view 3D Gaussian Splatting often overfits to high-frequency details of training views, and Fourier-based frequency regularization requires difficult parameter tuning and biases towards detrimental high-frequency learning.

Method: Proposes wavelet-space losses that supervise only low-frequency LL subbands at multiple DWT levels while enforcing sparsity on high-frequency HH subbands in a self-supervised manner.

Result: DWTGS consistently outperforms Fourier-based counterparts across benchmarks, improving generalization and reducing high-frequency hallucinations.

Conclusion: The low-frequency-centric wavelet strategy provides better spatial supervision and more effective frequency regularization for sparse-view 3DGS compared to Fourier-based approaches.

Abstract: Sparse-view 3D Gaussian Splatting (3DGS) presents significant challenges in reconstructing high-quality novel views, as it often overfits to the widely-varying high-frequency (HF) details of the sparse training views. While frequency regularization can be a promising approach, its typical reliance on Fourier transforms causes difficult parameter tuning and biases towards detrimental HF learning. We propose DWTGS, a framework that rethinks frequency regularization by leveraging wavelet-space losses that provide additional spatial supervision. Specifically, we supervise only the low-frequency (LF) LL subbands at multiple DWT levels, while enforcing sparsity on the HF HH subband in a self-supervised manner. Experiments across benchmarks show that DWTGS consistently outperforms Fourier-based counterparts, as this LF-centric strategy improves generalization and reduces HF hallucinations.

[187] TransFIRA: Transfer Learning for Face Image Recognizability Assessment

Allen Tu, Kartik Narayan, Joshua Gleason, Jennifer Xu, Matthew Meyn, Tom Goldstein, Vishal M. Patel

Main category: cs.CV

TL;DR: TransFIRA is a lightweight, annotation-free framework for face image recognizability assessment that uses embedding space geometry to measure recognizability via class-center similarity and angular separation, achieving state-of-the-art performance without external labels or backbone-specific training.

Details

Motivation: Existing face image quality assessment (FIQA) methods rely on visual heuristics, curated annotations, or intensive generative pipelines, making their predictions detached from the encoder's decision geometry and failing to predict true recognizability in unconstrained environments with extreme pose, blur, illumination, and occlusion variations.

Method: TransFIRA uses transfer learning to define recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), providing a decision-boundary-aligned criterion. It employs a recognizability-informed aggregation strategy without external labels, heuristics, or backbone-specific training.

Result: Achieves state-of-the-art verification accuracy on BRIAR and IJB-C datasets, nearly doubles correlation with true recognizability, and demonstrates strong performance on body recognition with robustness under cross-dataset shifts.

Conclusion: TransFIRA establishes a unified, geometry-driven framework for recognizability assessment that is encoder-specific, accurate, interpretable, and extensible across modalities, significantly advancing FIQA in accuracy, explainability, and scope.

Abstract: Face recognition in unconstrained environments such as surveillance, video, and web imagery must contend with extreme variation in pose, blur, illumination, and occlusion, where conventional visual quality metrics fail to predict whether inputs are truly recognizable to the deployed encoder. Existing FIQA methods typically rely on visual heuristics, curated annotations, or computationally intensive generative pipelines, leaving their predictions detached from the encoder’s decision geometry. We introduce TransFIRA (Transfer Learning for Face Image Recognizability Assessment), a lightweight and annotation-free framework that grounds recognizability directly in embedding space. TransFIRA delivers three advances: (i) a definition of recognizability via class-center similarity (CCS) and class-center angular separation (CCAS), yielding the first natural, decision-boundary–aligned criterion for filtering and weighting; (ii) a recognizability-informed aggregation strategy that achieves state-of-the-art verification accuracy on BRIAR and IJB-C while nearly doubling correlation with true recognizability, all without external labels, heuristics, or backbone-specific training; and (iii) new extensions beyond faces, including encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability, and the first recognizability-aware body recognition assessment. Experiments confirm state-of-the-art results on faces, strong performance on body recognition, and robustness under cross-dataset shifts. Together, these contributions establish TransFIRA as a unified, geometry-driven framework for recognizability assessment – encoder-specific, accurate, interpretable, and extensible across modalities – significantly advancing FIQA in accuracy, explainability, and scope.

[188] Unified Unsupervised Anomaly Detection via Matching Cost Filtering

Zhe Zhang, Mingxiu Cai, Gaochang Wu, Jing Zhang, Lingqiao Liu, Dacheng Tao, Tianyou Chai, Xiatian Zhu

Main category: cs.CV

TL;DR: Unified Cost Filtering (UCF) is a post-hoc refinement framework that enhances unsupervised anomaly detection by filtering matching noise in cost volumes, achieving state-of-the-art results across unimodal (RGB) and multimodal (RGB-3D, RGB-Text) scenarios.

Details

Motivation: Existing unsupervised anomaly detection methods suffer from matching noise limitations and lack unified approaches for both unimodal and multimodal settings, hindering comprehensive understanding and knowledge transfer.

Method: UCF constructs cost volumes by matching test samples against normal samples from same or different modalities, then applies a learnable filtering module with multi-layer attention guidance to mitigate matching noise and highlight subtle anomalies.

Result: Comprehensive experiments on 22 diverse benchmarks show UCF consistently enhances various UAD methods and achieves new state-of-the-art results in both unimodal and multimodal scenarios.

Conclusion: UCF provides a generic post-hoc refinement framework that effectively addresses matching noise in unsupervised anomaly detection, demonstrating strong performance across diverse unimodal and multimodal settings.

Abstract: Unsupervised anomaly detection (UAD) aims to identify image- and pixel-level anomalies using only normal training data, with wide applications such as industrial inspection and medical analysis, where anomalies are scarce due to privacy concerns and cold-start constraints. Existing methods, whether reconstruction-based (restoring normal counterparts) or embedding-based (pretrained representations), fundamentally conduct image- or feature-level matching to generate anomaly maps. Nonetheless, matching noise has been largely overlooked, limiting their detection ability. Beyond earlier focus on unimodal RGB-based UAD, recent advances expand to multimodal scenarios, e.g., RGB-3D and RGB-Text, enabled by point cloud sensing and vision-language models. Despite shared challenges, these lines remain largely isolated, hindering a comprehensive understanding and knowledge transfer. In this paper, we advocate unified UAD for both unimodal and multimodal settings in the matching perspective. Under this insight, we present Unified Cost Filtering (UCF), a generic post-hoc refinement framework for refining anomaly cost volume of any UAD model. The cost volume is constructed by matching a test sample against normal samples from the same or different modalities, followed by a learnable filtering module with multi-layer attention guidance from the test sample, mitigating matching noise and highlighting subtle anomalies. Comprehensive experiments on 22 diverse benchmarks demonstrate the efficacy of UCF in enhancing a variety of UAD methods, consistently achieving new state-of-the-art results in both unimodal (RGB) and multimodal (RGB-3D, RGB-Text) UAD scenarios. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.

[189] Road Surface Condition Detection with Machine Learning using New York State Department of Transportation Camera Images and Weather Forecast Data

Carly Sutter, Kara J. Sulia, Nick P. Bassill, Christopher D. Wirz, Christopher D. Thorncroft, Jay C. Rothenberger, Vanessa Przybylo, Mariana G. Cains, Jacob Radford, David Aaron Evans

Main category: cs.CV

TL;DR: Machine learning models using CNN and random forests trained on camera images and weather data to automatically classify road surface conditions for NYSDOT, achieving 81.5% accuracy on unseen cameras.

Details

Motivation: NYSDOT currently uses labor-intensive methods like driving on roads and observing live cameras to evaluate road conditions during winter weather events. Machine learning can provide automated support for critical operational decisions.

Method: Trained convolutional neural networks and random forests on ~22,000 hand-labeled camera images classified into six road surface conditions (severe snow, snow, wet, dry, poor visibility, obstructed), combined with weather data.

Result: The weather-related road surface condition model achieved 81.5% accuracy on completely unseen cameras, with model generalizability prioritized for operational needs.

Conclusion: Machine learning models can successfully automate road condition classification with high accuracy on unseen data, providing valuable support for transportation agencies during winter weather events.

Abstract: The New York State Department of Transportation (NYSDOT) has a network of roadside traffic cameras that are used by both the NYSDOT and the public to observe road conditions. The NYSDOT evaluates road conditions by driving on roads and observing live cameras, tasks which are labor-intensive but necessary for making critical operational decisions during winter weather events. However, machine learning models can provide additional support for the NYSDOT by automatically classifying current road conditions across the state. In this study, convolutional neural networks and random forests are trained on camera images and weather data to predict road surface conditions. Models are trained on a hand-labeled dataset of ~22,000 camera images, each classified by human labelers into one of six road surface conditions: severe snow, snow, wet, dry, poor visibility, or obstructed. Model generalizability is prioritized to meet the operational needs of the NYSDOT decision makers, and the weather-related road surface condition model in this study achieves an accuracy of 81.5% on completely unseen cameras.

[190] Platonic Transformers: A Solid Choice For Equivariance

Mohammad Mohaiminul Islam, Rishabh Anand, David R. Wessels, Friso de Kruiff, Thijs P. Kuipers, Rex Ying, Clara I. Sánchez, Sharvaree Vadgama, Georg Bökman, Erik J. Bekkers

Main category: cs.CV

TL;DR: The Platonic Transformer introduces a novel attention mechanism that achieves equivariance to continuous translations and Platonic symmetries while maintaining the exact architecture and computational cost of standard Transformers.

Details

Motivation: Transformers lack geometric symmetry inductive biases common in scientific and computer vision applications, while existing equivariant methods sacrifice Transformer efficiency and flexibility through complex designs.

Method: Defines attention relative to reference frames from Platonic solid symmetry groups, inducing principled weight-sharing that enables combined equivariance to continuous translations and Platonic symmetries.

Result: Achieves competitive performance across computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25) benchmarks by leveraging geometric constraints at no additional computational cost.

Conclusion: The Platonic Transformer resolves the trade-off between geometric equivariance and Transformer efficiency, providing a scalable solution that preserves standard Transformer architecture while incorporating geometric inductive biases.

Abstract: While widespread, Transformers lack inductive biases for geometric symmetries common in science and computer vision. Existing equivariant methods often sacrifice the efficiency and flexibility that make Transformers so effective through complex, computationally intensive designs. We introduce the Platonic Transformer to resolve this trade-off. By defining attention relative to reference frames from the Platonic solid symmetry groups, our method induces a principled weight-sharing scheme. This enables combined equivariance to continuous translations and Platonic symmetries, while preserving the exact architecture and computational cost of a standard Transformer. Furthermore, we show that this attention is formally equivalent to a dynamic group convolution, which reveals that the model learns adaptive geometric filters and enables a highly scalable, linear-time convolutional variant. Across diverse benchmarks in computer vision (CIFAR-10), 3D point clouds (ScanObjectNN), and molecular property prediction (QM9, OMol25), the Platonic Transformer achieves competitive performance by leveraging these geometric constraints at no additional cost.

[191] TDiff: Thermal Plug-And-Play Prior with Patch-Based Diffusion

Piyush Dashpute, Niki Nezakati, Wolfgang Heidrich, Vishwanath Saragadam

Main category: cs.CV

TL;DR: TDiff is a patch-based diffusion framework for thermal image restoration that handles low resolution, fixed pattern noise, and other localized degradations by training on small patches and blending them with spatial windowing.

Details

Motivation: Thermal images from low-cost cameras suffer from low resolution, fixed pattern noise, and localized degradations, while available datasets are limited in size and diversity.

Method: Patch-based diffusion framework that trains on small thermal patches, denoises overlapping patches, and blends them using smooth spatial windowing for full-resolution image restoration.

Result: Strong performance on denoising, super-resolution, and deblurring tasks on both simulated and real thermal data.

Conclusion: TDiff establishes a unified restoration pipeline for thermal imaging and is the first patch-based diffusion framework with learned prior for multiple thermal restoration tasks.

Abstract: Thermal images from low-cost cameras often suffer from low resolution, fixed pattern noise, and other localized degradations. Available datasets for thermal imaging are also limited in both size and diversity. To address these challenges, we propose a patch-based diffusion framework (TDiff) that leverages the local nature of these distortions by training on small thermal patches. In this approach, full-resolution images are restored by denoising overlapping patches and blending them using smooth spatial windowing. To our knowledge, this is the first patch-based diffusion framework that models a learned prior for thermal image restoration across multiple tasks. Experiments on denoising, super-resolution, and deblurring demonstrate strong results on both simulated and real thermal data, establishing our method as a unified restoration pipeline.

[192] SIGMA-GEN: Structure and Identity Guided Multi-subject Assembly for Image Generation

Oindrila Saha, Vojtech Krs, Radomir Mech, Subhransu Maji, Kevin Blackburn-Matzen, Matheus Gadelha

Main category: cs.CV

TL;DR: SIGMA-GEN is a unified framework for multi-identity preserving image generation that enables single-pass multi-subject generation with structural and spatial constraints, supporting various user guidance levels from coarse boxes to pixel-level inputs.

Details

Motivation: To address the limitations of prior approaches in multi-identity image generation by enabling simultaneous preservation of multiple identities with flexible user guidance at different precision levels.

Method: Introduces SIGMA-SET27K synthetic dataset with identity, structure, and spatial information for 100k+ subjects across 27k images, and develops a unified framework that supports user guidance from 2D/3D boxes to pixel-level segmentations and depth.

Result: Achieves state-of-the-art performance in identity preservation, image generation quality, and speed through extensive evaluation.

Conclusion: SIGMA-GEN represents a significant advancement in multi-identity image generation, offering superior performance and flexibility compared to existing methods.

Abstract: We present SIGMA-GEN, a unified framework for multi-identity preserving image generation. Unlike prior approaches, SIGMA-GEN is the first to enable single-pass multi-subject identity-preserved generation guided by both structural and spatial constraints. A key strength of our method is its ability to support user guidance at various levels of precision – from coarse 2D or 3D boxes to pixel-level segmentations and depth – with a single model. To enable this, we introduce SIGMA-SET27K, a novel synthetic dataset that provides identity, structure, and spatial information for over 100k unique subjects across 27k images. Through extensive evaluation we demonstrate that SIGMA-GEN achieves state-of-the-art performance in identity preservation, image generation quality, and speed. Code and visualizations at https://oindrilasaha.github.io/SIGMA-Gen/

[193] Superpixel Integrated Grids for Fast Image Segmentation

Jack Roberts, Jeova Farias Sales Rocha Neto

Main category: cs.CV

TL;DR: SIGRID is a new superpixel-based data structure that encodes color and shape information to reduce input dimensionality while maintaining or improving segmentation performance compared to full-resolution images.

Details

Motivation: Superpixels offer computational benefits but their irregular distribution has limited their use in deep learning, requiring specialized architectures that undermine their original efficiency advantages.

Method: Developed SIGRID (Superpixel-Integrated Grid) using classical shape descriptors to encode superpixel color and shape information, creating a structured representation that reduces input dimensionality for segmentation tasks.

Result: SIGRIDs matched or surpassed pixel-level segmentation performance on four benchmark datasets while significantly accelerating model training, demonstrating better accuracy-efficiency balance.

Conclusion: SIGRIDs successfully achieve the original motivation for superpixels by providing computational efficiency without compromising performance, making them a viable alternative to full-resolution images in segmentation tasks.

Abstract: Superpixels have long been used in image simplification to enable more efficient data processing and storage. However, despite their computational potential, their irregular spatial distribution has often forced deep learning approaches to rely on specialized training algorithms and architectures, undermining the original motivation for superpixelations. In this work, we introduce a new superpixel-based data structure, SIGRID (Superpixel-Integrated Grid), as an alternative to full-resolution images in segmentation tasks. By leveraging classical shape descriptors, SIGRID encodes both color and shape information of superpixels while substantially reducing input dimensionality. We evaluate SIGRIDs on four benchmark datasets using two popular convolutional segmentation architectures. Our results show that, despite compressing the original data, SIGRIDs not only match but in some cases surpass the performance of pixel-level representations, all while significantly accelerating model training. This demonstrates that SIGRIDs achieve a favorable balance between accuracy and computational efficiency.

[194] Text2Interact: High-Fidelity and Diverse Text-to-Two-Person Interaction Generation

Qingxuan Wu, Zhiyang Dou, Chuan Guo, Yiming Huang, Qiao Feng, Bing Zhou, Jian Wang, Lingjie Liu

Main category: cs.CV

TL;DR: Text2Interact framework generates realistic human-human interactions from text using scalable data synthesis and fine-grained spatiotemporal coordination, addressing limitations in training data and text-to-interaction modeling.

Details

Motivation: Current methods struggle with modeling human-human interactions due to limited two-person training data and coarse text conditioning that collapses rich prompts into single embeddings, hindering realistic spatiotemporal coupling.

Method: Proposes Text2Interact with two components: 1) InterCompose - scalable synthesis pipeline using LLM-generated descriptions with single-person motion priors, and 2) InterActor - text-to-interaction model with word-level conditioning and adaptive interaction loss for better coupling.

Result: Extensive experiments show consistent improvements in motion diversity, fidelity, and generalization, including successful performance in out-of-distribution scenarios and positive user study results.

Conclusion: The framework effectively addresses data scarcity and fine-grained modeling challenges, enabling realistic text-aligned human-human interaction generation with improved spatiotemporal coordination.

Abstract: Modeling human-human interactions from text remains challenging because it requires not only realistic individual dynamics but also precise, text-consistent spatiotemporal coupling between agents. Currently, progress is hindered by 1) limited two-person training data, inadequate to capture the diverse intricacies of two-person interactions; and 2) insufficiently fine-grained text-to-interaction modeling, where language conditioning collapses rich, structured prompts into a single sentence embedding. To address these limitations, we propose our Text2Interact framework, designed to generate realistic, text-aligned human-human interactions through a scalable high-fidelity interaction data synthesizer and an effective spatiotemporal coordination pipeline. First, we present InterCompose, a scalable synthesis-by-composition pipeline that aligns LLM-generated interaction descriptions with strong single-person motion priors. Given a prompt and a motion for an agent, InterCompose retrieves candidate single-person motions, trains a conditional reaction generator for another agent, and uses a neural motion evaluator to filter weak or misaligned samples-expanding interaction coverage without extra capture. Second, we propose InterActor, a text-to-interaction model with word-level conditioning that preserves token-level cues (initiation, response, contact ordering) and an adaptive interaction loss that emphasizes contextually relevant inter-person joint pairs, improving coupling and physical plausibility for fine-grained interaction modeling. Extensive experiments show consistent gains in motion diversity, fidelity, and generalization, including out-of-distribution scenarios and user studies. We will release code and models to facilitate reproducibility.

[195] From Captions to Keyframes: Efficient Video Summarization via Caption- and Context-Aware Frame Scoring

Shih-Yao Lin, Sibendu Paul, Caren Chen

Main category: cs.CV

TL;DR: KeyScore is a multimodal frame scoring framework that selects informative frames from videos by combining semantic similarity, temporal diversity, and contextual drop impact, achieving up to 99% frame reduction while outperforming standard methods on video-language tasks.

Details

Motivation: Efficient video-language understanding requires selecting a small set of frames that retain semantic and contextual information from long videos, avoiding the computational cost of processing all frames.

Method: Proposes KeyScore for multimodal frame scoring using captions and visual context, and STACFP for generating diverse frame proposals through spatio-temporal adaptive clustering.

Result: Achieves up to 99% frame reduction compared to full-frame inference and substantially outperforms standard 8-frame encoders on MSRVTT, MSVD, and DiDeMo datasets.

Conclusion: Emphasizing multimodal alignment between visual and textual signals enables scalable, efficient, and caption-grounded video understanding without explicit video summarization.

Abstract: Efficient video-language understanding requires selecting a small set of frames that retain semantic and contextual information from long videos. We propose KeyScore, a multimodal frame scoring framework that jointly leverages captions and visual context to estimate frame-level importance. By combining semantic similarity, temporal diversity, and contextual drop impact, KeyScore identifies the most informative frames for downstream tasks such as retrieval, captioning, and video-language reasoning. To complement KeyScore, we introduce STACFP (Spatio-Temporal Adaptive Clustering for Frame Proposals), which generates compact and diverse frame candidates for long-form videos. Together, these modules achieve up to 99% frame reduction compared to full-frame inference and substantially outperform standard 8-frame encoders on MSRVTT, MSVD, and DiDeMo. Our results demonstrate that emphasizing multimodal alignment between visual and textual signals enables scalable, efficient, and caption-grounded video understanding – without explicit video summarization.

[196] LogSTOP: Temporal Scores over Prediction Sequences for Matching and Retrieval

Avishree Khare, Hideki Okamoto, Bardh Hoxha, Georgios Fainekos, Rajeev Alur

Main category: cs.CV

TL;DR: STOPs (Scores for Temporal Properties) formalizes scoring temporal properties over sequences using noisy local detectors. LogSTOP efficiently computes scores for Linear Temporal Logic properties, outperforming baselines by 16-19% on query matching and retrieval tasks.

Details

Motivation: To enable downstream applications like query matching and ranked retrieval by lifting local detection scores (e.g., objects, emotions) to temporal properties over sequences.

Method: Proposes LogSTOP scoring function that efficiently computes scores for temporal properties represented in Linear Temporal Logic, using noisy local property predictors like YOLO and HuBERT.

Result: LogSTOP outperforms Large Vision/Audio Language Models and Temporal Logic baselines by at least 16% on query matching for objects-in-videos and emotions-in-speech. On ranked retrieval, it achieves 19% and 16% increase in mean average precision and recall over zero-shot text-to-video baselines.

Conclusion: LogSTOP provides an effective framework for scoring temporal properties over sequences using noisy local detectors, significantly improving performance on temporal query matching and retrieval tasks compared to existing approaches.

Abstract: Neural models such as YOLO and HuBERT can be used to detect local properties such as objects (“car”) and emotions (“angry”) in individual frames of videos and audio clips respectively. The likelihood of these detections is indicated by scores in [0, 1]. Lifting these scores to temporal properties over sequences can be useful for several downstream applications such as query matching (e.g., “does the speaker eventually sound happy in this audio clip?”), and ranked retrieval (e.g., “retrieve top 5 videos with a 10 second scene where a car is detected until a pedestrian is detected”). In this work, we formalize this problem of assigning Scores for TempOral Properties (STOPs) over sequences, given potentially noisy score predictors for local properties. We then propose a scoring function called LogSTOP that can efficiently compute these scores for temporal properties represented in Linear Temporal Logic. Empirically, LogSTOP, with YOLO and HuBERT, outperforms Large Vision / Audio Language Models and other Temporal Logic-based baselines by at least 16% on query matching with temporal properties over objects-in-videos and emotions-in-speech respectively. Similarly, on ranked retrieval with temporal properties over objects and actions in videos, LogSTOP with Grounding DINO and SlowR50 reports at least a 19% and 16% increase in mean average precision and recall over zero-shot text-to-video retrieval baselines respectively.

[197] Limited-Angle Tomography Reconstruction via Projector Guided 3D Diffusion

Zhantao Deng, Mériem Er-Rafik, Anna Sushko, Cécile Hébert, Pascal Fua

Main category: cs.CV

TL;DR: TEMDiff is a 3D diffusion-based framework that reconstructs 3D shapes from limited-angle TEM projections by learning structural priors from FIB-SEM data, achieving superior performance without requiring clean TEM ground truth.

Details

Motivation: Limited-angle electron tomography suffers from missing-wedge artifacts, and existing deep learning methods require large high-quality training datasets with known 3D ground truth that are difficult to obtain in electron microscopy.

Method: Proposed TEMDiff - a 3D diffusion-based iterative reconstruction framework trained on readily available volumetric FIB-SEM data using a simulator that maps them to TEM tilt series, enabling learning of realistic structural priors without clean TEM ground truth.

Result: TEMDiff outperforms state-of-the-art methods on simulated datasets with limited angular coverage, generalizes well to real-world TEM tilts under different conditions, and recovers accurate structures from tilt ranges as narrow as 8 degrees with 2-degree increments without retraining.

Conclusion: TEMDiff provides an effective solution for limited-angle electron tomography by leveraging diffusion models and readily available FIB-SEM data, eliminating the need for difficult-to-obtain clean TEM ground truth while achieving superior reconstruction quality.

Abstract: Limited-angle electron tomography aims to reconstruct 3D shapes from 2D projections of Transmission Electron Microscopy (TEM) within a restricted range and number of tilting angles, but it suffers from the missing-wedge problem that causes severe reconstruction artifacts. Deep learning approaches have shown promising results in alleviating these artifacts, yet they typically require large high-quality training datasets with known 3D ground truth which are difficult to obtain in electron microscopy. To address these challenges, we propose TEMDiff, a novel 3D diffusion-based iterative reconstruction framework. Our method is trained on readily available volumetric FIB-SEM data using a simulator that maps them to TEM tilt series, enabling the model to learn realistic structural priors without requiring clean TEM ground truth. By operating directly on 3D volumes, TEMDiff implicitly enforces consistency across slices without the need for additional regularization. On simulated electron tomography datasets with limited angular coverage, TEMDiff outperforms state-of-the-art methods in reconstruction quality. We further demonstrate that a trained TEMDiff model generalizes well to real-world TEM tilts obtained under different conditions and can recover accurate structures from tilt ranges as narrow as 8 degrees, with 2-degree increments, without any retraining or fine-tuning.

[198] VUGEN: Visual Understanding priors for GENeration

Xiangyi Chen, Théophane Vallaeys, Maha Elbayad, John Nguyen, Jakob Verbeek

Main category: cs.CV

TL;DR: VUGEN is a novel framework that leverages Vision-Language Models’ visual understanding priors for high-quality image generation without complex bridging mechanisms or reconstruction-oriented autoencoders.

Details

Motivation: Existing approaches for equipping VLMs with image generation capabilities often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations or architectural complexity.

Method: VUGEN transforms the VLM’s high-dimensional latent space into a lower-dimensional tractable distribution, trains the VLM to sample within this reduced latent space, and uses a dedicated pixel decoder (VAE-free pixel diffusion decoder) to map generated latents back to image space.

Result: VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM’s original understanding capabilities.

Conclusion: The proposed VUGEN framework successfully bridges visual understanding and generation in VLMs through efficient latent space transformation and sampling, outperforming existing methods while maintaining understanding capabilities.

Abstract: Recent advances in Vision-Language Models (VLMs) have enabled unified understanding across text and images, yet equipping these models with robust image generation capabilities remains challenging. Existing approaches often rely on reconstruction-oriented autoencoders or complex bridging mechanisms, leading to misalignment between understanding and generation representations, or architectural complexity. In this work, we propose VUGEN, a novel framework that explicitly leverages VLM’s pretrained visual understanding priors for efficient and high-quality image generation. Our approach first transforms the high-dimensional latent space of the VLM’s native vision encoder into a lower-dimensional, tractable distribution that maximally preserves visual information. The VLM is then trained to sample within this reduced latent space, ensuring alignment with its visual understanding capabilities. Finally, a dedicated pixel decoder maps these generated latents back to the image space. We find that a VAE-free pixel diffusion decoder to be on par or better than commonly used complex latent diffusion decoders that internally rely on VAE latents. Extensive experiments demonstrate that VUGEN achieves superior image generation performance, improving DPG Bench from 71.17 to 74.32 and FID from 11.86 to 9.06 on COCO, while fully preserving the VLM’s original understanding capabilities.

[199] Cluster Paths: Navigating Interpretability in Neural Networks

Nicholas M. Kroeger, Vincent Bindschaedler

Main category: cs.CV

TL;DR: Cluster paths is a post-hoc interpretability method that clusters activations at selected layers and represents inputs as sequences of cluster IDs, providing human-readable explanations for neural network decisions.

Details

Motivation: Modern deep neural networks are opaque in their decision processes, risking unwarranted trust, undetected biases and unexpected failures, creating a need for interpretability methods.

Method: Cluster activations at selected layers and represent each input as its sequence of cluster IDs. Extend to concept paths using large language models on minimal path divergences. Use four metrics: path complexity, weighted-path purity, decision-alignment faithfulness, and path agreement.

Result: In spurious-cue CIFAR-10, cluster paths identify color-based shortcuts. On CelebA hair-color task: 90% faithfulness, 96% agreement under noise without accuracy loss. Scales to Vision Transformers on ImageNet. Effective OOD detector flagging anomalies before over-confident predictions.

Conclusion: Cluster paths scale to large vision models while generating concise, human-readable explanations that uncover visual concepts like color palettes, textures, and object contexts at multiple network depths.

Abstract: While modern deep neural networks achieve impressive performance in vision tasks, they remain opaque in their decision processes, risking unwarranted trust, undetected biases and unexpected failures. We propose cluster paths, a post-hoc interpretability method that clusters activations at selected layers and represents each input as its sequence of cluster IDs. To assess these cluster paths, we introduce four metrics: path complexity (cognitive load), weighted-path purity (class alignment), decision-alignment faithfulness (predictive fidelity), and path agreement (stability under perturbations). In a spurious-cue CIFAR-10 experiment, cluster paths identify color-based shortcuts and collapse when the cue is removed. On a five-class CelebA hair-color task, they achieve 90% faithfulness and maintain 96% agreement under Gaussian noise without sacrificing accuracy. Scaling to a Vision Transformer pretrained on ImageNet, we extend cluster paths to concept paths derived from prompting a large language model on minimal path divergences. Finally, we show that cluster paths can serve as an effective out-of-distribution (OOD) detector, reliably flagging anomalous samples before the model generates over-confident predictions. Cluster paths uncover visual concepts, such as color palettes, textures, or object contexts, at multiple network depths, demonstrating that cluster paths scale to large vision models while generating concise and human-readable explanations.

[200] HSNet: Heterogeneous Subgraph Network for Single Image Super-resolution

Qiongyang Hu, Wenyang Liu, Wenbin Zou, Yuejiao Su, Lap-Pui Chau, Yi Wang

Main category: cs.CV

TL;DR: HSNet is a graph-based image super-resolution framework that uses heterogeneous subgraphs to achieve better structural flexibility while maintaining computational efficiency through node sampling and subgraph decomposition.

Details

Motivation: Existing CNN and attention-based super-resolution methods lack structural flexibility, while graph-based approaches suffer from high computational complexity. There's a need for a method that combines the representational adaptability of graphs with computational feasibility.

Method: Proposes Heterogeneous Subgraph Network (HSNet) with three key components: Constructive Subgraph Set Block (CSSB) generates diverse complementary subgraphs, Subgraph Aggregation Block (SAB) integrates subgraph representations, and Node Sampling Strategy (NSS) reduces computational overhead by selecting salient features.

Result: Extensive experiments show HSNet achieves state-of-the-art performance, effectively balancing reconstruction quality with computational efficiency.

Conclusion: HSNet successfully addresses the limitations of existing approaches by decomposing global graphs into manageable sub-components, providing both structural flexibility and computational efficiency for image super-resolution.

Abstract: Existing deep learning approaches for image super-resolution, particularly those based on CNNs and attention mechanisms, often suffer from structural inflexibility. Although graph-based methods offer greater representational adaptability, they are frequently impeded by excessive computational complexity. To overcome these limitations, this paper proposes the Heterogeneous Subgraph Network (HSNet), a novel framework that efficiently leverages graph modeling while maintaining computational feasibility. The core idea of HSNet is to decompose the global graph into manageable sub-components. First, we introduce the Constructive Subgraph Set Block (CSSB), which generates a diverse set of complementary subgraphs. Rather than relying on a single monolithic graph, CSSB captures heterogeneous characteristics of the image by modeling different relational patterns and feature interactions, producing a rich ensemble of both local and global graph structures. Subsequently, the Subgraph Aggregation Block (SAB) integrates the representations embedded across these subgraphs. Through adaptive weighting and fusion of multi-graph features, SAB constructs a comprehensive and discriminative representation that captures intricate interdependencies. Furthermore, a Node Sampling Strategy (NSS) is designed to selectively retain the most salient features, thereby enhancing accuracy while reducing computational overhead. Extensive experiments demonstrate that HSNet achieves state-of-the-art performance, effectively balancing reconstruction quality with computational efficiency. The code will be made publicly available.

[201] Through the Perspective of LiDAR: A Feature-Enriched and Uncertainty-Aware Annotation Pipeline for Terrestrial Point Cloud Segmentation

Fei Zhang, Rob Chancia, Josie Clapp, Amirhossein Hassanzadeh, Dimah Dera, Richard MacKenzie, Jan van Aardt

Main category: cs.CV

TL;DR: A semi-automated pipeline for semantic segmentation of TLS point clouds that reduces manual annotation effort through uncertainty-aware ensemble learning and targeted annotation, with applications to mangrove forest monitoring.

Details

Motivation: Manual annotation of terrestrial laser scanning (TLS) point clouds is costly and time-consuming, limiting the scalability of semantic segmentation for ecological monitoring applications like mangrove forests.

Method: Projects 3D points to 2D spherical grid, enriches with multi-source features, trains ensemble of segmentation networks to generate pseudo-labels and uncertainty maps, uses uncertainty to guide annotation of ambiguous regions, and back-projects to 3D with visualization tools.

Result: Performance saturates after ~12 annotated scans, geometric features contribute most, compact 9-channel feature stacks capture nearly all discriminative power with mIoU plateauing at 0.76, and method generalizes to other datasets (ForestSemantic, Semantic3D).

Conclusion: The proposed pipeline enables scalable, high-quality TLS point cloud segmentation with reduced annotation effort, demonstrated through the Mangrove3D dataset and cross-dataset validation, providing empirical guidance for ecological monitoring applications.

Abstract: Accurate semantic segmentation of terrestrial laser scanning (TLS) point clouds is limited by costly manual annotation. We propose a semi-automated, uncertainty-aware pipeline that integrates spherical projection, feature enrichment, ensemble learning, and targeted annotation to reduce labeling effort, while sustaining high accuracy. Our approach projects 3D points to a 2D spherical grid, enriches pixels with multi-source features, and trains an ensemble of segmentation networks to produce pseudo-labels and uncertainty maps, the latter guiding annotation of ambiguous regions. The 2D outputs are back-projected to 3D, yielding densely annotated point clouds supported by a three-tier visualization suite (2D feature maps, 3D colorized point clouds, and compact virtual spheres) for rapid triage and reviewer guidance. Using this pipeline, we build Mangrove3D, a semantic segmentation TLS dataset for mangrove forests. We further evaluate data efficiency and feature importance to address two key questions: (1) how much annotated data are needed and (2) which features matter most. Results show that performance saturates after ~12 annotated scans, geometric features contribute the most, and compact nine-channel stacks capture nearly all discriminative power, with the mean Intersection over Union (mIoU) plateauing at around 0.76. Finally, we confirm the generalization of our feature-enrichment strategy through cross-dataset tests on ForestSemantic and Semantic3D. Our contributions include: (i) a robust, uncertainty-aware TLS annotation pipeline with visualization tools; (ii) the Mangrove3D dataset; and (iii) empirical guidance on data efficiency and feature importance, thus enabling scalable, high-quality segmentation of TLS point clouds for ecological monitoring and beyond. The dataset and processing scripts are publicly available at https://fz-rit.github.io/through-the-lidars-eye/.

[202] Improving Artifact Robustness for CT Deep Learning Models Without Labeled Artifact Images via Domain Adaptation

Justin Cheung, Samuel Savine, Calvin Nguyen, Lin Lu, Alhassan S. Yasin

Main category: cs.CV

TL;DR: Domain adaptation using DANN maintains CT image classification accuracy when ring artifacts are introduced, without needing labeled artifact data, achieving performance comparable to models trained with labeled artifacts.

Details

Motivation: Deep learning models degrade when applied to new CT scanner distributions with artifacts not seen during training. Labeling new artifact distributions is costly, so domain adaptation offers a more accessible alternative.

Method: Simulated ring artifacts from detector gain error in sinogram space, evaluated domain adversarial neural networks (DANN) against baseline and augmentation approaches on OrganAMNIST abdominal CT dataset.

Result: Baseline models failed on ring artifact images, traditional augmentation provided no improvement. DANN maintained high classification accuracy using only unlabeled artifact data, achieving performance comparable to models trained with labeled artifacts.

Conclusion: Domain adaptation effectively addresses distribution shift in medical imaging without expensive expert labeling, showing promise for clinical deployment where novel artifacts may emerge.

Abstract: Deep learning models which perform well on images from their training distribution can degrade substantially when applied to new distributions. If a CT scanner introduces a new artifact not present in the training labels, the model may misclassify the images. Although modern CT scanners include design features which mitigate these artifacts, unanticipated or difficult-to-mitigate artifacts can still appear in practice. The direct solution of labeling images from this new distribution can be costly. As a more accessible alternative, this study evaluates domain adaptation as an approach for training models that maintain classification performance despite new artifacts, even without corresponding labels. We simulate ring artifacts from detector gain error in sinogram space and evaluate domain adversarial neural networks (DANN) against baseline and augmentation-based approaches on the OrganAMNIST abdominal CT dataset. Our results demonstrate that baseline models trained only on clean images fail to generalize to images with ring artifacts, and traditional augmentation with other distortion types provides no improvement on unseen artifact domains. In contrast, the DANN approach successfully maintains high classification accuracy on ring artifact images using only unlabeled artifact data during training, demonstrating the viability of domain adaptation for artifact robustness. The domain-adapted model achieved classification performance on ring artifact test data comparable to models explicitly trained with labeled artifact images, while also showing unexpected generalization to uniform noise. These findings provide empirical evidence that domain adaptation can effectively address distribution shift in medical imaging without requiring expensive expert labeling of new artifact distributions, suggesting promise for deployment in clinical settings where novel artifacts may emerge.

[203] Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou

Main category: cs.CV

TL;DR: MingTok introduces continuous visual tokenization to unify vision understanding and generation, overcoming quantization errors in discrete methods. Ming-UniVision uses this to handle diverse vision-language tasks through autoregressive prediction.

Details

Motivation: Existing visual tokenizers use discrete latent spaces that cause quantization errors, limiting semantic expressiveness and degrading vision-language understanding capabilities.

Method: MingTok uses a three-stage sequential architecture: low-level encoding, semantic expansion, and visual reconstruction. Ming-UniVision builds on this to unify vision-language tasks under a single autoregressive prediction paradigm in continuous space.

Result: The unified continuous visual representation reconciles competing requirements of understanding and generation tasks, achieving state-of-the-art performance across both domains.

Conclusion: Continuous visual tokenization effectively unifies visual understanding and generation, supporting multi-round, in-context tasks like iterative understanding, generation, and editing.

Abstract: Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregrsssive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit community.

[204] Adaptive Stain Normalization for Cross-Domain Medical Histology

Tianyue Xu, Yanlin Wu, Abhai K. Tripathi, Matthew M. Ippolito, Benjamin D. Haeffele

Main category: cs.CV

TL;DR: Proposes BeerLaNet, a trainable color normalization model based on physics-inspired NMF unrolling that addresses domain shift in digital pathology by extracting stain-invariant structural information, outperforming state-of-the-art methods.

Details

Motivation: Color variability in digital pathology due to different staining protocols and imaging conditions causes domain shift, reducing deep learning model performance when deployed on data from different conditions than training data. Existing color normalization methods have drawbacks like introducing artifacts or requiring careful template selection.

Method: Trainable color normalization model integrated with backbone networks, derived via algorithmic unrolling of nonnegative matrix factorization (NMF) based on Beer-Lambert law physics. Extracts stain-invariant structural information from pathology images for downstream tasks like object detection and classification.

Result: Evaluated on public pathology datasets and internal malaria blood smears for cross-domain object detection and classification. Outperformed many state-of-the-art stain normalization methods.

Conclusion: The proposed BeerLaNet effectively addresses color variability challenges in digital pathology through physics-inspired trainable normalization, improving model robustness across different imaging conditions and outperforming existing methods.

Abstract: Deep learning advances have revolutionized automated digital pathology analysis. However, differences in staining protocols and imaging conditions can introduce significant color variability. In deep learning, such color inconsistency often reduces performance when deploying models on data acquired under different conditions from the training data, a challenge known as domain shift. Many existing methods attempt to address this problem via color normalization but suffer from several notable drawbacks such as introducing artifacts or requiring careful choice of a template image for stain mapping. To address these limitations, we propose a trainable color normalization model that can be integrated with any backbone network for downstream tasks such as object detection and classification. Based on the physics of the imaging process per the Beer-Lambert law, our model architecture is derived via algorithmic unrolling of a nonnegative matrix factorization (NMF) model to extract stain-invariant structural information from the original pathology images, which serves as input for further processing. Experimentally, we evaluate the method on publicly available pathology datasets and an internally curated collection of malaria blood smears for cross-domain object detection and classification, where our method outperforms many state-of-the-art stain normalization methods. Our code is available at https://github.com/xutianyue/BeerLaNet.

[205] SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin

Main category: cs.CV

TL;DR: SDQM is a new metric for evaluating synthetic dataset quality in object detection tasks without requiring model training, showing strong correlation with mAP scores.

Details

Motivation: Addressing the challenge of evaluating synthetic data quality efficiently, as current methods require costly iterative training and show weak correlation with model performance.

Method: Introduces Synthetic Dataset Quality Metric (SDQM) that assesses data quality for object detection without model training convergence, enabling efficient dataset selection.

Result: SDQM demonstrated strong correlation with YOLOv11 mAP scores, outperforming previous metrics that showed only moderate or weak correlations.

Conclusion: SDQM provides a scalable and efficient standard for synthetic data evaluation, offering actionable insights for dataset improvement while minimizing training costs.

Abstract: The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. This paper introduces the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean Average Precision (mAP) scores of YOLOv11, a leading object detection model, while previous metrics only exhibited moderate or weak correlations. Additionally, it provides actionable insights for improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM

[206] AIM 2025 Challenge on Real-World RAW Image Denoising

Feiran Li, Jiacheng Li, Marcos V. Conde, Beril Besbinar, Vlad Hosu, Daisuke Iso, Radu Timofte

Main category: cs.CV

TL;DR: The AIM 2025 Real-World RAW Image Denoising Challenge advances efficient denoising techniques using data synthesis, featuring a new benchmark with low-light images from five DSLR cameras.

Details

Motivation: To advance camera-agnostic low-light RAW image denoising trained on synthetic data and promote robust models for practical applications in digital photography and autonomous driving.

Method: Participants develop novel noise synthesis pipelines, network architectures, and training methodologies to achieve high performance across different camera models.

Result: Winners are determined based on performance metrics including full-reference (PSNR, SSIM, LPIPS) and non-reference (ARNIQA, TOPIQ) measures.

Conclusion: The competition promotes development of practical denoising models and is expected to influence domains from image restoration to night-time autonomous driving.

Abstract: We introduce the AIM 2025 Real-World RAW Image Denoising Challenge, aiming to advance efficient and effective denoising techniques grounded in data synthesis. The competition is built upon a newly established evaluation benchmark featuring challenging low-light noisy images captured in the wild using five different DSLR cameras. Participants are tasked with developing novel noise synthesis pipelines, network architectures, and training methodologies to achieve high performance across different camera models. Winners are determined based on a combination of performance metrics, including full-reference measures (PSNR, SSIM, LPIPS), and non-reference ones (ARNIQA, TOPIQ). By pushing the boundaries of camera-agnostic low-light RAW image denoising trained on synthetic data, the competition promotes the development of robust and practical models aligned with the rapid progress in digital photography. We expect the competition outcomes to influence multiple domains, from image restoration to night-time autonomous driving.

[207] Self-supervised Physics-guided Model with Implicit Representation Regularization for Fast MRI Reconstruction

Jingran Xu, Yuanyuan Liu, Yanjie Zhu

Main category: cs.CV

TL;DR: UnrollINR is a zero-shot self-supervised MRI reconstruction framework that combines physics-guided unrolled iterative reconstruction with Implicit Neural Representation as regularization, achieving superior performance at high acceleration rates without external training data.

Details

Motivation: MRI scan times are prolonged, limiting widespread clinical use. While deep learning methods show promise, they often require fully sampled data which is difficult to obtain. There's a need for scan-specific reconstruction methods that don't rely on external training datasets.

Method: Proposes UnrollINR framework combining physics-guided unrolled iterative reconstruction architecture with Implicit Neural Representation (INR) as regularization prior. Uses zero-shot self-supervised learning without external training data.

Result: Achieves superior reconstruction performance compared to supervised learning methods, even at high acceleration rate of 10. Validates the method’s superiority in scan-specific MRI reconstruction.

Conclusion: The combination of deep unrolled structure with INR’s implicit representation capability enhances both interpretability and reconstruction performance, providing an effective solution for fast MRI reconstruction without requiring external training datasets.

Abstract: Magnetic Resonance Imaging (MRI) is a vital clinical diagnostic tool, yet its widespread application is limited by prolonged scan times. Fast MRI reconstruction techniques effectively reduce acquisition duration by reconstructing high-fidelity MR images from undersampled k-space data. In recent years, deep learning-based methods have demonstrated remarkable progress in this field, with self-supervised and unsupervised learning approaches proving particularly valuable in scenarios where fully sampled data are difficult to obtain. This paper proposes a novel zero-shot self-supervised reconstruction framework named UnrollINR, which enables scan-specific MRI reconstruction without relying on external training data. The method adopts a physics-guided unrolled iterative reconstruction architecture and introduces Implicit Neural Representation (INR) as a regularization prior to effectively constrain the solution space. By combining a deep unrolled structure with the powerful implicit representation capability of INR, the model’s interpretability and reconstruction performance are enhanced. Experimental results demonstrate that even at a high acceleration rate of 10, UnrollINR achieves superior reconstruction performance compared to the supervised learning method, validating the superiority of the proposed method.

[208] A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages

Zibo Su, Kun Wei, Jiahua Li, Xu Yang, Cheng Deng

Main category: cs.CV

TL;DR: MuEx is a multilingual talking face synthesis framework that uses phonemes and visemes as universal intermediaries to bridge audio and video modalities, achieving superior performance across multiple languages and zero-shot generalization to unseen languages.

Details

Motivation: Current talking face synthesis models perform well in English but poorly in non-English languages due to English-dominated training datasets and lack of cross-language generalization abilities.

Method: Proposes Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture using phonemes and visemes as basic units, with Phoneme-Viseme Alignment Mechanism (PV-Align) for cross-modal synchronization. Also builds Multilingual Talking Face Benchmark (MTFB) with 12 languages and 95.04 hours of videos.

Result: Extensive experiments show MuEx achieves superior performance across all languages in MTFB and exhibits effective zero-shot generalization to unseen languages without additional training.

Conclusion: MuEx successfully addresses multilingual talking face synthesis challenges by using phonemes and visemes as universal intermediaries, achieving lifelike facial animations across diverse languages with strong generalization capabilities.

Abstract: Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but unsatisfactorily in non-English languages, producing wrong mouth shapes and rigid facial expressions. The terrible performance is caused by the English-dominated training datasets and the lack of cross-language generalization abilities. Thus, we propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture that employs phonemes and visemes as universal intermediaries to bridge audio and video modalities, achieving lifelike multilingual TFS. To alleviate the influence of linguistic differences and dataset bias, we extract audio and video features as phonemes and visemes respectively, which are the basic units of speech sounds and mouth movements. To address audiovisual synchronization issues, we introduce the Phoneme-Viseme Alignment Mechanism (PV-Align), which establishes robust cross-modal correspondences between phonemes and visemes. In addition, we build a Multilingual Talking Face Benchmark (MTFB) comprising 12 diverse languages with 95.04 hours of high-quality videos for training and evaluating multilingual TFS performance. Extensive experiments demonstrate that MuEx achieves superior performance across all languages in MTFB and exhibits effective zero-shot generalization to unseen languages without additional training.

[209] MSITrack: A Challenging Benchmark for Multispectral Single Object Tracking

Tao Feng, Tingfa Xu, Haolin Qin, Tianhao Li, Shuaihao Han, Xuyang Zou, Zhan Lv, Jianan Li

Main category: cs.CV

TL;DR: MSITrack is the largest multispectral single object tracking dataset with 300 videos, 129k frames, and 55 object categories, designed to address RGB tracking limitations through enhanced spectral data.

Details

Motivation: RGB-based trackers struggle with real-world challenges like occlusion, similar object interference, and complex backgrounds. Limited multispectral tracking datasets hinder progress in this area.

Method: Created MSITrack dataset with 300 videos across 55 categories, featuring challenging attributes, natural scenes, and meticulous manual annotation with multi-stage verification.

Result: Extensive evaluations show multispectral data in MSITrack significantly improves tracking performance over RGB-only baselines.

Conclusion: MSITrack provides a comprehensive benchmark that demonstrates the value of multispectral data for object tracking and will drive future advancements in the field.

Abstract: Visual object tracking in real-world scenarios presents numerous challenges including occlusion, interference from similar objects and complex backgrounds-all of which limit the effectiveness of RGB-based trackers. Multispectral imagery, which captures pixel-level spectral reflectance, enhances target discriminability. However, the availability of multispectral tracking datasets remains limited. To bridge this gap, we introduce MSITrack, the largest and most diverse multispectral single object tracking dataset to date. MSITrack offers the following key features: (i) More Challenging Attributes-including interference from similar objects and similarity in color and texture between targets and backgrounds in natural scenarios, along with a wide range of real-world tracking challenges; (ii) Richer and More Natural Scenes-spanning 55 object categories and 300 distinct natural scenes, MSITrack far exceeds the scope of existing benchmarks. Many of these scenes and categories are introduced to the multispectral tracking domain for the first time; (iii) Larger Scale-300 videos comprising over 129k frames of multispectral imagery. To ensure annotation precision, each frame has undergone meticulous processing, manual labeling and multi-stage verification. Extensive evaluations using representative trackers demonstrate that the multispectral data in MSITrack significantly improves performance over RGB-only baselines, highlighting its potential to drive future advancements in the field. The MSITrack dataset is publicly available at: https://github.com/Fengtao191/MSITrack.

[210] StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

Zhihao Wen, Wenkang Wei, Yuan Fang, Xingtong Yu, Hui Zhang, Weicheng Zhu, Xin Zhang

Main category: cs.CV

TL;DR: StaR-KVQA improves implicit-knowledge visual question answering by supervising structured reasoning traces with dual symbolic relation paths and natural-language explanations, enabling transparent reasoning without external knowledge sources.

Details

Motivation: MLLMs lack explicit reasoning supervision and produce inconsistent justifications in IK-KVQA, leading to poor generalization after standard supervised fine-tuning.

Method: Constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision. Uses no external retrievers, verifiers, or curated KBs.

Result: Achieves up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline while exhibiting robust cross-domain generalization.

Conclusion: StaR-KVQA improves both accuracy and interpretability in IK-KVQA by making reasoning transparent and verifiable through structured reasoning traces.

Abstract: Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source, without external retrieval. Yet, MLLMs lack explicit reasoning supervision and produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces - dual symbolic relation paths plus path-grounded natural-language explanations - so that reasoning becomes transparent and verifiable. With one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline while exhibiting robust cross-domain generalization.

[211] Automated Neural Architecture Design for Industrial Defect Detection

Yuxi Liu, Yunfeng Ma, Yi Tang, Min Liu, Shuai Jiang, Yaonan Wang

Main category: cs.CV

TL;DR: AutoNAD is an automated neural architecture design framework for surface defect detection that searches over convolutions, transformers, and MLPs to address intraclass differences and interclass similarity challenges.

Details

Motivation: Existing manual methods for surface defect detection require extensive trial and error and struggle with intraclass differences and interclass similarity. An automated approach is needed to reduce manual design costs while effectively handling these challenges.

Method: Proposes AutoNAD framework that jointly searches over convolutions, transformers, and MLPs using cross weight sharing strategy for efficient training and searchable multi-level feature aggregation module for multi-scale learning. Incorporates latency-aware prior for runtime efficiency.

Result: Validated on three industrial defect datasets and applied within a defect imaging and detection platform. The framework addresses both intraclass differences and interclass similarity while reducing manual network design costs.

Conclusion: AutoNAD provides an effective automated solution for industrial surface defect detection that captures both local variations and long-range context, with practical deployment considerations through latency-aware architecture selection.

Abstract: Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code will be available at https://github.com/Yuxi104/AutoNAD.

[212] Heptapod: Language Modeling on Visual Signals

Yongxin Zhu, Jiawei Chen, Yuanzhe Chen, Zhuo Chen, Dongya Jia, Jian Cong, Xiaobin Zhuang, Yuping Wang, Yuxuan Wang

Main category: cs.CV

TL;DR: Heptapod is an image autoregressive model that uses causal attention, eliminates CFG reliance, and avoids semantic tokenizers. It introduces next 2D distribution prediction to unify sequential modeling with holistic self-supervised learning.

Details

Motivation: To rethink language modeling principles for visual signals by developing a more principled approach that avoids common practices like CFG and semantic tokenizers.

Method: Uses causal Transformer with reconstruction-focused visual tokenizer and next 2D distribution prediction, which predicts the distribution over the entire 2D spatial grid at each timestep.

Result: Achieves FID of 2.70 on ImageNet generation benchmark, significantly outperforming previous causal autoregressive approaches.

Conclusion: The work demonstrates a principled approach to visual language modeling and hopes to inspire rethinking of language modeling for visual signals and beyond.

Abstract: We introduce Heptapod, an image autoregressive model that adheres to the foundational principles of language modeling. Heptapod employs \textbf{causal attention}, \textbf{eliminates reliance on CFG}, and \textbf{eschews the trend of semantic tokenizers}. Our key innovation is \textit{next 2D distribution prediction}: a causal Transformer with reconstruction-focused visual tokenizer, learns to predict the distribution over the entire 2D spatial grid of images at each timestep. This learning objective unifies the sequential modeling of autoregressive framework with the holistic self-supervised learning of masked autoencoding, enabling the model to capture comprehensive image semantics via generative training. On the ImageNet generation benchmark, Heptapod achieves an FID of $2.70$, significantly outperforming previous causal autoregressive approaches. We hope our work inspires a principled rethinking of language modeling on visual signals and beyond.

[213] DreamOmni2: Multimodal Instruction-based Editing and Generation

Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia

Main category: cs.CV

TL;DR: DreamOmni2 addresses limitations in current image editing/generation by introducing multimodal instruction-based editing and generation tasks that support both text/image instructions and handle both concrete/abstract concepts.

Details

Motivation: Current instruction-based image editing relies only on language which fails to capture specific details, while subject-driven generation only handles concrete objects, missing abstract concepts. Both approaches have practical limitations.

Method: Proposes data synthesis pipeline with feature mixing for concept extraction, generates multimodal editing training data, and introduces index encoding with position encoding shift to handle multi-image input without pixel confusion. Uses joint training with VLM.

Result: DreamOmni2 achieves impressive results on the proposed multimodal instruction-based editing and generation tasks, with comprehensive benchmarks established for future development.

Conclusion: The proposed approach successfully addresses limitations of current methods by supporting multimodal instructions and handling both concrete and abstract concepts, significantly enhancing practical applications in image editing and generation.

Abstract: Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based editing. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. In addition, we have proposed comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 has achieved impressive results. Models and codes will be released.

[214] Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion

Jie Luo, Yuxuan Jiang, Xin Jin, Mingyu Liu, Yihui Fan

Main category: cs.CV

TL;DR: Proposes a multimodal semantic segmentation method (Mlpfseg) that fuses light field and LiDAR point cloud data with feature completion and depth perception modules to improve segmentation under occlusion.

Details

Motivation: Semantic segmentation faces challenges under complex conditions like occlusion. Light field and LiDAR provide complementary visual and spatial cues, but their integration is hindered by limited viewpoint diversity and modality discrepancies.

Method: Created a multimodal dataset with light field and point cloud data. Proposed Mlpfseg network with two modules: feature completion (differential reconstruction of point-cloud feature maps to address density mismatch) and depth perception (reinforcing attention scores for better occlusion awareness).

Result: Outperforms image-only segmentation by 1.71 mIoU and point cloud-only segmentation by 2.38 mIoU, demonstrating effectiveness in multimodal fusion.

Conclusion: The proposed multimodal fusion approach with feature completion and depth perception modules successfully improves semantic segmentation performance, particularly for handling occlusion challenges in autonomous driving scenarios.

Abstract: Semantic segmentation serves as a cornerstone of scene understanding in autonomous driving but continues to face significant challenges under complex conditions such as occlusion. Light field and LiDAR modalities provide complementary visual and spatial cues that are beneficial for robust perception; however, their effective integration is hindered by limited viewpoint diversity and inherent modality discrepancies. To address these challenges, the first multimodal semantic segmentation dataset integrating light field data and point cloud data is proposed. Based on this dataset, we proposed a multi-modal light field point-cloud fusion segmentation network(Mlpfseg), incorporating feature completion and depth perception to segment both camera images and LiDAR point clouds simultaneously. The feature completion module addresses the density mismatch between point clouds and image pixels by performing differential reconstruction of point-cloud feature maps, enhancing the fusion of these modalities. The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness. Our method outperforms image-only segmentation by 1.71 Mean Intersection over Union(mIoU) and point cloud-only segmentation by 2.38 mIoU, demonstrating its effectiveness.

[215] SCas4D: Structural Cascaded Optimization for Boosting Persistent 4D Novel View Synthesis

Jipeng Lyu, Jiahua Dong, Yu-Xiong Wang

Main category: cs.CV

TL;DR: SCas4D is a cascaded optimization framework that uses hierarchical deformation patterns in 3D Gaussian Splatting for efficient dynamic scene modeling, achieving comparable results with 20x fewer training iterations.

Details

Motivation: Persistent dynamic scene modeling is challenging due to the difficulty of capturing accurate deformations while maintaining computational efficiency. Real-world deformations often exhibit hierarchical patterns that can be leveraged for optimization.

Method: A cascaded optimization framework that progressively refines deformations from coarse part-level to fine point-level, leveraging structural patterns in 3D Gaussian Splatting where groups of Gaussians share similar transformations.

Result: Achieves convergence within 100 iterations per time frame and produces results comparable to existing methods with only one-twentieth of the training iterations. Effective in self-supervised articulated object segmentation, novel view synthesis, and dense point tracking.

Conclusion: SCas4D provides an efficient approach for dynamic scene modeling by exploiting hierarchical deformation patterns, significantly reducing computational requirements while maintaining performance across multiple tasks.

Abstract: Persistent dynamic scene modeling for tracking and novel-view synthesis remains challenging due to the difficulty of capturing accurate deformations while maintaining computational efficiency. We propose SCas4D, a cascaded optimization framework that leverages structural patterns in 3D Gaussian Splatting for dynamic scenes. The key idea is that real-world deformations often exhibit hierarchical patterns, where groups of Gaussians share similar transformations. By progressively refining deformations from coarse part-level to fine point-level, SCas4D achieves convergence within 100 iterations per time frame and produces results comparable to existing methods with only one-twentieth of the training iterations. The approach also demonstrates effectiveness in self-supervised articulated object segmentation, novel view synthesis, and dense point tracking tasks.

[216] Evaluating LLMs for Historical Document OCR: A Methodological Framework for Digital Humanities

Maria Levchenko

Main category: cs.CV

TL;DR: The paper presents an evaluation framework for LLM-based OCR in historical document digitization, addressing temporal biases and period-specific errors that traditional metrics miss.

Details

Motivation: Digital humanities scholars lack appropriate evaluation frameworks for LLM-based OCR, as traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation.

Method: Developed evaluation methodology using 18th-century Russian Civil font texts, introducing novel metrics (Historical Character Preservation Rate and Archaic Insertion Rate) with protocols for contamination control and stability testing. Evaluated 12 multimodal LLMs.

Result: Gemini and Qwen models outperform traditional OCR but exhibit over-historicization - inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance.

Conclusion: The methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.

Abstract: Digital humanities scholars increasingly use Large Language Models for historical document digitization, yet lack appropriate evaluation frameworks for LLM-based OCR. Traditional metrics fail to capture temporal biases and period-specific errors crucial for historical corpus creation. We present an evaluation methodology for LLM-based historical OCR, addressing contamination risks and systematic biases in diplomatic transcription. Using 18th-century Russian Civil font texts, we introduce novel metrics including Historical Character Preservation Rate (HCPR) and Archaic Insertion Rate (AIR), alongside protocols for contamination control and stability testing. We evaluate 12 multimodal LLMs, finding that Gemini and Qwen models outperform traditional OCR while exhibiting over-historicization: inserting archaic characters from incorrect historical periods. Post-OCR correction degrades rather than improves performance. Our methodology provides digital humanities practitioners with guidelines for model selection and quality assessment in historical corpus digitization.

[217] DeRainMamba: A Frequency-Aware State Space Model with Detail Enhancement for Image Deraining

Zhiliang Zhu, Tao Zeng, Tao Yang, Guoliang Luo, Jiyong Zeng

Main category: cs.CV

TL;DR: DeRainMamba integrates frequency-aware state-space modeling and multi-directional convolution for improved image deraining, achieving better performance with lower computational costs.

Details

Motivation: Existing Mamba-based models have limited ability to capture fine-grained details and lack frequency-domain awareness, restricting further improvements in image deraining.

Method: Proposes DeRainMamba with Frequency-Aware State-Space Module (FASSM) using Fourier transform to distinguish rain streaks from high-frequency details, and Multi-Directional Perception Convolution (MDPConv) for restoring local structures by capturing anisotropic gradient features.

Result: Extensive experiments on four public benchmarks show DeRainMamba consistently outperforms state-of-the-art methods in PSNR and SSIM metrics, while requiring fewer parameters and lower computational costs.

Conclusion: Combining frequency-domain modeling and spatial detail enhancement within a state-space framework is effective for single image deraining.

Abstract: Image deraining is crucial for improving visual quality and supporting reliable downstream vision tasks. Although Mamba-based models provide efficient sequence modeling, their limited ability to capture fine-grained details and lack of frequency-domain awareness restrict further improvements. To address these issues, we propose DeRainMamba, which integrates a Frequency-Aware State-Space Module (FASSM) and Multi-Directional Perception Convolution (MDPConv). FASSM leverages Fourier transform to distinguish rain streaks from high-frequency image details, balancing rain removal and detail preservation. MDPConv further restores local structures by capturing anisotropic gradient features and efficiently fusing multiple convolution branches. Extensive experiments on four public benchmarks demonstrate that DeRainMamba consistently outperforms state-of-the-art methods in PSNR and SSIM, while requiring fewer parameters and lower computational costs. These results validate the effectiveness of combining frequency-domain modeling and spatial detail enhancement within a state-space framework for single image deraining.

[218] OBS-Diff: Accurate Pruning For Diffusion Models in One-Shot

Junhan Zhu, Hesong Wang, Mingluo Su, Zefang Wang, Huan Wang

Main category: cs.CV

TL;DR: OBS-Diff is a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models by adapting Optimal Brain Surgeon (OBS) to diffusion model architectures with timestep-aware Hessian construction.

Details

Motivation: Large-scale text-to-image diffusion models suffer from prohibitive computational cost, and existing one-shot network pruning methods cannot be directly applied due to the iterative denoising nature of diffusion models.

Method: OBS-Diff revitalizes Optimal Brain Surgeon (OBS) for diffusion models, supports diverse pruning granularity, uses timestep-aware Hessian construction with logarithmic-decrease weighting, and employs group-wise sequential pruning strategy.

Result: Extensive experiments show OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.

Conclusion: OBS-Diff successfully bridges the gap between traditional pruning methods and diffusion models, providing an effective training-free compression solution for large-scale text-to-image diffusion models.

Abstract: Large-scale text-to-image diffusion models, while powerful, suffer from prohibitive computational cost. Existing one-shot network pruning methods can hardly be directly applied to them due to the iterative denoising nature of diffusion models. To bridge the gap, this paper presents OBS-Diff, a novel one-shot pruning framework that enables accurate and training-free compression of large-scale text-to-image diffusion models. Specifically, (i) OBS-Diff revitalizes the classic Optimal Brain Surgeon (OBS), adapting it to the complex architectures of modern diffusion models and supporting diverse pruning granularity, including unstructured, N:M semi-structured, and structured (MHA heads and FFN neurons) sparsity; (ii) To align the pruning criteria with the iterative dynamics of the diffusion process, by examining the problem from an error-accumulation perspective, we propose a novel timestep-aware Hessian construction that incorporates a logarithmic-decrease weighting scheme, assigning greater importance to earlier timesteps to mitigate potential error accumulation; (iii) Furthermore, a computationally efficient group-wise sequential pruning strategy is proposed to amortize the expensive calibration process. Extensive experiments show that OBS-Diff achieves state-of-the-art one-shot pruning for diffusion models, delivering inference acceleration with minimal degradation in visual quality.

[219] Transforming Noise Distributions with Histogram Matching: Towards a Single Denoiser for All

Sheng Fu, Junchao Zhang, Kailun Yang

Main category: cs.CV

TL;DR: A histogram matching approach transforms arbitrary noise to Gaussian distribution, enabling a single Gaussian denoiser to handle various out-of-distribution noises through a mutually reinforcing cycle between noise transformation and denoising.

Details

Motivation: Supervised Gaussian denoisers have limited generalization to out-of-distribution noise due to diverse noise distribution characteristics across different noise types.

Method: Propose histogram matching to transform arbitrary noise towards target Gaussian distribution, establish mutually reinforcing cycle between noise transformation and denoising, use local histogram matching for signal-dependent noise, intrapatch permutation for channel-related noise, and frequency-domain histogram matching with pixel-shuffle down-sampling for spatial correlation.

Result: Single Gaussian denoiser gains remarkable capability to handle various out-of-distribution noises including Poisson, salt-and-pepper, repeating pattern noises, and complex real-world noises.

Conclusion: Extensive experiments demonstrate superior generalization and effectiveness of the proposed method for handling diverse noise types with a single denoiser.

Abstract: Supervised Gaussian denoisers exhibit limited generalization when confronted with out-of-distribution noise, due to the diverse distributional characteristics of different noise types. To bridge this gap, we propose a histogram matching approach that transforms arbitrary noise towards a target Gaussian distribution with known intensity. Moreover, a mutually reinforcing cycle is established between noise transformation and subsequent denoising. This cycle progressively refines the noise to be converted, making it approximate the real noise, thereby enhancing the noise transformation effect and further improving the denoising performance. We tackle specific noise complexities: local histogram matching handles signal-dependent noise, intrapatch permutation processes channel-related noise, and frequency-domain histogram matching coupled with pixel-shuffle down-sampling breaks spatial correlation. By applying these transformations, a single Gaussian denoiser gains remarkable capability to handle various out-of-distribution noises, including synthetic noises such as Poisson, salt-and-pepper and repeating pattern noises, as well as complex real-world noises. Extensive experiments demonstrate the superior generalization and effectiveness of our method.

[220] A deep multiple instance learning approach based on coarse labels for high-resolution land-cover mapping

Gianmarco Perantoni, Lorenzo Bruzzone

Main category: cs.CV

TL;DR: Proposes a Deep Multiple Instance Learning method to train high-resolution land-cover classifiers using weak low-resolution labels, addressing label quantity and quality issues in remote sensing.

Details

Motivation: Address the problem of limited high-quality training labels for land-cover mapping by leveraging abundant but low-resolution reference data like MODIS-derived maps.

Method: Uses flexible pooling layers to link pixel-level semantics to patch-level weak labels, reframing MIL in multi-class and multi-label settings with Positive-Unlabeled Learning for multi-label cases.

Result: Experimental results on IEEE GRSS Data Fusion Contest dataset show the framework outperforms standard training strategies.

Conclusion: The proposed DMIL approach effectively trains high-resolution land-cover classifiers using only weak low-resolution supervision, solving the label scarcity problem.

Abstract: The quantity and the quality of the training labels are central problems in high-resolution land-cover mapping with machine-learning-based solutions. In this context, weak labels can be gathered in large quantities by leveraging on existing low-resolution or obsolete products. In this paper, we address the problem of training land-cover classifiers using high-resolution imagery (e.g., Sentinel-2) and weak low-resolution reference data (e.g., MODIS -derived land-cover maps). Inspired by recent works in Deep Multiple Instance Learning (DMIL), we propose a method that trains pixel-level multi-class classifiers and predicts low-resolution labels (i.e., patch-level classification), where the actual high-resolution labels are learned implicitly without direct supervision. This is achieved with flexible pooling layers that are able to link the semantics of the pixels in the high-resolution imagery to the low-resolution reference labels. Then, the Multiple Instance Learning (MIL) problem is re-framed in a multi-class and in a multi-label setting. In the former, the low-resolution annotation represents the majority of the pixels in the patch. In the latter, the annotation only provides us information on the presence of one of the land-cover classes in the patch and thus multiple labels can be considered valid for a patch at a time, whereas the low-resolution labels provide us only one label. Therefore, the classifier is trained with a Positive-Unlabeled Learning (PUL) strategy. Experimental results on the 2020 IEEE GRSS Data Fusion Contest dataset show the effectiveness of the proposed framework compared to standard training strategies.

[221] TTRV: Test-Time Reinforcement Learning for Vision Language Models

Akshit Singh, Shyam Marjit, Wei Lin, Paul Gavrikov, Serena Yeung-Levy, Hilde Kuehne, Rogerio Feris, Sivan Doveh, James Glass, M. Jehanzeb Mirza

Main category: cs.CV

TL;DR: TTRV enhances vision language models through test-time reinforcement learning without labeled data, using frequency-based rewards and entropy control to improve object recognition and VQA performance.

Details

Motivation: Existing RL methods require labeled data and training splits, unlike human learning which adapts directly from the environment. The goal is to enable models to learn at inference time without labeled data.

Method: Enhances GRPO framework by designing rewards based on output frequency and controlling output diversity through entropy minimization. Performs multiple inferences on each test sample during adaptation.

Result: Achieves up to 52.4% improvement in object recognition and 29.8% in VQA, with average boosts of 24.6% and 10.0% across 16 datasets. TTRV on InternVL 8B surpasses GPT-4o by 2.3% on image recognition benchmarks.

Conclusion: Test-time reinforcement learning can match or exceed proprietary models, works even in extremely data-constrained scenarios with single unlabeled examples, and demonstrates practical viability for real-world applications.

Abstract: Existing methods for extracting reward signals in Reinforcement Learning typically rely on labeled data and dedicated training splits, a setup that contrasts with how humans learn directly from their environment. In this work, we propose TTRV to enhance vision language understanding by adapting the model on the fly at inference time, without the need for any labeled data. Concretely, we enhance the Group Relative Policy Optimization (GRPO) framework by designing rewards based on the frequency of the base model’s output, while inferring on each test sample multiple times. Further, we also propose to control the diversity of the model’s output by simultaneously rewarding the model for obtaining low entropy of the output empirical distribution. Our approach delivers consistent gains across both object recognition and visual question answering (VQA), with improvements of up to 52.4% and 29.8%, respectively, and average boosts of 24.6% and 10.0% across 16 datasets.Remarkably, on image recognition, TTRV applied to InternVL 8B surpasses GPT-4o by an average of 2.3% over 8 benchmarks, while remaining highly competitive on VQA, demonstrating that test-time reinforcement learning can match or exceed the strongest proprietary models. Finally, we find many interesting properties of test-time RL for VLMs: for example, even in extremely data-constrained scenarios, where adaptation is performed on a single randomly chosen unlabeled test example, TTRV still yields non-trivial improvements of up to 5.5% in recognition tasks.

[222] Extreme Amodal Face Detection

Changlin Song, Yunzhong Hou, Michael Randall Barnes, Rahul Shome, Dylan Campbell

Main category: cs.CV

TL;DR: A sample-free, heatmap-based approach for extreme amodal face detection that uses contextual cues from single images to infer unseen faces outside the frame.

Details

Motivation: Addressing safety and privacy applications by detecting objects not visible in the input image but present in an expanded field-of-view, focusing on face detection as a motivating class.

Method: Heatmap-based extreme amodal object detector with selective coarse-to-fine decoder that efficiently predicts out-of-frame regions from single images using contextual cues.

Result: Establishes strong performance for extreme amodal detection, even outperforming less efficient generative approaches on this new task.

Conclusion: Proposed method provides an efficient, sample-free solution for single-image extreme amodal detection that effectively leverages contextual information without requiring image sequences or generative models.

Abstract: Extreme amodal detection is the task of inferring the 2D location of objects that are not fully visible in the input image but are visible within an expanded field-of-view. This differs from amodal detection, where the object is partially visible within the input image, but is occluded. In this paper, we consider the sub-problem of face detection, since this class provides motivating applications involving safety and privacy, but do not tailor our method specifically to this class. Existing approaches rely on image sequences so that missing detections may be interpolated from surrounding frames or make use of generative models to sample possible completions. In contrast, we consider the single-image task and propose a more efficient, sample-free approach that makes use of the contextual cues from the image to infer the presence of unseen faces. We design a heatmap-based extreme amodal object detector that addresses the problem of efficiently predicting a lot (the out-of-frame region) from a little (the image) with a selective coarse-to-fine decoder. Our method establishes strong results for this new task, even outperforming less efficient generative approaches.

[223] VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

Teng Wang, Haojun Jiang, Yuxuan Wang, Zhenguo Sun, Shiji Song, Gao Huang

Main category: cs.CV

TL;DR: This paper proposes a parameter-efficient Vision-Action Adapter (VA-Adapter) that enables ultrasound foundation models to provide real-time probe guidance for junior sonographers to acquire high-quality cardiac ultrasound images, addressing the shortage of skilled personnel.

Details

Motivation: There is a shortage of highly skilled sonographers for cardiac ultrasound due to high operational difficulty, which hinders timely patient examinations. While ultrasound foundation models show strong capabilities in image analysis, obtaining quality images is a prerequisite for accurate diagnosis.

Method: The authors design a parameter-efficient Vision-Action Adapter (VA-Adapter) that enables foundation models’ image encoders to encode vision-action sequences, allowing the model to learn from past explorations like human experts. This adapter provides sequential reasoning capabilities while fine-tuning only a small subset of parameters.

Result: Extensive experiments demonstrate that the VA-Adapter can surpass strong probe guidance models in performance.

Conclusion: The proposed VA-Adapter successfully adapts medical knowledge from foundation models to the probe guidance task, providing an effective solution for real-time operational assistance to junior sonographers in acquiring high-quality ultrasound images.

Abstract: Echocardiography is a critical tool for detecting heart diseases. Recently, ultrasound foundation models have demonstrated remarkable capabilities in cardiac ultrasound image analysis. However, obtaining high-quality ultrasound images is a prerequisite for accurate diagnosis. Due to the exceptionally high operational difficulty of cardiac ultrasound, there is a shortage of highly skilled personnel, which hinders patients from receiving timely examination services. In this paper, we aim to adapt the medical knowledge learned by foundation models from vast datasets to the probe guidance task, which is designed to provide real-time operational recommendations for junior sonographers to acquire high-quality ultrasound images. Moreover, inspired by the practice where experts optimize action decisions based on past explorations, we meticulously design a parameter-efficient Vision-Action Adapter (VA-Adapter) to enable foundation model’s image encoder to encode vision-action sequences, thereby enhancing guidance performance. With built-in sequential reasoning capabilities in a compact design, the VA-Adapter enables a pre-trained ultrasound foundation model to learn precise probe adjustment strategies by fine-tuning only a small subset of parameters. Extensive experiments demonstrate that the VA-Adapter can surpass strong probe guidance models. Our code will be released after acceptance.

[224] Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking

Mitchell Keren Taraday, Shahaf Wagner, Chaim Baskin

Main category: cs.CV

TL;DR: EDJE is an efficient vision-language reranker that precomputes and compresses visual tokens offline, enabling fast joint encoding with minimal online computation while maintaining strong retrieval performance.

Details

Motivation: Existing vision-language joint encoders like BLIP are bottlenecked by expensive visual feature extraction, making them impractical for large-scale deployment. There's a lack of efficient rerankers comparable to those in text retrieval.

Method: EDJE precomputes vision tokens offline and compresses them using a lightweight attention-based adapter. Online inference runs only a compact joint encoder over a small set of visual tokens plus text.

Result: EDJE achieves 50k image-text pairs/second throughput with only 49kB storage per image, matching state-of-the-art performance on Flickr (zero-shot) and COCO (fine-tuned) retrieval benchmarks.

Conclusion: EDJE enables practical deployment of vision-language rerankers at scale by drastically reducing storage and computational requirements while preserving retrieval performance.

Abstract: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision–language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image–text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.

[225] StyleKeeper: Prevent Content Leakage using Negative Visual Query Guidance

Jaeseok Jeong, Junho Kim, Gayoung Lee, Yunjey Choi, Youngjung Uh

Main category: cs.CV

TL;DR: A new method called negative visual query guidance (NVQG) is proposed to reduce content leakage in visual style prompting for text-to-image generation, achieving better style transfer while preserving text prompt alignment.

Details

Motivation: Existing visual prompting methods in text-to-image generation suffer from content leakage, where unwanted elements from visual style prompts are transferred along with the intended style.

Method: Extends classifier-free guidance with swapping self-attention and proposes negative visual query guidance (NVQG) that uses negative scores by simulating content leakage scenarios through query swapping in self-attention layers.

Result: The method significantly reduces content leakage and demonstrates superiority over existing approaches across various styles and text prompts, effectively reflecting reference styles while matching text prompts.

Conclusion: NVQG provides a simple yet effective solution for reducing content leakage in visual style prompting, with additional solutions for using real images as visual style prompts.

Abstract: In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements of the visual style prompt are transferred along with the intended style. To address this issue, we 1) extend classifier-free guidance (CFG) to utilize swapping self-attention and propose 2) negative visual query guidance (NVQG) to reduce the transfer of unwanted contents. NVQG employs negative score by intentionally simulating content leakage scenarios that swap queries instead of key and values of self-attention layers from visual style prompts. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as visual style prompts. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references, and ensuring that resulting images match the text prompts. Our code is available \href{https://github.com/naver-ai/StyleKeeper}{here}.

[226] Lattice-allocated Real-time Line Segment Feature Detection and Tracking Using Only an Event-based Camera

Mikihiro Ikura, Arren Glover, Masayoshi Mizuno, Chiara Bartolozzi

Main category: cs.CV

TL;DR: Real-time line segment detection and tracking using only a high-resolution event-based camera, achieving higher accuracy than state-of-the-art methods without needing frame cameras.

Details

Motivation: Event-based cameras efficiently capture geometric features in human-made environments but existing methods either require additional frame cameras or struggle with high event rates.

Method: A lattice-allocated pipeline with velocity-invariant event representation, line segment detection using fitting scores, and line segment tracking through endpoint perturbation.

Result: Demonstrated real-time performance and higher accuracy compared to state-of-the-art event-only and event-frame hybrid baselines on both custom and public datasets.

Conclusion: Enables fully stand-alone event camera operation in real-world settings by providing efficient line segment extraction without additional hardware.

Abstract: Line segment extraction is effective for capturing geometric features of human-made environments. Event-based cameras, which asynchronously respond to contrast changes along edges, enable efficient extraction by reducing redundant data. However, recent methods often rely on additional frame cameras or struggle with high event rates. This research addresses real-time line segment detection and tracking using only a modern, high-resolution (i.e., high event rate) event-based camera. Our lattice-allocated pipeline consists of (i) velocity-invariant event representation, (ii) line segment detection based on a fitting score, (iii) and line segment tracking by perturbating endpoints. Evaluation using ad-hoc recorded dataset and public datasets demonstrates real-time performance and higher accuracy compared to state-of-the-art event-only and event-frame hybrid baselines, enabling fully stand-alone event camera operation in real-world settings.

[227] Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization

Kanglei Zhou, Qingyi Pan, Xingxing Zhang, Hubert P. H. Shum, Frederick W. B. Li, Xiaohui Liang, Liyuan Wang

Main category: cs.CV

TL;DR: The paper introduces Continual AQA (CAQA) to handle evolving quality distributions in action assessment, proposing MAGR++ method that combines full-parameter fine-tuning with adaptive manifold-aligned graph regularization to prevent catastrophic forgetting while maintaining performance.

Details

Motivation: Conventional AQA methods struggle with non-stationary quality distributions in real-world scenarios, limiting generalization. Continual Learning capabilities are needed to handle evolving distributions while avoiding catastrophic forgetting.

Method: Proposed MAGR++ method with full-parameter fine-tuning for effective representation learning, coupled with a two-step feature rectification pipeline: manifold projector to translate historical features and graph regularizer to align distributions.

Result: MAGR++ achieves state-of-the-art performance with average correlation gains of 3.6% offline and 12.2% online over strongest baselines across four CAQA benchmarks from three datasets.

Conclusion: The proposed MAGR++ effectively addresses the challenges of Continual AQA by balancing representation learning with forgetting mitigation through adaptive manifold alignment and graph regularization.

Abstract: Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation. A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios, which limits the generalization ability of conventional methods. We introduce Continual AQA (CAQA), which equips AQA with Continual Learning (CL) capabilities to handle evolving distributions while mitigating catastrophic forgetting. Although parameter-efficient fine-tuning of pretrained models has shown promise in CL for image classification, we find it insufficient for CAQA. Our empirical and theoretical analyses reveal two insights: (i) Full-Parameter Fine-Tuning (FPFT) is necessary for effective representation learning; yet (ii) uncontrolled FPFT induces overfitting and feature manifold shift, thereby aggravating forgetting. To address this, we propose Adaptive Manifold-Aligned Graph Regularization (MAGR++), which couples backbone fine-tuning that stabilizes shallow layers while adapting deeper ones with a two-step feature rectification pipeline: a manifold projector to translate deviated historical features into the current representation space, and a graph regularizer to align local and global distributions. We construct four CAQA benchmarks from three datasets with tailored evaluation protocols and strong baselines, enabling systematic cross-dataset comparison. Extensive experiments show that MAGR++ achieves state-of-the-art performance, with average correlation gains of 3.6% offline and 12.2% online over the strongest baseline, confirming its robustness and effectiveness. Our code is available at https://github.com/ZhouKanglei/MAGRPP.

[228] Explaining raw data complexity to improve satellite onboard processing

Adrien Dorise, Marjorie Bellizzi, Adrien Girard, Benjamin Francesconi, Stéphane May

Main category: cs.CV

TL;DR: This paper investigates using raw sensor data instead of preprocessed images for onboard satellite AI, finding that while performance is similar at low confidence, raw data models struggle with object boundaries at high confidence levels.

Details

Motivation: With increasing processing power, deploying AI models directly onboard satellites is becoming feasible, but current solutions primarily use preprocessed data rather than raw sensor data.

Method: Introduced a simulation workflow to generate raw-like products from high-resolution L1 imagery, trained two object detection models (YOLOv11s and YOLOX-S) on both raw and L1 datasets, and compared performance using standard metrics and explainability tools.

Result: Both models performed similarly at low to medium confidence thresholds, but the model trained on raw data struggled with object boundary identification at high confidence levels.

Conclusion: Adapting AI architectures with improved contouring methods can enhance object detection on raw images, improving onboard AI for remote sensing.

Abstract: With increasing processing power, deploying AI models for remote sensing directly onboard satellites is becoming feasible. However, new constraints arise, mainly when using raw, unprocessed sensor data instead of preprocessed ground-based products. While current solutions primarily rely on preprocessed sensor images, few approaches directly leverage raw data. This study investigates the effects of utilising raw data on deep learning models for object detection and classification tasks. We introduce a simulation workflow to generate raw-like products from high-resolution L1 imagery, enabling systemic evaluation. Two object detection models (YOLOv11s and YOLOX-S) are trained on both raw and L1 datasets, and their performance is compared using standard detection metrics and explainability tools. Results indicate that while both models perform similarly at low to medium confidence thresholds, the model trained on raw data struggles with object boundary identification at high confidence levels. It suggests that adapting AI architectures with improved contouring methods can enhance object detection on raw images, improving onboard AI for remote sensing.

[229] HARP-NeXt: High-Speed and Accurate Range-Point Fusion Network for 3D LiDAR Semantic Segmentation

Samir Abou Haidar, Alexandre Chariot, Mehdi Darouich, Cyril Joly, Jean-Emmanuel Deschaud

Main category: cs.CV

TL;DR: HARP-NeXt is a high-speed LiDAR semantic segmentation network that achieves superior speed-accuracy trade-off through novel pre-processing, efficient feature extraction, and multi-scale fusion, running 24x faster than top methods while maintaining comparable accuracy.

Details

Motivation: Address the trade-off between accuracy and speed in LiDAR semantic segmentation, where existing methods are either accurate but slow (point-based/sparse convolution) or fast but lose geometric information (projection-based), while also reducing computational overhead from pre-processing and avoiding test-time augmentation.

Method: Proposes novel pre-processing methodology to reduce computational overhead, Conv-SE-NeXt feature extraction blocks for efficient representation capture without deep stacking, and multi-scale range-point fusion backbone to preserve geometric details at multiple abstraction levels.

Result: Achieves superior speed-accuracy trade-off on nuScenes and SemanticKITTI benchmarks, comparable to top-ranked PTv3 without ensemble models or TTA, while running 24x faster.

Conclusion: HARP-NeXt successfully addresses the speed-accuracy trade-off in LiDAR semantic segmentation through efficient pre-processing and architecture design, making it suitable for resource-constrained embedded systems in autonomous vehicles and mobile robots.

Abstract: LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24$\times$ faster. The code is available at https://github.com/SamirAbouHaidar/HARP-NeXt

[230] Lung Infection Severity Prediction Using Transformers with Conditional TransMix Augmentation and Cross-Attention

Bouthaina Slika, Fadi Dornaika, Fares Bougourzi, Karim Hammoudi

Main category: cs.CV

TL;DR: A novel AI method for predicting lung infection severity from CT scans and chest X-rays using a Transformer-based architecture with cross-gated attention and custom data augmentation to handle dataset imbalance.

Details

Motivation: Lung infections like pneumonia pose serious health risks that can escalate rapidly, especially during pandemics. Accurate AI-based severity prediction from medical imaging is essential for timely clinical decisions and optimized patient outcomes.

Method: QCross-Att-PVT: Transformer-based architecture with parallel encoders, cross-gated attention mechanism, and feature aggregator to capture multi-scale features. Conditional Online TransMix: Custom data augmentation strategy to address dataset imbalance by generating mixed-label image patches during training.

Result: Outperforms several state-of-the-art deep learning models on two benchmark datasets (RALO CXR and Per-COVID-19 CT). Emphasizes the critical role of data augmentation and gated attention in improving robustness and predictive accuracy.

Conclusion: Provides a reliable, adaptable tool to support clinical diagnosis, disease monitoring, and personalized treatment planning. The approach demonstrates effectiveness across both CT scans and chest X-rays for lung infection severity assessment.

Abstract: Lung infections, particularly pneumonia, pose serious health risks that can escalate rapidly, especially during pandemics. Accurate AI-based severity prediction from medical imaging is essential to support timely clinical decisions and optimize patient outcomes. In this work, we present a novel method applicable to both CT scans and chest X-rays for assessing lung infection severity. Our contributions are twofold: (i) QCross-Att-PVT, a Transformer-based architecture that integrates parallel encoders, a cross-gated attention mechanism, and a feature aggregator to capture rich multi-scale features; and (ii) Conditional Online TransMix, a custom data augmentation strategy designed to address dataset imbalance by generating mixed-label image patches during training. Evaluated on two benchmark datasets, RALO CXR and Per-COVID-19 CT, our method consistently outperforms several state-of-the-art deep learning models. The results emphasize the critical role of data augmentation and gated attention in improving both robustness and predictive accuracy. This approach offers a reliable, adaptable tool to support clinical diagnosis, disease monitoring, and personalized treatment planning. The source code of this work is available at https://github.com/bouthainas/QCross-Att-PVT.

[231] Label-frugal satellite image change detection with generative virtual exemplar learning

Hichem Sahbi

Main category: cs.CV

TL;DR: A novel active learning approach for change detection in remote sensing that selects the most critical unlabeled samples for labeling using an invertible graph convnet and adversarial loss.

Details

Motivation: Existing deep learning methods for change detection rely heavily on hand-labeled training data, which is expensive and subjective. Active learning can reduce labeling costs by selecting only the most informative samples.

Method: Proposes an active learning framework with an invertible graph convnet that generates virtual exemplars. Uses adversarial loss to measure representativity, diversity, and ambiguity of samples, selecting those that most challenge current change detection criteria.

Result: Extensive experiments show the proposed label-efficient learning model outperforms comparative methods, demonstrating positive impact on change detection performance.

Conclusion: The proposed active learning approach effectively reduces labeling costs while maintaining high change detection performance by strategically selecting the most critical samples for annotation.

Abstract: Change detection is a major task in remote sensing which consists in finding all the occurrences of changes in multi-temporal satellite or aerial images. The success of existing methods, and particularly deep learning ones, is tributary to the availability of hand-labeled training data that capture the acquisition conditions and the subjectivity of the user (oracle). In this paper, we devise a novel change detection algorithm, based on active learning. The main contribution of our work resides in a new model that measures how important is each unlabeled sample, and provides an oracle with only the most critical samples (also referred to as virtual exemplars) for further labeling. These exemplars are generated, using an invertible graph convnet, as the optimum of an adversarial loss that (i) measures representativity, diversity and ambiguity of the data, and thereby (ii) challenges (the most) the current change detection criteria, leading to a better re-estimate of these criteria in the subsequent iterations of active learning. Extensive experiments show the positive impact of our label-efficient learning model against comparative methods.

[232] IAR2: Improving Autoregressive Visual Generation with Semantic-Detail Associated Token Prediction

Ran Yi, Teng Hu, Zihan Su, Lizhuang Ma

Main category: cs.CV

TL;DR: IAR2 is an advanced autoregressive framework that improves image generation through hierarchical semantic-detail synthesis using a dual codebook system and progressive attention-guided conditioning.

Details

Motivation: To overcome limitations of previous autoregressive models that overlook structural properties of visual data and are constrained by rigid pre-trained codebooks and inaccurate hard clustering.

Method: Uses Semantic-Detail Associated Dual Codebook to decouple image representations, Semantic-Detail Autoregressive Prediction with Local-Context Enhanced Autoregressive Head, and Progressive Attention-Guided Adaptive CFG for conditional generation.

Result: Achieves state-of-the-art performance with FID of 1.50 on ImageNet, surpassing previous methods while demonstrating superior computational efficiency.

Conclusion: The structured, coarse-to-fine generation strategy effectively enhances autoregressive image generation performance and expressiveness.

Abstract: Autoregressive models have emerged as a powerful paradigm for visual content creation, but often overlook the intrinsic structural properties of visual data. Our prior work, IAR, initiated a direction to address this by reorganizing the visual codebook based on embedding similarity, thereby improving generation robustness. However, it is constrained by the rigidity of pre-trained codebooks and the inaccuracies of hard, uniform clustering. To overcome these limitations, we propose IAR2, an advanced autoregressive framework that enables a hierarchical semantic-detail synthesis process. At the core of IAR2 is a novel Semantic-Detail Associated Dual Codebook, which decouples image representations into a semantic codebook for global semantic information and a detail codebook for fine-grained refinements. It expands the quantization capacity from a linear to a polynomial scale, significantly enhancing expressiveness. To accommodate this dual representation, we propose a Semantic-Detail Autoregressive Prediction scheme coupled with a Local-Context Enhanced Autoregressive Head, which performs hierarchical prediction-first the semantic token, then the detail token-while leveraging a local context window to enhance spatial coherence. Furthermore, for conditional generation, we introduce a Progressive Attention-Guided Adaptive CFG mechanism that dynamically modulates the guidance scale for each token based on its relevance to the condition and its temporal position in the generation sequence, improving conditional alignment without sacrificing realism. Extensive experiments demonstrate that IAR2 sets a new state-of-the-art for autoregressive image generation, achieving a FID of 1.50 on ImageNet. Our model not only surpasses previous methods in performance but also demonstrates superior computational efficiency, highlighting the effectiveness of our structured, coarse-to-fine generation strategy.

[233] OBJVanish: Physically Realizable Text-to-3D Adv. Generation of LiDAR-Invisible Objects

Bing Li, Wuqi Wang, Yanan Zhang, Jingzheng Li, Haigen Min, Wei Feng, Xingyu Zhao, Jie Zhang, Qing Guo

Main category: cs.CV

TL;DR: A text-to-3D adversarial generation method called Phy3DAdvGen creates physically realizable 3D objects that evade LiDAR detectors by systematically optimizing text prompts and using real object combinations.

Details

Motivation: Existing 3D adversarial attacks rarely cause complete object disappearance and are difficult to implement physically. There's a need for physically realizable attacks to thoroughly test LiDAR-based 3D object detectors in autonomous driving.

Method: Phy3DAdvGen systematically optimizes text prompts by refining verbs, objects, and poses to generate LiDAR-invisible pedestrians. It uses a constrained object pool of 13 real 3D models and combines insights from empirical studies on detection vulnerability factors.

Result: The approach successfully generates 3D pedestrians that evade six state-of-the-art LiDAR 3D detectors in both CARLA simulation and physical environments.

Conclusion: The method demonstrates significant vulnerabilities in safety-critical LiDAR detection systems and provides a physically realizable approach for adversarial testing.

Abstract: LiDAR-based 3D object detectors are fundamental to autonomous driving, where failing to detect objects poses severe safety risks. Developing effective 3D adversarial attacks is essential for thoroughly testing these detection systems and exposing their vulnerabilities before real-world deployment. However, existing adversarial attacks that add optimized perturbations to 3D points have two critical limitations: they rarely cause complete object disappearance and prove difficult to implement in physical environments. We introduce the text-to-3D adversarial generation method, a novel approach enabling physically realizable attacks that can generate 3D models of objects truly invisible to LiDAR detectors and be easily realized in the real world. Specifically, we present the first empirical study that systematically investigates the factors influencing detection vulnerability by manipulating the topology, connectivity, and intensity of individual pedestrian 3D models and combining pedestrians with multiple objects within the CARLA simulation environment. Building on the insights, we propose the physically-informed text-to-3D adversarial generation (Phy3DAdvGen) that systematically optimizes text prompts by iteratively refining verbs, objects, and poses to produce LiDAR-invisible pedestrians. To ensure physical realizability, we construct a comprehensive object pool containing 13 3D models of real objects and constrain Phy3DAdvGen to generate 3D objects based on combinations of objects in this set. Extensive experiments demonstrate that our approach can generate 3D pedestrians that evade six state-of-the-art (SOTA) LiDAR 3D detectors in both CARLA simulation and physical environments, thereby highlighting vulnerabilities in safety-critical applications.

[234] Generating Surface for Text-to-3D using 2D Gaussian Splatting

Huanning Dong, Fan Li, Ping Kuang, Jianwen Min

Main category: cs.CV

TL;DR: DirectGaussian is a novel Text-to-3D method that generates 3D object surfaces using surfels with 2D Gaussian splatting, incorporating curvature constraints for geometric consistency.

Details

Motivation: Current Text-to-3D methods struggle with complex geometric shapes in natural objects - existing approaches either use 2D diffusion priors or train on specific 3D representations, both having limitations.

Method: Uses surfels to represent 3D object surfaces, employs conditional text generation models with 2D Gaussian splatting for rendering, and incorporates curvature constraints during optimization to ensure multi-view geometric consistency.

Result: The framework achieves diverse and high-fidelity 3D content creation, as demonstrated through extensive experiments.

Conclusion: DirectGaussian provides an effective approach for Text-to-3D modeling by focusing on surface generation with proper geometric constraints, enabling better handling of complex natural object shapes.

Abstract: Recent advancements in Text-to-3D modeling have shown significant potential for the creation of 3D content. However, due to the complex geometric shapes of objects in the natural world, generating 3D content remains a challenging task. Current methods either leverage 2D diffusion priors to recover 3D geometry, or train the model directly based on specific 3D representations. In this paper, we propose a novel method named DirectGaussian, which focuses on generating the surfaces of 3D objects represented by surfels. In DirectGaussian, we utilize conditional text generation models and the surface of a 3D object is rendered by 2D Gaussian splatting with multi-view normal and texture priors. For multi-view geometric consistency problems, DirectGaussian incorporates curvature constraints on the generated surface during optimization process. Through extensive experiments, we demonstrate that our framework is capable of achieving diverse and high-fidelity 3D content creation.

[235] Learning Global Representation from Queries for Vectorized HD Map Construction

Shoumeng Qiu, Xinrun Li, Yang Long, Xiangyang Xue, Varun Ojha, Jian Pu

Main category: cs.CV

TL;DR: MapGR introduces global representation learning for HD map construction, using synergistic modules to align queries with global map context and improve instance detection performance.

Details

Motivation: Existing DETR-based approaches for HD map construction rely on independent object queries with local perspectives, neglecting the inherent global representation within HD maps.

Method: Proposes MapGR with two modules: Global Representation Learning (GRL) module that aligns query distribution with global map through holistic segmentation, and Global Representation Guidance (GRG) module that provides explicit global-level contextual information to individual queries.

Result: Substantial improvements in mean Average Precision (mAP) on nuScenes and Argoverse2 datasets compared to leading baselines.

Conclusion: Learning and utilizing global representations from queries significantly enhances HD map construction performance.

Abstract: The online construction of vectorized high-definition (HD) maps is a cornerstone of modern autonomous driving systems. State-of-the-art approaches, particularly those based on the DETR framework, formulate this as an instance detection problem. However, their reliance on independent, learnable object queries results in a predominantly local query perspective, neglecting the inherent global representation within HD maps. In this work, we propose \textbf{MapGR} (\textbf{G}lobal \textbf{R}epresentation learning for HD \textbf{Map} construction), an architecture designed to learn and utilize a global representations from queries. Our method introduces two synergistic modules: a Global Representation Learning (GRL) module, which encourages the distribution of all queries to better align with the global map through a carefully designed holistic segmentation task, and a Global Representation Guidance (GRG) module, which endows each individual query with explicit, global-level contextual information to facilitate its optimization. Evaluations on the nuScenes and Argoverse2 datasets validate the efficacy of our approach, demonstrating substantial improvements in mean Average Precision (mAP) compared to leading baselines.

[236] Addressing the ID-Matching Challenge in Long Video Captioning

Zhantao Yang, Huangji Wang, Ruili Feng, Han Zhang, Yuting Hu, Shangwen Zhu, Junyan Li, Yu Liu, Fan Cheng

Main category: cs.CV

TL;DR: The paper proposes RICE, a method that leverages Large Vision-Language Models’ inherent ID-Matching capabilities to improve long video captioning by enhancing image information usage and individual description quantity, achieving significant improvements in ID-Matching precision and recall.

Details

Motivation: Addressing the ID-Matching problem in long video captioning, where accurately recognizing the same individuals across different frames is critical but challenging, with few prior works focusing on this issue effectively.

Method: Built upon LVLMs to leverage their powerful priors, introduced a new benchmark for assessing ID-Matching capabilities, and proposed RICE method that enhances image information usage and increases individual description quantity.

Result: RICE significantly improves ID-Matching performance on GPT-4o, increasing precision from 50% to 90% and recall from 15% to 80% compared to baseline, enabling continuous tracking of different individuals in long video captions.

Conclusion: The proposed RICE method effectively unlocks LVLMs’ inherent ID-Matching capabilities, making it possible to continuously track individuals in long video captions and demonstrating superior performance in both caption quality and ID-Matching.

Abstract: Generating captions for long and complex videos is both critical and challenging, with significant implications for the growing fields of text-to-video generation and multi-modal understanding. One key challenge in long video captioning is accurately recognizing the same individuals who appear in different frames, which we refer to as the ID-Matching problem. Few prior works have focused on this important issue. Those that have, usually suffer from limited generalization and depend on point-wise matching, which limits their overall effectiveness. In this paper, unlike previous approaches, we build upon LVLMs to leverage their powerful priors. We aim to unlock the inherent ID-Matching capabilities within LVLMs themselves to enhance the ID-Matching performance of captions. Specifically, we first introduce a new benchmark for assessing the ID-Matching capabilities of video captions. Using this benchmark, we investigate LVLMs containing GPT-4o, revealing key insights that the performance of ID-Matching can be improved through two methods: 1) enhancing the usage of image information and 2) increasing the quantity of information of individual descriptions. Based on these insights, we propose a novel video captioning method called Recognizing Identities for Captioning Effectively (RICE). Extensive experiments including assessments of caption quality and ID-Matching performance, demonstrate the superiority of our approach. Notably, when implemented on GPT-4o, our RICE improves the precision of ID-Matching from 50% to 90% and improves the recall of ID-Matching from 15% to 80% compared to baseline. RICE makes it possible to continuously track different individuals in the captions of long videos.

[237] No MoCap Needed: Post-Training Motion Diffusion Models with Reinforcement Learning using Only Textual Prompts

Girolamo Macaluso, Lorenzo Mandelli, Mirko Bicchierai, Stefano Berretti, Andrew D. Bagdanov

Main category: cs.CV

TL;DR: A reinforcement learning framework that fine-tunes pretrained motion diffusion models using only textual prompts, enabling adaptation to new actions/styles without motion ground truth data.

Details

Motivation: Current diffusion models for human motion generation require costly retraining with motion capture data to adapt to unseen actions or styles, which is difficult to scale.

Method: Uses Reinforcement Learning with a pretrained text-motion retrieval network as reward signal, optimizing diffusion policy via Denoising Diffusion Policy Optimization without paired motion data.

Result: Consistently improves quality and diversity of generated motions in cross-dataset adaptation and leave-one-out experiments on HumanML3D and KIT-ML datasets, while preserving original distribution performance.

Conclusion: Provides a flexible, data-efficient, and privacy-preserving solution for motion adaptation that works with both latent- and joint-space diffusion architectures.

Abstract: Diffusion models have recently advanced human motion generation, producing realistic and diverse animations from textual prompts. However, adapting these models to unseen actions or styles typically requires additional motion capture data and full retraining, which is costly and difficult to scale. We propose a post-training framework based on Reinforcement Learning that fine-tunes pretrained motion diffusion models using only textual prompts, without requiring any motion ground truth. Our approach employs a pretrained text-motion retrieval network as a reward signal and optimizes the diffusion policy with Denoising Diffusion Policy Optimization, effectively shifting the model’s generative distribution toward the target domain without relying on paired motion data. We evaluate our method on cross-dataset adaptation and leave-one-out motion experiments using the HumanML3D and KIT-ML datasets across both latent- and joint-space diffusion architectures. Results from quantitative metrics and user studies show that our approach consistently improves the quality and diversity of generated motions, while preserving performance on the original distribution. Our approach is a flexible, data-efficient, and privacy-preserving solution for motion adaptation.

[238] Bayesian Modelling of Multi-Year Crop Type Classification Using Deep Neural Networks and Hidden Markov Models

Gianmarco Perantoni, Giulio Weikmann, Lorenzo Bruzzone

Main category: cs.CV

TL;DR: A novel approach combining deep learning with Bayesian modeling using Hidden Markov Models integrated with Transformer Encoder DNNs for temporal consistency in yearly land-cover classification.

Details

Motivation: To improve temporal consistency in yearly land-cover maps for better modeling of land cover evolution and change over years, particularly for multiyear crop type sequences.

Method: Combines deep learning with Bayesian modeling using Hidden Markov Models (HMMs) integrated with Transformer Encoder (TE) based DNNs to capture temporal correlations and crop type patterns.

Result: Validation on multiyear crop type classification dataset with 47 crop types and six years of Sentinel-2 data shows HMMs enhance overall performance and F1 scores.

Conclusion: The proposed approach effectively models temporal consistency in predicted labels, demonstrating the importance of temporal modeling for land-cover classification.

Abstract: The temporal consistency of yearly land-cover maps is of great importance to model the evolution and change of the land cover over the years. In this paper, we focus the attention on a novel approach to classification of yearly satellite image time series (SITS) that combines deep learning with Bayesian modelling, using Hidden Markov Models (HMMs) integrated with Transformer Encoder (TE) based DNNs. The proposed approach aims to capture both i) intricate temporal correlations in yearly SITS and ii) specific patterns in multiyear crop type sequences. It leverages the cascade classification of an HMM layer built on top of the TE, discerning consistent yearly crop-type sequences. Validation on a multiyear crop type classification dataset spanning 47 crop types and six years of Sentinel-2 acquisitions demonstrates the importance of modelling temporal consistency in the predicted labels. HMMs enhance the overall performance and F1 scores, emphasising the effectiveness of the proposed approach.

[239] U-Bench: A Comprehensive Understanding of U-Net through 100-Variant Benchmarking

Fenghe Tang, Chengqi Dong, Wenxin Ma, Zikang Xu, Heqin Zhu, Zihang Jiang, Rongsheng Wang, Yuhao Wang, Chenxu Wu, Shaohua Kevin Zhou

Main category: cs.CV

TL;DR: U-Bench is the first comprehensive benchmark evaluating 100 U-Net variants across 28 datasets and 10 imaging modalities, addressing gaps in statistical validation and efficiency considerations in medical image segmentation.

Details

Motivation: Despite U-Net's dominance in medical image segmentation for over a decade, there has been no comprehensive benchmark to systematically evaluate performance and utility of thousands of U-shaped variants, due to insufficient statistical validation and limited consideration of efficiency and generalization.

Method: U-Bench evaluates models along three dimensions: statistical robustness, zero-shot generalization, and computational efficiency. It introduces a novel U-Score metric that captures performance-efficiency trade-off and provides a model advisor agent for selecting suitable models.

Result: The benchmark systematically analyzes 100 U-Net variants across diverse datasets and imaging modalities, providing key findings on dataset characteristics and architectural paradigms’ impact on performance.

Conclusion: U-Bench establishes a foundation for fair, reproducible, and practically relevant benchmarking in U-Net-based segmentation models, exposing gaps in previous evaluations and enabling community extension through publicly available code, models, and protocols.

Abstract: Over the past decade, U-Net has been the dominant architecture in medical image segmentation, leading to the development of thousands of U-shaped variants. Despite its widespread adoption, there is still no comprehensive benchmark to systematically evaluate their performance and utility, largely because of insufficient statistical validation and limited consideration of efficiency and generalization across diverse datasets. To bridge this gap, we present U-Bench, the first large-scale, statistically rigorous benchmark that evaluates 100 U-Net variants across 28 datasets and 10 imaging modalities. Our contributions are threefold: (1) Comprehensive Evaluation: U-Bench evaluates models along three key dimensions: statistical robustness, zero-shot generalization, and computational efficiency. We introduce a novel metric, U-Score, which jointly captures the performance-efficiency trade-off, offering a deployment-oriented perspective on model progress. (2) Systematic Analysis and Model Selection Guidance: We summarize key findings from the large-scale evaluation and systematically analyze the impact of dataset characteristics and architectural paradigms on model performance. Based on these insights, we propose a model advisor agent to guide researchers in selecting the most suitable models for specific datasets and tasks. (3) Public Availability: We provide all code, models, protocols, and weights, enabling the community to reproduce our results and extend the benchmark with future methods. In summary, U-Bench not only exposes gaps in previous evaluations but also establishes a foundation for fair, reproducible, and practically relevant benchmarking in the next decade of U-Net-based segmentation models. The project can be accessed at: https://fenghetan9.github.io/ubench. Code is available at: https://github.com/FengheTan9/U-Bench.

[240] Concept Retrieval – What and How?

Ori nizan, Oren Shrout, Ayellet Tal

Main category: cs.CV

TL;DR: This paper introduces a novel approach for retrieving images that share central concepts with a query image, focusing on underlying narratives rather than just visual or semantic similarity.

Details

Motivation: To go beyond conventional image retrieval methods that emphasize visual or semantic similarity, and instead capture the central concepts and underlying narrative of images.

Method: Proposes an approach based on two key observations: (1) neighbors in embedding space share at least one concept with query but not necessarily with each other, and (2) modeling the neighborhood with a bimodal Gaussian distribution reveals meaningful structure for concept identification.

Result: Qualitative, quantitative, and human evaluations confirm the effectiveness of the proposed approach for concept-based image retrieval.

Conclusion: The method successfully identifies and retrieves images sharing central concepts, providing a more nuanced approach than traditional similarity-based retrieval methods.

Abstract: A concept may reflect either a concrete or abstract idea. Given an input image, this paper seeks to retrieve other images that share its central concepts, capturing aspects of the underlying narrative. This goes beyond conventional retrieval or clustering methods, which emphasize visual or semantic similarity. We formally define the problem, outline key requirements, and introduce appropriate evaluation metrics. We propose a novel approach grounded in two key observations: (1) While each neighbor in the embedding space typically shares at least one concept with the query, not all neighbors necessarily share the same concept with one another. (2) Modeling this neighborhood with a bimodal Gaussian distribution uncovers meaningful structure that facilitates concept identification. Qualitative, quantitative, and human evaluations confirm the effectiveness of our approach. See the package on PyPI: https://pypi.org/project/coret/

[241] DADO: A Depth-Attention framework for Object Discovery

Federico Gonzalez, Estefania Talavera, Petia Radeva

Main category: cs.CV

TL;DR: DADO is a novel self-supervised model that combines attention mechanisms and depth models to discover objects in images without human annotations, using dynamic weighting to adapt between attention and depth features based on image characteristics.

Details

Motivation: Unsupervised object discovery remains a significant challenge in computer vision, as existing methods struggle with noisy attention maps and complex scenes with varying depth planes.

Method: DADO combines attention mechanisms with depth models and employs dynamic weighting to adaptively emphasize attention or depth features based on global image characteristics.

Result: DADO outperforms state-of-the-art methods on standard benchmarks for object discovery accuracy and robustness without requiring fine-tuning.

Conclusion: The proposed DADO model effectively addresses challenges in unsupervised object discovery by leveraging complementary attention and depth features through adaptive weighting.

Abstract: Unsupervised object discovery, the task of identifying and localizing objects in images without human-annotated labels, remains a significant challenge and a growing focus in computer vision. In this work, we introduce a novel model, DADO (Depth-Attention self-supervised technique for Discovering unseen Objects), which combines an attention mechanism and a depth model to identify potential objects in images. To address challenges such as noisy attention maps or complex scenes with varying depth planes, DADO employs dynamic weighting to adaptively emphasize attention or depth features based on the global characteristics of each image. We evaluated DADO on standard benchmarks, where it outperforms state-of-the-art methods in object discovery accuracy and robustness without the need for fine-tuning.

[242] Enhancing Concept Localization in CLIP-based Concept Bottleneck Models

Rémi Kazmierczak, Steve Azzolin, Eloïse Berthier, Goran Frehse, Gianni Franchi

Main category: cs.CV

TL;DR: This paper identifies concept hallucination in CLIP-based Concept Bottleneck Models and introduces CHILI to mitigate this issue by localizing concept pixels and improving explanation faithfulness.

Details

Motivation: To address the problem of concept hallucination in CLIP-based explainable AI systems, where CLIP incorrectly predicts concept presence/absence, undermining explanation faithfulness in Concept Bottleneck Models.

Method: Introduces CHILI (Concept Hallucination Inhibition via Localized Interpretability), which disentangles image embeddings and localizes pixels corresponding to target concepts to reduce hallucination.

Result: The approach mitigates concept hallucination in CLIP-based systems and supports generation of more interpretable saliency-based explanations.

Conclusion: CHILI effectively addresses concept hallucination in CLIP-based CBMs, improving explanation faithfulness and interpretability through better concept localization.

Abstract: This paper addresses explainable AI (XAI) through the lens of Concept Bottleneck Models (CBMs) that do not require explicit concept annotations, relying instead on concepts extracted using CLIP in a zero-shot manner. We show that CLIP, which is central in these techniques, is prone to concept hallucination, incorrectly predicting the presence or absence of concepts within an image in scenarios used in numerous CBMs, hence undermining the faithfulness of explanations. To mitigate this issue, we introduce Concept Hallucination Inhibition via Localized Interpretability (CHILI), a technique that disentangles image embeddings and localizes pixels corresponding to target concepts. Furthermore, our approach supports the generation of saliency-based explanations that are more interpretable.

Dongki Jung, Jaehoon Choi, Yonghan Lee, Sungmin Eum, Heesung Kwon, Dinesh Manocha

Main category: cs.CV

TL;DR: MoRe is a training-free monocular geometry refinement method that improves cross-view consistency and scale alignment using feature matching and graph-based optimization with local planar approximation.

Details

Motivation: Monocular 3D foundation models provide extensible solutions for perception tasks but suffer from scale ambiguity and cross-view inconsistency issues that limit their broader application in 3D vision.

Method: Uses feature matching between frames to establish correspondences, then applies graph-based optimization with local planar approximation using estimated 3D points and surface normals from monocular foundation models.

Result: The method addresses scale ambiguity while preserving 3D structure, enhances 3D reconstruction quality, and improves novel view synthesis especially in sparse view rendering scenarios.

Conclusion: MoRe provides an effective training-free approach for refining monocular geometry that improves both reconstruction and view synthesis without requiring additional training.

Abstract: Monocular 3D foundation models offer an extensible solution for perception tasks, making them attractive for broader 3D vision applications. In this paper, we propose MoRe, a training-free Monocular Geometry Refinement method designed to improve cross-view consistency and achieve scale alignment. To induce inter-frame relationships, our method employs feature matching between frames to establish correspondences. Rather than applying simple least squares optimization on these matched points, we formulate a graph-based optimization framework that performs local planar approximation using the estimated 3D points and surface normals estimated by monocular foundation models. This formulation addresses the scale ambiguity inherent in monocular geometric priors while preserving the underlying 3D structure. We further demonstrate that MoRe not only enhances 3D reconstruction but also improves novel view synthesis, particularly in sparse view rendering scenarios.

[244] Validation of Various Normalization Methods for Brain Tumor Segmentation: Can Federated Learning Overcome This Heterogeneity?

Jan Fiszer, Dominika Ciupek, Maciej Malawski

Main category: cs.CV

TL;DR: Federated learning shows resilience to non-IID data caused by different MRI normalization methods, achieving 92% Dice score for brain tumor segmentation comparable to centralized training while preserving data privacy.

Details

Motivation: Deep learning in medical imaging faces data privacy, storage, and transfer challenges. Federated learning addresses these but may be affected by non-IID data distributions, particularly from different MRI intensity normalization methods.

Method: Simulated non-IID conditions by applying different MRI intensity normalization techniques to separate data subsets, then trained and tested brain tumor segmentation models using federated learning approaches.

Result: FL methods demonstrated resilience to inconsistently normalized data, achieving 92% 3D Dice score comparable to centralized models. Different normalization methods significantly influence segmentation model performance.

Conclusion: Federated learning is an effective solution for training high-performing medical imaging models without violating data privacy, even when dealing with heterogeneous data from different normalization techniques.

Abstract: Deep learning (DL) has been increasingly applied in medical imaging, however, it requires large amounts of data, which raises many challenges related to data privacy, storage, and transfer. Federated learning (FL) is a training paradigm that overcomes these issues, though its effectiveness may be reduced when dealing with non-independent and identically distributed (non-IID) data. This study simulates non-IID conditions by applying different MRI intensity normalization techniques to separate data subsets, reflecting a common cause of heterogeneity. These subsets are then used for training and testing models for brain tumor segmentation. The findings provide insights into the influence of the MRI intensity normalization methods on segmentation models, both training and inference. Notably, the FL methods demonstrated resilience to inconsistently normalized data across clients, achieving the 3D Dice score of 92%, which is comparable to a centralized model (trained using all data). These results indicate that FL is a solution to effectively train high-performing models without violating data privacy, a crucial concern in medical applications. The code is available at: https://github.com/SanoScience/fl-varying-normalization.

[245] Graph Conditioned Diffusion for Controllable Histopathology Image Generation

Sarah Cechnicka, Matthew Baugh, Weitong Zhang, Mischa Dombrowski, Zhe Li, Johannes C. Paetzold, Candice Roufosse, Bernhard Kainz

Main category: cs.CV

TL;DR: The paper proposes Graph-Conditioned-Diffusion using graph-based object-level representations to enable fine-grained control in medical image generation, particularly for histopathology images.

Details

Motivation: Existing Diffusion Probabilistic Models (DPMs) operate in noisy latent spaces lacking semantic structure, making controlled generation challenging in sensitive medical imaging where structural consistency is critical for diagnosis.

Method: The approach generates graph nodes for each major structure in medical images, capturing individual features and relationships. These graph representations are processed by a transformer and integrated into diffusion models via text-conditioning mechanisms.

Result: Evaluation on real-world histopathology data shows that generated data can reliably substitute for annotated patient data in downstream segmentation tasks.

Conclusion: Graph-based object-level representations enable meaningful control over medical image generation, addressing the limitations of traditional DPMs in structured domains like medical imaging.

Abstract: Recent advances in Diffusion Probabilistic Models (DPMs) have set new standards in high-quality image synthesis. Yet, controlled generation remains challenging, particularly in sensitive areas such as medical imaging. Medical images feature inherent structure such as consistent spatial arrangement, shape or texture, all of which are critical for diagnosis. However, existing DPMs operate in noisy latent spaces that lack semantic structure and strong priors, making it difficult to ensure meaningful control over generated content. To address this, we propose graph-based object-level representations for Graph-Conditioned-Diffusion. Our approach generates graph nodes corresponding to each major structure in the image, encapsulating their individual features and relationships. These graph representations are processed by a transformer module and integrated into a diffusion model via the text-conditioning mechanism, enabling fine-grained control over generation. We evaluate this approach using a real-world histopathology use case, demonstrating that our generated data can reliably substitute for annotated patient data in downstream segmentation tasks. The code is available here.

[246] Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq

Main category: cs.CV

TL;DR: First benchmark for evaluating few-shot adaptation methods on Remote Sensing Vision-Language Models (RSVLMs), showing varying adaptation capabilities across models with similar zero-shot performance.

Details

Motivation: To explore the insufficiently studied generalization ability of RSVLMs in low-data regimes like few-shot learning, and provide a structured evaluation framework.

Method: Comprehensive experiments across 10 remote sensing datasets using 5 few-shot adaptation strategies on 3 state-of-the-art RSVLMs with different backbones.

Result: Models with similar zero-shot performance show markedly different few-shot adaptation behavior, with some RSVLMs being inherently more adaptable than others. No clear winner among existing methods.

Conclusion: Need for developing more robust few-shot adaptation methods tailored to remote sensing, with provided benchmarking framework and open-source code to facilitate future research.

Abstract: Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs

[247] Are We Using the Right Benchmark: An Evaluation Framework for Visual Token Compression Methods

Chenfei Liao, Wensong Wang, Zichen Wen, Xu Zheng, Yiyu Wang, Haocong He, Yuanhuiyi Lyu, Lutao Jiang, Xin Zou, Yuqian Fu, Bin Ren, Linfeng Zhang, Xuming Hu

Main category: cs.CV

TL;DR: Current benchmarks for evaluating visual token compression in MLLMs suffer from task mismatch, and simple image downsampling often outperforms advanced compression methods. The authors introduce VTC-Bench with data filtering to enable fairer evaluation.

Details

Motivation: Existing benchmarks designed for MLLM perception/reasoning assessment are not suitable for evaluating visual token compression methods, leading to unfair comparisons and task mismatch issues.

Method: Extensive experiments revealing that simple downsampling outperforms advanced compression methods, and developing VTC-Bench framework with data filtering mechanism to denoise existing benchmarks.

Result: Found that current benchmarks are noisy for visual token compression task, and downsampling can serve as a data filter to evaluate sample difficulty. Created VTC-Bench for fairer assessment.

Conclusion: Proposed VTC-Bench addresses the task mismatch in visual token compression evaluation by incorporating data filtering, enabling more accurate and fair comparison of compression methods.

Abstract: Recent endeavors to accelerate inference in Multimodal Large Language Models (MLLMs) have primarily focused on visual token compression. The effectiveness of these methods is typically assessed by measuring the accuracy drop on established benchmarks, comparing model performance before and after compression. However, these benchmarks are originally designed to assess the perception and reasoning capabilities of MLLMs, rather than to evaluate compression techniques. As a result, directly applying them to visual token compression introduces a task mismatch. Strikingly, our investigation reveals that simple image downsampling consistently outperforms many advanced compression methods across multiple widely used benchmarks. Through extensive experiments, we make the following observations: (i) Current benchmarks are noisy for the visual token compression task. (ii) Down-sampling is able to serve as a data filter to evaluate the difficulty of samples in the visual token compression task. Motivated by these findings, we introduce VTC-Bench, an evaluation framework that incorporates a data filtering mechanism to denoise existing benchmarks, thereby enabling fairer and more accurate assessment of visual token compression methods. All data and code are available at https://github.com/Chenfei-Liao/VTC-Bench.

[248] MV-Performer: Taming Video Diffusion Model for Faithful and Synchronized Multi-view Performer Synthesis

Yihao Zhi, Chenghong Li, Hongjie Liao, Xihe Yang, Zhengwentai Sun, Jiahao Chang, Xiaodong Cun, Wensen Feng, Xiaoguang Han

Main category: cs.CV

TL;DR: MV-Performer is a framework for generating synchronized 360-degree novel view videos from monocular human captures using camera-dependent normal maps and multi-view diffusion models.

Details

Motivation: Current video generation methods struggle with 360-degree viewpoint changes and primarily focus on front-view camera trajectory redirection, especially in human-centric scenarios.

Method: Uses MVHumanNet dataset with camera-dependent normal maps from oriented partial point clouds, and a multi-view human-centric video diffusion model that fuses reference video, partial rendering, and viewpoint information.

Result: Extensive experiments on three datasets demonstrate state-of-the-art effectiveness and robustness for human-centric 4D novel view synthesis.

Conclusion: MV-Performer sets a strong benchmark for human-centric 4D novel view synthesis with robust inference for in-the-wild videos.

Abstract: Recent breakthroughs in video generation, powered by large-scale datasets and diffusion techniques, have shown that video diffusion models can function as implicit 4D novel view synthesizers. Nevertheless, current methods primarily concentrate on redirecting camera trajectory within the front view while struggling to generate 360-degree viewpoint changes. In this paper, we focus on human-centric subdomain and present MV-Performer, an innovative framework for creating synchronized novel view videos from monocular full-body captures. To achieve a 360-degree synthesis, we extensively leverage the MVHumanNet dataset and incorporate an informative condition signal. Specifically, we use the camera-dependent normal maps rendered from oriented partial point clouds, which effectively alleviate the ambiguity between seen and unseen observations. To maintain synchronization in the generated videos, we propose a multi-view human-centric video diffusion model that fuses information from the reference video, partial rendering, and different viewpoints. Additionally, we provide a robust inference procedure for in-the-wild video cases, which greatly mitigates the artifacts induced by imperfect monocular depth estimation. Extensive experiments on three datasets demonstrate our MV-Performer’s state-of-the-art effectiveness and robustness, setting a strong model for human-centric 4D novel view synthesis.

[249] Resolution scaling governs DINOv3 transfer performance in chest radiograph classification

Soroosh Tayebi Arasteh, Mina Shaigan, Christiane Kuhl, Jakob Nikolas Kather, Sven Nebelung, Daniel Truhn

Main category: cs.CV

TL;DR: DINOv3 SSL model with 512x512 resolution and ConvNeXt-B backbone provides optimal performance for chest radiography, outperforming DINOv2 and ImageNet initialization, with greatest benefits for detecting subtle abnormalities.

Details

Motivation: To systematically evaluate whether DINOv3's design improvements over earlier SSL models provide better transfer learning performance for chest radiography, a high-volume imaging modality with fine-grained findings.

Method: Benchmarked DINOv3 against DINOv2 and ImageNet initialization across 7 datasets (n>814,000) using ViT-B/16 and ConvNeXt-B backbones at multiple resolutions (224x224, 512x512, 1024x1024), and assessed frozen features from a 7B model.

Result: DINOv3 with 512x512 resolution showed consistent improvements over DINOv2 and ImageNet. ConvNeXt-B outperformed ViT-B/16 across all settings. Frozen 7B features underperformed finetuned models. Scaling to 1024x1024 provided no further accuracy gains.

Conclusion: 512x512 resolution represents the practical upper limit where DINOv3-initialized ConvNeXt-B networks provide optimal performance for chest radiography, with greatest clinical benefits for detecting subtle or boundary-centered lesions.

Abstract: Self-supervised learning (SSL) has advanced visual representation learning, but its value in chest radiography, a high-volume imaging modality with fine-grained findings, remains unclear. Meta’s DINOv3 extends earlier SSL models through Gram-anchored self-distillation. Whether these design choices improve transfer learning for chest radiography has not been systematically tested. We benchmarked DINOv3 against DINOv2 and ImageNet initialization across seven datasets (n>814,000). Two representative backbones were evaluated: ViT-B/16 and ConvNeXt-B. Images were analyzed at 224x224, 512x512, and 1024x1024 pixels. We additionally assessed frozen features from a 7B model. The primary outcome was mean AUROC across labels. At 224x224, DINOv3 and DINOv2 achieved comparable performance on adult datasets. Increasing resolution to 512x512 yielded consistent improvements for DINOv3 over both DINOv2 and ImageNet. In contrast, results in pediatric cohort showed no differences across initializations. Across all settings, ConvNeXt-B outperformed ViT-B/16. Models using frozen DINOv3-7B features underperformed relative to fully finetuned 86-89M-parameter backbones, highlighting the importance of domain adaptation. Scaling to 1024x1024 did not further improve accuracy. Resolution-related gains were most evident for boundary-dependent and small focal abnormalities. In chest radiography, higher input resolution is critical for leveraging the benefits of modern self-supervised models. 512x512 pixels represent a practical upper limit where DINOv3-initialized ConvNeXt-B networks provide the strongest performance, while larger inputs offer minimal return on cost. Clinically, these findings support use of finetuned, mid-sized backbones at 512x512 for chest radiograph interpretation, with the greatest gains expected in detecting subtle or boundary-centered lesions relevant to emergency and critical care settings.

[250] EigenScore: OOD Detection using Covariance in Diffusion Models

Shirin Shoushtari, Yi Wang, Xiao Shi, M. Salman Asif, Ulugbek S. Kamilov

Main category: cs.CV

TL;DR: EigenScore is a new OOD detection method that uses the eigenvalue spectrum of posterior covariance from diffusion models to detect distribution shifts, achieving state-of-the-art performance with up to 5% AUROC improvement.

Details

Motivation: Out-of-distribution detection is critical for safe deployment of ML systems in safety-sensitive domains. Diffusion models have shown promise for OOD detection, but existing methods often fail in near-OOD settings.

Method: Leverages eigenvalue spectrum of posterior covariance induced by diffusion models. Uses Jacobian-free subspace iteration to estimate leading eigenvalues using only forward evaluations of the denoiser for tractability.

Result: Achieves SOTA performance with up to 5% AUROC improvement over best baselines. Remains robust in near-OOD settings like CIFAR-10 vs CIFAR-100 where existing diffusion-based methods often fail.

Conclusion: Posterior covariance provides a consistent signal for distribution shift and serves as a reliable OOD detection method, with EigenScore demonstrating strong performance across various settings.

Abstract: Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems in safety-sensitive domains. Diffusion models have recently emerged as powerful generative models, capable of capturing complex data distributions through iterative denoising. Building on this progress, recent work has explored their potential for OOD detection. We propose EigenScore, a new OOD detection method that leverages the eigenvalue spectrum of the posterior covariance induced by a diffusion model. We argue that posterior covariance provides a consistent signal of distribution shift, leading to larger trace and leading eigenvalues on OOD inputs, yielding a clear spectral signature. We further provide analysis explicitly linking posterior covariance to distribution mismatch, establishing it as a reliable signal for OOD detection. To ensure tractability, we adopt a Jacobian-free subspace iteration method to estimate the leading eigenvalues using only forward evaluations of the denoiser. Empirically, EigenScore achieves SOTA performance, with up to 5% AUROC improvement over the best baseline. Notably, it remains robust in near-OOD settings such as CIFAR-10 vs CIFAR-100, where existing diffusion-based methods often fail.

[251] GenPilot: A Multi-Agent System for Test-Time Prompt Optimization in Image Generation

Wen Ye, Zhaocheng Liu, Yuwei Gui, Tingyu Yuan, Yunyue Su, Bowen Fang, Chaoyang Zhao, Qiang Liu, Liang Wang

Main category: cs.CV

TL;DR: GenPilot is a plug-and-play multi-agent system for test-time prompt optimization that improves text-to-image synthesis by analyzing errors, exploring alternatives, and refining prompts iteratively without model retraining.

Details

Motivation: Existing text-to-image models struggle with complex prompts, leading to semantic inconsistencies and missing details. Current solutions like fine-tuning are model-specific and require training, while automatic prompt optimization methods lack systematic error analysis and refinement strategies.

Method: Proposes GenPilot - a multi-agent system with error analysis, clustering-based adaptive exploration, fine-grained verification, and memory module for iterative optimization. It operates directly on input text and is model-agnostic.

Result: Experiments on DPG-bench and Geneval show improvements of up to 16.9% and 5.7% in text-image consistency and structural coherence. The approach is effective for handling long and complex prompts.

Conclusion: GenPilot provides a flexible, efficient, and interpretable test-time prompt optimization strategy that enhances text-to-image synthesis quality without requiring model retraining, while also establishing error patterns and refinement strategies for future research.

Abstract: Text-to-image synthesis has made remarkable progress, yet accurately interpreting complex and lengthy prompts remains challenging, often resulting in semantic inconsistencies and missing details. Existing solutions, such as fine-tuning, are model-specific and require training, while prior automatic prompt optimization (APO) approaches typically lack systematic error analysis and refinement strategies, resulting in limited reliability and effectiveness. Meanwhile, test-time scaling methods operate on fixed prompts and on noise or sample numbers, limiting their interpretability and adaptability. To solve these, we introduce a flexible and efficient test-time prompt optimization strategy that operates directly on the input text. We propose a plug-and-play multi-agent system called GenPilot, integrating error analysis, clustering-based adaptive exploration, fine-grained verification, and a memory module for iterative optimization. Our approach is model-agnostic, interpretable, and well-suited for handling long and complex prompts. Simultaneously, we summarize the common patterns of errors and the refinement strategy, offering more experience and encouraging further exploration. Experiments on DPG-bench and Geneval with improvements of up to 16.9% and 5.7% demonstrate the strong capability of our methods in enhancing the text and image consistency and structural coherence of generated images, revealing the effectiveness of our test-time prompt optimization strategy. The code is available at https://github.com/27yw/GenPilot.

[252] TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation

Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, Chuang Gan

Main category: cs.CV

TL;DR: TalkCuts is a large-scale dataset for multi-shot human speech video generation with 164k clips, 500+ hours of diverse camera shots, and multimodal annotations. The authors also present Orator, an LLM-guided framework that generates coherent long-form videos.

Details

Motivation: To address the limitation of existing datasets that focus on single-shot, static viewpoints by providing a comprehensive resource for multi-shot human speech video generation research.

Method: Created TalkCuts dataset with 164k clips featuring diverse camera shots (close-up, half-body, full-body) and multimodal annotations. Developed Orator framework using LLM as director to orchestrate camera transitions, gestures, and vocal modulation.

Result: Training on TalkCuts significantly enhances cinematographic coherence and visual appeal of generated multi-shot speech videos in both pose-guided and audio-driven settings.

Conclusion: TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

Abstract: In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.

[253] Evaluating Fundus-Specific Foundation Models for Diabetic Macular Edema Detection

Franco Javier Arellano, José Ignacio Orlando

Main category: cs.CV

TL;DR: Foundation Models (RETFound and FLAIR) don’t consistently outperform fine-tuned EfficientNet-B0 for Diabetic Macular Edema detection, with lightweight CNNs remaining strong baselines in data-scarce settings.

Details

Motivation: To determine if Foundation Models can effectively detect Diabetic Macular Edema (DME) from fundus images, given the challenges of limited annotated data and the need for reliable automated detection methods.

Method: Systematic comparison of two popular retinal Foundation Models (RETFound and FLAIR) against EfficientNet-B0 backbone across different training regimes and evaluation settings using IDRiD, MESSIDOR-2 and OCT-and-Eye-Fundus-Images datasets.

Result: EfficientNet-B0 ranked first or second in most evaluation settings for ROC and precision/recall curves. RETFound showed promise only in OEFI dataset, while FLAIR demonstrated competitive zero-shot performance with appropriate prompting.

Conclusion: Foundation Models may not be suitable for fine-grained ophthalmic tasks like DME detection even after fine-tuning, suggesting lightweight CNNs remain effective baselines in data-scarce environments.

Abstract: Diabetic Macular Edema (DME) is a leading cause of vision loss among patients with Diabetic Retinopathy (DR). While deep learning has shown promising results for automatically detecting this condition from fundus images, its application remains challenging due the limited availability of annotated data. Foundation Models (FM) have emerged as an alternative solution. However, it is unclear if they can cope with DME detection in particular. In this paper, we systematically compare different FM and standard transfer learning approaches for this task. Specifically, we compare the two most popular FM for retinal images–RETFound and FLAIR–and an EfficientNet-B0 backbone, across different training regimes and evaluation settings in IDRiD, MESSIDOR-2 and OCT-and-Eye-Fundus-Images (OEFI). Results show that despite their scale, FM do not consistently outperform fine-tuned CNNs in this task. In particular, an EfficientNet-B0 ranked first or second in terms of area under the ROC and precision/recall curves in most evaluation settings, with RETFound only showing promising results in OEFI. FLAIR, on the other hand, demonstrated competitive zero-shot performance, achieving notable AUC-PR scores when prompted appropriately. These findings reveal that FM might not be a good tool for fine-grained ophthalmic tasks such as DME detection even after fine-tuning, suggesting that lightweight CNNs remain strong baselines in data-scarce environments.

[254] SpecGuard: Spectral Projection-based Advanced Invisible Watermarking

Inzamamul Alam, Md Tanvir Islam, Khan Muhammad, Simon S. Woo

Main category: cs.CV

TL;DR: SpecGuard is a robust invisible image watermarking method that embeds messages in hidden convolution layers using frequency domain transformation via spectral projection and wavelet decomposition, achieving superior performance against various attacks.

Details

Motivation: Existing watermarking methods lack robustness against transformations like distortions, image regeneration, and adversarial perturbations, creating challenges for real-world authenticity verification.

Method: Embeds messages in hidden convolution layers by converting spatial domain to frequency domain using spectral projection of higher frequency bands decomposed by wavelet projection. Uses Fast Fourier Transform for efficient transformation and includes a strength factor for enhanced resilience against attacks.

Result: Comprehensive experiments demonstrate SpecGuard outperforms state-of-the-art models in terms of watermark invisibility, capacity, and robustness against diverse attacks including adversarial, geometric, and regeneration-based distortions.

Conclusion: SpecGuard provides an effective solution for robust and invisible image watermarking, with the full code released on GitHub to ensure reproducibility.

Abstract: Watermarking embeds imperceptible patterns into images for authenticity verification. However, existing methods often lack robustness against various transformations primarily including distortions, image regeneration, and adversarial perturbation, creating real-world challenges. In this work, we introduce SpecGuard, a novel watermarking approach for robust and invisible image watermarking. Unlike prior approaches, we embed the message inside hidden convolution layers by converting from the spatial domain to the frequency domain using spectral projection of a higher frequency band that is decomposed by wavelet projection. Spectral projection employs Fast Fourier Transform approximation to transform spatial data into the frequency domain efficiently. In the encoding phase, a strength factor enhances resilience against diverse attacks, including adversarial, geometric, and regeneration-based distortions, ensuring the preservation of copyrighted information. Meanwhile, the decoder leverages Parseval’s theorem to effectively learn and extract the watermark pattern, enabling accurate retrieval under challenging transformations. We evaluate the proposed SpecGuard based on the embedded watermark’s invisibility, capacity, and robustness. Comprehensive experiments demonstrate the proposed SpecGuard outperforms the state-of-the-art models. To ensure reproducibility, the full code is released on \href{https://github.com/inzamamulDU/SpecGuard_ICCV_2025}{\textcolor{blue}{\textbf{GitHub}}}.

[255] MATRIX: Mask Track Alignment for Interaction-aware Video Generation

Siyoon Jin, Seongchan Kim, Dahyun Chung, Jaeho Lee, Hyunwook Choi, Jisu Nam, Jiyoung Kim, Seungryong Kim

Main category: cs.CV

TL;DR: The paper analyzes how video diffusion transformers (DiTs) represent interactions, introduces MATRIX-11K dataset with interaction-aware captions, and proposes MATRIX regularization to improve interaction modeling in video DiTs.

Details

Motivation: Video DiTs struggle with multi-instance and subject-object interactions, raising questions about how these models internally represent interactions.

Method: Created MATRIX-11K dataset with interaction-aware captions and multi-instance mask tracks. Analyzed video DiTs through semantic grounding (video-to-text attention) and semantic propagation (video-to-video attention). Proposed MATRIX regularization to align attention with multi-instance mask tracks.

Result: Found interaction effects concentrate in a small subset of layers. MATRIX improves interaction fidelity, semantic alignment while reducing drift and hallucination.

Conclusion: MATRIX regularization effectively enhances video DiTs’ ability to model interactions through attention alignment with multi-instance mask tracks.

Abstract: Video DiTs have advanced video generation, yet they still struggle to model multi-instance or subject-object interactions. This raises a key question: How do these models internally represent interactions? To answer this, we curate MATRIX-11K, a video dataset with interaction-aware captions and multi-instance mask tracks. Using this dataset, we conduct a systematic analysis that formalizes two perspectives of video DiTs: semantic grounding, via video-to-text attention, which evaluates whether noun and verb tokens capture instances and their relations; and semantic propagation, via video-to-video attention, which assesses whether instance bindings persist across frames. We find both effects concentrate in a small subset of interaction-dominant layers. Motivated by this, we introduce MATRIX, a simple and effective regularization that aligns attention in specific layers of video DiTs with multi-instance mask tracks from the MATRIX-11K dataset, enhancing both grounding and propagation. We further propose InterGenEval, an evaluation protocol for interaction-aware video generation. In experiments, MATRIX improves both interaction fidelity and semantic alignment while reducing drift and hallucination. Extensive ablations validate our design choices. Codes and weights will be released.

[256] WristWorld: Generating Wrist-Views via 4D World Models for Robotic Manipulation

Zezhong Qian, Xiaowei Chi, Yuming Li, Shizun Wang, Zhiyuan Qin, Xiaozhu Ju, Sirui Han, Shanghang Zhang

Main category: cs.CV

TL;DR: WristWorld is a 4D world model that generates wrist-view videos from anchor views alone, addressing the scarcity of wrist-view recordings in large-scale datasets. It uses geometric reconstruction and video generation to create spatially consistent wrist-view videos that improve VLA manipulation performance.

Details

Motivation: Wrist-view observations are crucial for VLA models to capture fine-grained hand-object interactions, but large-scale datasets rarely include such recordings, creating a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap as they require wrist-view first frames.

Method: Two-stage approach: (1) Reconstruction stage extends VGGT with Spatial Projection Consistency Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (2) Generation stage uses a video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective.

Result: State-of-the-art video generation with superior spatial consistency on Droid, Calvin, and Franka Panda datasets. Improves VLA performance by raising average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.

Conclusion: WristWorld successfully bridges the gap between anchor and wrist views by leveraging geometric priors and video generation, enabling wrist-view video synthesis from anchor views alone and significantly enhancing VLA manipulation capabilities.

Abstract: Wrist-view observations are crucial for VLA models as they capture fine-grained hand-object interactions that directly enhance manipulation performance. Yet large-scale datasets rarely include such recordings, resulting in a substantial gap between abundant anchor views and scarce wrist views. Existing world models cannot bridge this gap, as they require a wrist-view first frame and thus fail to generate wrist-view videos from anchor views alone. Amid this gap, recent visual geometry models such as VGGT emerge with geometric and cross-view priors that make it possible to address extreme viewpoint shifts. Inspired by these insights, we propose WristWorld, the first 4D world model that generates wrist-view videos solely from anchor views. WristWorld operates in two stages: (i) Reconstruction, which extends VGGT and incorporates our Spatial Projection Consistency (SPC) Loss to estimate geometrically consistent wrist-view poses and 4D point clouds; (ii) Generation, which employs our video generation model to synthesize temporally coherent wrist-view videos from the reconstructed perspective. Experiments on Droid, Calvin, and Franka Panda demonstrate state-of-the-art video generation with superior spatial consistency, while also improving VLA performance, raising the average task completion length on Calvin by 3.81% and closing 42.4% of the anchor-wrist view gap.

[257] Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers

Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Sida Peng, Xin Yang

Main category: cs.CV

TL;DR: Pixel-Perfect Depth is a monocular depth estimation model that uses pixel-space diffusion generation to create high-quality, flying-pixel-free point clouds, outperforming existing generative models.

Details

Motivation: Current generative depth estimation models using Stable Diffusion require VAE compression which introduces flying pixels at edges and details, degrading depth map quality.

Method: Direct pixel-space diffusion generation using Semantics-Prompted Diffusion Transformers (SP-DiT) that incorporate semantic representations from vision foundation models, and Cascade DiT Design that progressively increases tokens for efficiency and accuracy.

Result: Achieves best performance among all published generative models across five benchmarks and significantly outperforms all other models in edge-aware point cloud evaluation.

Conclusion: Pixel-space diffusion generation with semantic prompting and cascade design effectively eliminates VAE-induced artifacts and produces superior depth estimation results.

Abstract: This paper presents Pixel-Perfect Depth, a monocular depth estimation model based on pixel-space diffusion generation that produces high-quality, flying-pixel-free point clouds from estimated depth maps. Current generative depth estimation models fine-tune Stable Diffusion and achieve impressive performance. However, they require a VAE to compress depth maps into latent space, which inevitably introduces \textit{flying pixels} at edges and details. Our model addresses this challenge by directly performing diffusion generation in the pixel space, avoiding VAE-induced artifacts. To overcome the high complexity associated with pixel-space generation, we introduce two novel designs: 1) Semantics-Prompted Diffusion Transformers (SP-DiT), which incorporate semantic representations from vision foundation models into DiT to prompt the diffusion process, thereby preserving global semantic consistency while enhancing fine-grained visual details; and 2) Cascade DiT Design that progressively increases the number of tokens to further enhance efficiency and accuracy. Our model achieves the best performance among all published generative models across five benchmarks, and significantly outperforms all other models in edge-aware point cloud evaluation.

[258] Quantum-enhanced Computer Vision: Going Beyond Classical Algorithms

Natacha Kuete Meli, Shuteng Wang, Marcel Seelbach Benkner, Michele Sasdelli, Tat-Jun Chin, Tolga Birdal, Michael Moeller, Vladislav Golyanik

Main category: cs.CV

TL;DR: This survey provides a comprehensive review of Quantum-enhanced Computer Vision (QeCV), covering quantum computing fundamentals, methodologies compatible with quantum hardware, available tools, and discussing challenges and social implications.

Details

Motivation: To bridge the gap between computer vision and quantum computing communities by providing a holistic reference that introduces QeCV concepts, methodologies, and tools for computer vision researchers and students.

Method: The survey reviews QeCV through two main quantum computational paradigms: gate-based quantum computing and quantum annealing. It covers operational principles of quantum computers, available programming/simulation tools, and methodologies for quantum-compatible computer vision formulations.

Result: Provides a comprehensive introduction to QeCV, its specifics, and methodologies for quantum-compatible computer vision approaches, serving as a reference for the computer vision community to understand and engage with quantum computing.

Conclusion: QeCV has high potential to transform visual signal processing, but requires development of fundamentally new quantum-compatible algorithms and faces open challenges that need to be addressed for practical implementation and social impact assessment.

Abstract: Quantum-enhanced Computer Vision (QeCV) is a new research field at the intersection of computer vision, optimisation theory, machine learning and quantum computing. It has high potential to transform how visual signals are processed and interpreted with the help of quantum computing that leverages quantum-mechanical effects in computations inaccessible to classical (i.e. non-quantum) computers. In scenarios where existing non-quantum methods cannot find a solution in a reasonable time or compute only approximate solutions, quantum computers can provide, among others, advantages in terms of better time scalability for multiple problem classes. Parametrised quantum circuits can also become, in the long term, a considerable alternative to classical neural networks in computer vision. However, specialised and fundamentally new algorithms must be developed to enable compatibility with quantum hardware and unveil the potential of quantum computational paradigms in computer vision. This survey contributes to the existing literature on QeCV with a holistic review of this research field. It is designed as a quantum computing reference for the computer vision community, targeting computer vision students, scientists and readers with related backgrounds who want to familiarise themselves with QeCV. We provide a comprehensive introduction to QeCV, its specifics, and methodologies for formulations compatible with quantum hardware and QeCV methods, leveraging two main quantum computational paradigms, i.e. gate-based quantum computing and quantum annealing. We elaborate on the operational principles of quantum computers and the available tools to access, program and simulate them in the context of QeCV. Finally, we review existing quantum computing tools and learning materials and discuss aspects related to publishing and reviewing QeCV papers, open challenges and potential social implications.

[259] Temporal Prompting Matters: Rethinking Referring Video Object Segmentation

Ci-Siang Lin, Min-Hung Chen, I-Jieh Liu, Chien-Yi Wang, Sifei Liu, Yu-Chiang Frank Wang

Main category: cs.CV

TL;DR: The paper proposes Tenet, a framework that decomposes RVOS into referring, video, and segmentation factors, using temporal prompts and preference learning to adapt foundation segmentation models for efficient referring video object segmentation.

Details

Motivation: Existing RVOS methods require end-to-end training with dense mask annotations, which is computation-consuming and less scalable. The authors aim to investigate the key to RVOS and enable efficient adaptation of foundation models.

Method: Decompose RVOS into three factors; use object detectors and trackers to produce temporal prompts; propose Prompt Preference Learning to evaluate prompt quality; leverage image-based foundation segmentation models with these prompts.

Result: Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework in producing high-quality masks for referred objects.

Conclusion: The Tenet framework enables efficient adaptation of foundation segmentation models to RVOS by addressing referring and video factors while leveraging existing segmentation capabilities, achieving strong performance without dense end-to-end training.

Abstract: Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence in the video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we rethink the RVOS problem and aim to investigate the key to this task. Based on existing foundation segmentation models, we decompose the RVOS task into referring, video, and segmentation factors, and propose a Temporal Prompt Generation and Selection (Tenet) framework to address the referring and video factors while leaving the segmentation problem to foundation models. To efficiently adapt image-based foundation segmentation models to referring video object segmentation, we leverage off-the-shelf object detectors and trackers to produce temporal prompts associated with the referring sentence. While high-quality temporal prompts could be produced, they can not be easily identified from confidence scores. To tackle this issue, we propose Prompt Preference Learning to evaluate the quality of the produced temporal prompts. By taking such prompts to instruct image-based foundation segmentation models, we would be able to produce high-quality masks for the referred object, enabling efficient model adaptation to referring video object segmentation. Experiments on RVOS benchmarks demonstrate the effectiveness of the Tenet framework.

[260] LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning

Yolo Yunlong Tang, Jinrui Zhang, Xiangchen Wang, Teng Wang, Feng Zheng

Main category: cs.CV

TL;DR: This paper presents LLMVA-GEBC, a winning solution for CVPR 2023 GEBC competition that uses a large language model with video adapter to generate captions for event boundaries in videos.

Details

Motivation: Generic Event Boundary Captioning (GEBC) requires understanding immediate status changes around video boundaries, making it more challenging than conventional video captioning tasks.

Method: Proposes LLMVA-GEBC: (1) Uses pretrained LLM for high-quality human-like captions, (2) Employs video Q-former as adapter trained with frozen visual feature extractors and LLM.

Result: Achieved 76.14 score on test set and won first place in the CVPR 2023 GEBC competition.

Conclusion: The proposed LLMVA-GEBC method effectively addresses the GEBC task by combining pretrained LLMs with video adapters, demonstrating state-of-the-art performance in the competition.

Abstract: Our winning entry for the CVPR 2023 Generic Event Boundary Captioning (GEBC) competition is detailed in this paper. Unlike conventional video captioning tasks, GEBC demands that the captioning model possess an understanding of immediate changes in status around the designated video boundary, making it a difficult task. This paper proposes an effective model LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) We utilize a pretrained LLM for generating human-like captions with high quality. (2) To adapt the model to the GEBC task, we take the video Q-former as an adapter and train it with the frozen visual feature extractors and LLM. Our proposed method achieved a 76.14 score on the test set and won the first place in the challenge. Our code is available at https://github.com/zjr2000/LLMVA-GEBC .

[261] Is My Data in Your AI? Membership Inference Test (MINT) applied to Face Biometrics

Daniel DeAlcala, Aythami Morales, Julian Fierrez, Gonzalo Mancera, Ruben Tolosana, Javier Ortega-Garcia

Main category: cs.CV

TL;DR: MINT is a novel approach to detect if specific data was used in training AI models, achieving up to 90% accuracy in face recognition systems.

Details

Motivation: To enforce privacy and fairness in AI applications by revealing if sensitive or private data was used for training models like LLMs.

Method: Two architectures based on MLPs and CNNs that learn distinct activation patterns when models are exposed to their training data.

Result: Achieved up to 90% accuracy in detecting training data usage across three state-of-the-art face recognition systems using six databases with over 22 million images.

Conclusion: MINT shows potential for recognizing if AI models were trained with specific data, serving privacy and fairness enforcement in AI applications.

Abstract: This article introduces the Membership Inference Test (MINT), a novel approach that aims to empirically assess if given data was used during the training of AI/ML models. Specifically, we propose two MINT architectures designed to learn the distinct activation patterns that emerge when an Audited Model is exposed to data used during its training process. These architectures are based on Multilayer Perceptrons (MLPs) and Convolutional Neural Networks (CNNs). The experimental framework focuses on the challenging task of Face Recognition, considering three state-of-the-art Face Recognition systems. Experiments are carried out using six publicly available databases, comprising over 22 million face images in total. Different experimental scenarios are considered depending on the context of the AI model to test. Our proposed MINT approach achieves promising results, with up to 90% accuracy, indicating the potential to recognize if an AI model has been trained with specific data. The proposed MINT approach can serve to enforce privacy and fairness in several AI applications, e.g., revealing if sensitive or private data was used for training or tuning Large Language Models (LLMs).

[262] Unlocking Dataset Distillation with Diffusion Models

Brian B. Moser, Federico Raue, Sebastian Palacio, Stanislav Frolov, Andreas Dengel

Main category: cs.CV

TL;DR: LD3M is the first method that enables gradient-based dataset distillation using pre-trained latent diffusion models by introducing a linearly decaying skip connection to preserve gradients across denoising steps.

Details

Motivation: Current dataset distillation methods avoid diffusion models due to vanishing gradient problems during backpropagation through long denoising chains, limiting their use despite diffusion models' superior generative performance.

Method: LD3M learns gradient-based distilled latents and class embeddings end-to-end through a pre-trained latent diffusion model using a linearly decaying skip connection from the initial noisy state into every reverse step to preserve gradient signals.

Result: LD3M improves downstream accuracy by up to 4.8 percentage points (1 IPC) and 4.2 points (10 IPC) over prior state-of-the-art methods across multiple ImageNet subsets at 128x128 and 256x256 resolutions.

Conclusion: The proposed LD3M method successfully enables effective dataset distillation using diffusion models by solving the gradient vanishing problem, achieving significant improvements in downstream task performance.

Abstract: Dataset distillation seeks to condense datasets into smaller but highly representative synthetic samples. While diffusion models now lead all generative benchmarks, current distillation methods avoid them and rely instead on GANs or autoencoders, or, at best, sampling from a fixed diffusion prior. This trend arises because naive backpropagation through the long denoising chain leads to vanishing gradients, which prevents effective synthetic sample optimization. To address this limitation, we introduce Latent Dataset Distillation with Diffusion Models (LD3M), the first method to learn gradient-based distilled latents and class embeddings end-to-end through a pre-trained latent diffusion model. A linearly decaying skip connection, injected from the initial noisy state into every reverse step, preserves the gradient signal across dozens of timesteps without requiring diffusion weight fine-tuning. Across multiple ImageNet subsets at 128x128 and 256x256, LD3M improves downstream accuracy by up to 4.8 percentage points (1 IPC) and 4.2 points (10 IPC) over the prior state-of-the-art. The code for LD3M is provided at https://github.com/Brian-Moser/prune_and_distill.

[263] LoDisc: Learning Global-Local Discriminative Features for Self-Supervised Fine-Grained Visual Recognition

Jialu Shi, Zhiqiang Wei, Jie Nie, Lei Huang

Main category: cs.CV

TL;DR: The paper proposes a self-supervised global-local fine-grained contrastive learning framework that enhances fine-grained visual recognition by incorporating local discrimination to focus on pivotal regions, improving both fine-grained and general object recognition.

Details

Motivation: Current contrastive learning methods learn global coarse-grained representations that are insufficient for fine-grained visual recognition, which requires attention to subtle local features.

Method: Introduces a local discrimination (LoDisc) pretext task with location-wise mask sampling to supervise the model’s focus on local pivotal regions, combined with global self-supervised contrastive learning.

Result: Extensive experiments show decent improvements on fine-grained object recognition tasks across different evaluation settings, and the method is also effective for general object recognition.

Conclusion: The proposed global-local framework effectively refines fine-grained feature representations by enhancing local region focus, demonstrating broad applicability across recognition tasks.

Abstract: The self-supervised contrastive learning strategy has attracted considerable attention due to its exceptional ability in representation learning. However, current contrastive learning tends to learn global coarse-grained representations of the image that benefit generic object recognition, whereas such coarse-grained features are insufficient for fine-grained visual recognition. In this paper, we incorporate subtle local fine-grained feature learning into global self-supervised contrastive learning through a pure self-supervised global-local fine-grained contrastive learning framework. Specifically, a novel pretext task called local discrimination (LoDisc) is proposed to explicitly supervise the self-supervised model’s focus toward local pivotal regions, which are captured by a simple but effective location-wise mask sampling strategy. We show that the LoDisc pretext task can effectively enhance fine-grained clues in important local regions and that the global-local framework further refines the fine-grained feature representations of images. Extensive experimental results on different fine-grained object recognition tasks demonstrate that the proposed method can lead to a decent improvement in different evaluation settings. The proposed method is also effective for general object recognition tasks.

[264] Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Yolo Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, Chenliang Xu

Main category: cs.CV

TL;DR: PU-VALOR is a large audio-visual dataset with temporal annotations used to train AVicuna, a multimodal LLM that excels at temporal localization and time-aware dialogue in videos.

Details

Motivation: There's a lack of untrimmed audio-visual datasets with precise temporal annotations, which hinders LLMs from learning alignment between time, audio-visual events, and text tokens for temporal localization tasks.

Method: Created PU-VALOR dataset from VALOR through event-based video clustering, random temporal scaling, and permutation. Fine-tuned multimodal LLM on this dataset to develop AVicuna model.

Result: AVicuna achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks, demonstrating effective temporal understanding in audio-visual videos.

Conclusion: The proposed approach successfully addresses the temporal annotation gap in audio-visual datasets and enables multimodal LLMs to perform accurate temporal localization and time-aware dialogue in videos.

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs with temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, their temporal understanding capacity in video-language tasks can be obtained. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue capabilities. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.

Hang Hua, Yolo Yunlong Tang, Chenliang Xu, Jiebo Luo

Main category: cs.CV

TL;DR: The paper introduces Instruct-V2Xum, a large-scale cross-modal video summarization dataset with 30,000 YouTube videos, and proposes V2Xum-LLM, a unified framework that handles multiple video summarization tasks using LLMs with temporal prompts and task instructions.

Details

Motivation: Existing video summarization datasets have limited source videos, hindering training of large vision-language models, and focus mainly on video-to-video summarization while overlooking multimodal needs. Current multimodal datasets have inadequate textual summaries.

Method: Created Instruct-V2Xum dataset with 30,000 diverse YouTube videos (40-940 seconds) and paired video-text summaries with frame references. Proposed V2Xum-LLM framework that unifies different summarization tasks into LLM text decoder using temporal prompts and task instructions.

Result: V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Also proposed enhanced evaluation metrics for V2V and V2VT summarization tasks.

Conclusion: The work addresses limitations in existing video summarization datasets and methods by providing a large-scale multimodal dataset and a unified framework that achieves state-of-the-art performance across different summarization tasks.

Abstract: Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited amount of source videos, which hampers the effective training of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary’s modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model’s (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.

[266] Decomposed Global Optimization for Robust Point Matching with Low-Dimensional Branching

Wei Lian, Zhesen Cui, Fei Ma, Hang Pan, Wangmeng Zuo, Jianmei Zhang

Main category: cs.CV

TL;DR: A novel global optimization method for aligning partially overlapping point sets with transformation invariance, using a modified RPM objective and specialized BnB algorithm.

Details

Motivation: Need for robust point set alignment algorithms that maintain invariance to geometric transformations while handling partial overlaps, noise, and outliers.

Method: Transform RPM objective to quadratic form, derive tight lower bound using convex envelopes, decompose into linear assignment and quadratic program, and use specialized BnB algorithm focusing on transformation parameters.

Result: Method shows superior robustness to non-rigid deformations, positional noise, and outliers compared to state-of-the-art approaches, especially when outliers are distinct from inliers.

Conclusion: The proposed approach provides an efficient and robust solution for point set alignment with transformation invariance, outperforming existing methods in challenging scenarios.

Abstract: Numerous applications require algorithms that can align partially overlapping point sets while maintaining invariance to geometric transformations (e.g., similarity, affine, rigid). This paper introduces a novel global optimization method for this task by minimizing the objective function of the Robust Point Matching (RPM) algorithm. We first reveal that the original RPM objective is a cubic polynomial. Through a concise variable substitution, we transform this objective into a quadratic function. By leveraging the convex envelope of bilinear monomials, we derive a tight lower bound for this quadratic function. This lower bound problem conveniently and efficiently decomposes into two parts: a standard linear assignment problem (solvable in polynomial time) and a low-dimensional convex quadratic program. Furthermore, we devise a specialized Branch-and-Bound (BnB) algorithm that branches exclusively on the transformation parameters, which significantly accelerates convergence by confining the search space. Experiments on 2D and 3D synthetic and real-world data demonstrate that our method, compared to state-of-the-art approaches, exhibits superior robustness to non-rigid deformations, positional noise, and outliers, particularly in scenarios where outliers are distinct from inliers.

[267] CaRDiff: Video Salient Object Ranking Chain of Thought Reasoning for Saliency Prediction with Diffusion

Yolo Yunlong Tang, Gen Zhan, Li Yang, Yiting Liao, Chenliang Xu

Main category: cs.CV

TL;DR: CaRDiff is a novel framework that integrates multimodal large language models, grounding modules, and diffusion models to enhance video saliency prediction by incorporating language-guided reasoning through salient object ranking.

Details

Motivation: Existing video saliency prediction methods focus mainly on perceptual information while neglecting the reasoning process facilitated by language, where ranking cues from language interpretation play a crucial role in guiding human attention.

Method: Proposes CaRDiff framework with VSOR-CoT prompting method that uses MLLM with grounding module to caption video content, infer salient objects with rankings and positions, then leverages diffusion model to decode saliency maps from ranking maps.

Result: CaRDiff outperforms state-of-the-art models on MVS dataset and demonstrates cross-dataset capabilities on DHF1k dataset through zero-shot evaluation.

Conclusion: Integrating language-guided reasoning through salient object ranking significantly improves video saliency prediction performance and enables cross-dataset generalization.

Abstract: Video saliency prediction aims to identify the regions in a video that attract human attention and gaze, driven by bottom-up features from the video and top-down processes like memory and cognition. Among these top-down influences, language plays a crucial role in guiding attention by shaping how visual information is interpreted. Existing methods primarily focus on modeling perceptual information while neglecting the reasoning process facilitated by language, where ranking cues are crucial outcomes of this process and practical guidance for saliency prediction. In this paper, we propose CaRDiff (Caption, Rank, and generate with Diffusion), a framework that imitates the process by integrating a multimodal large language model (MLLM), a grounding module, and a diffusion model, to enhance video saliency prediction. Specifically, we introduce a novel prompting method VSOR-CoT (Video Salient Object Ranking Chain of Thought), which utilizes an MLLM with a grounding module to caption video content and infer salient objects along with their rankings and positions. This process derives ranking maps that can be sufficiently leveraged by the diffusion model to decode the saliency maps for the given video accurately. Extensive experiments show the effectiveness of VSOR-CoT in improving the performance of video saliency prediction. The proposed CaRDiff performs better than state-of-the-art models on the MVS dataset and demonstrates cross-dataset capabilities on the DHF1k dataset through zero-shot evaluation.

[268] Taming Diffusion Models for Image Restoration: A Review

Ziwei Luo, Fredrik K. Gustafsson, Zheng Zhao, Jens Sjölund, Thomas B. Schön

Main category: cs.CV

TL;DR: This review paper surveys diffusion models for image restoration tasks, covering key concepts, current techniques, challenges, and future directions.

Details

Motivation: Diffusion models have shown remarkable progress in generative modeling and image quality enhancement, making them promising for low-level computer vision tasks like image restoration.

Method: The paper introduces key constructions in diffusion models and surveys contemporary techniques that use diffusion models for solving general image restoration tasks.

Result: The review provides a comprehensive overview of diffusion-based approaches for image restoration, including denoising, deblurring, dehazing, and other IR tasks.

Conclusion: The paper identifies main challenges and limitations of existing diffusion-based IR frameworks and provides potential directions for future research in this area.

Abstract: Diffusion models have achieved remarkable progress in generative modelling, particularly in enhancing image quality to conform to human preferences. Recently, these models have also been applied to low-level computer vision for photo-realistic image restoration (IR) in tasks such as image denoising, deblurring, dehazing, etc. In this review paper, we introduce key constructions in diffusion models and survey contemporary techniques that make use of diffusion models in solving general IR tasks. Furthermore, we point out the main challenges and limitations of existing diffusion-based IR frameworks and provide potential directions for future work.

[269] VidComposition: Can MLLMs Analyze Compositions in Compiled Videos?

Yolo Yunlong Tang, Junjia Guo, Hang Hua, Susan Liang, Mingqian Feng, Xinyang Li, Rui Mao, Chao Huang, Jing Bi, Zeliang Zhang, Pooyan Fazli, Chenliang Xu

Main category: cs.CV

TL;DR: VidComposition is a new benchmark for evaluating Multimodal Large Language Models’ ability to understand video compositions, revealing significant performance gaps between humans and current models.

Details

Motivation: Existing MLLM benchmarks focus on abstract video comprehension but lack detailed assessment of video composition understanding - how visual elements combine and interact in compiled videos.

Method: Created VidComposition benchmark with 982 curated compiled videos and 1706 multiple-choice questions covering compositional aspects like camera movement, angle, shot size, narrative structure, character actions and emotions.

Result: Evaluation of 33 open-source and proprietary MLLMs showed significant performance gap between human and model capabilities in understanding complex video compositions.

Conclusion: Current MLLMs have limitations in understanding compiled video compositions, highlighting areas for improvement in multimodal video understanding capabilities.

Abstract: The advancement of Multimodal Large Language Models (MLLMs) has enabled significant progress in multimodal understanding, expanding their capacity to analyze video content. However, existing evaluation benchmarks for MLLMs primarily focus on abstract video comprehension, lacking a detailed assessment of their ability to understand video compositions, the nuanced interpretation of how visual elements combine and interact within highly compiled video contexts. We introduce VidComposition, a new benchmark specifically designed to evaluate the video composition understanding capabilities of MLLMs using carefully curated compiled videos and cinematic-level annotations. VidComposition includes 982 videos with 1706 multiple-choice questions, covering various compositional aspects such as camera movement, angle, shot size, narrative structure, character actions and emotions, etc. Our comprehensive evaluation of 33 open-source and proprietary MLLMs reveals a significant performance gap between human and model capabilities. This highlights the limitations of current MLLMs in understanding complex, compiled video compositions and offers insights into areas for further improvement. The leaderboard and evaluation code are available at https://yunlong10.github.io/VidComposition/.

[270] Sustainable Self-evolution Adversarial Training

Wenxuan Wang, Chenglei Wang, Huihui Qi, Menghao Ye, Xuelin Qian, Peng Wang, Yanning Zhang

Main category: cs.CV

TL;DR: SSEAT is a sustainable self-evolution adversarial training framework that addresses the limitations of existing adversarial training methods by enabling continual learning from evolving attacks while preventing catastrophic forgetting through data replay and consistency regularization.

Details

Motivation: Existing adversarial training defense models struggle to adapt to the dynamic and evolving nature of attack methods, as they rely on single or limited types of attacks under one-time learning processes, making them inadequate for long-term model security.

Method: Proposes a continual adversarial defense pipeline for learning from various adversarial examples across multiple stages, an adversarial data replay module to select diverse and key relearning data, and a consistency regularization strategy to retain past knowledge and maintain clean sample accuracy.

Result: Extensive experiments demonstrate superior defense performance and classification accuracy compared to competitors, showing the efficacy of the proposed SSEAT defense method.

Conclusion: SSEAT provides an effective framework for sustainable adversarial defense that can adapt to evolving attack methods while maintaining model performance on clean samples through continual learning and knowledge retention mechanisms.

Abstract: With the wide application of deep neural network models in various computer vision tasks, there has been a proliferation of adversarial example generation strategies aimed at deeply exploring model security. However, existing adversarial training defense models, which rely on single or limited types of attacks under a one-time learning process, struggle to adapt to the dynamic and evolving nature of attack methods. Therefore, to achieve defense performance improvements for models in long-term applications, we propose a novel Sustainable Self-Evolution Adversarial Training (SSEAT) framework. Specifically, we introduce a continual adversarial defense pipeline to realize learning from various kinds of adversarial examples across multiple stages. Additionally, to address the issue of model catastrophic forgetting caused by continual learning from ongoing novel attacks, we propose an adversarial data replay module to better select more diverse and key relearning data. Furthermore, we design a consistency regularization strategy to encourage current defense models to learn more from previously trained ones, guiding them to retain more past knowledge and maintain accuracy on clean samples. Extensive experiments have been conducted to verify the efficacy of the proposed SSEAT defense method, which demonstrates superior defense performance and classification accuracy compared to competitors.Code is available at https://github.com/aup520/SSEAT

[271] Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine

Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, Yehui Yang

Main category: cs.CV

TL;DR: MedPLIB is a biomedical multimodal LLM with pixel-level understanding that supports VQA, arbitrary pixel-level prompts, and grounding. It uses a novel MoE training strategy and achieves SOTA results on medical vision tasks.

Details

Motivation: Current biomedical MLLMs focus only on image-level understanding and textual interactions, limiting their capabilities and flexibility for medical applications.

Method: Proposes an end-to-end multimodal LLM with pixel-level understanding, using a novel Mixture-of-Experts multi-stage training strategy that separates visual-language and pixel-grounding expert models before MoE fine-tuning.

Result: Achieves state-of-the-art outcomes across multiple medical visual language tasks. In zero-shot pixel grounding, leads best small and large models by 19.7 and 15.6 mDice margins respectively.

Conclusion: MedPLIB demonstrates superior pixel-level understanding capabilities for biomedical applications and introduces the MeCoVQA dataset to advance biomedical MLLM research.

Abstract: In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at https://github.com/ShawnHuang497/MedPLIB.

[272] Generative AI for Cel-Animation: A Survey

Yolo Yunlong Tang, Junjia Guo, Pinxin Liu, Zhiyuan Wang, Hang Hua, Jia-Xing Zhong, Yunzhong Xiao, Chao Huang, Luchuan Song, Susan Liang, Yizhi Song, Liu He, Jing Bi, Mingqian Feng, Xinyang Li, Zeliang Zhang, Chenliang Xu

Main category: cs.CV

TL;DR: This survey explores how generative AI is transforming traditional Cel animation workflows by automating key production steps like inbetweening, colorization, and storyboarding, while addressing challenges like visual consistency and ethical considerations.

Details

Motivation: Traditional Cel animation production requires substantial manual effort, technical expertise, and time investment, which has historically limited efficiency and scalability. The rise of generative AI offers opportunities to automate labor-intensive tasks and make animation more accessible.

Method: The paper surveys the integration of generative AI technologies (large language models, multimodal models, diffusion models) into animation workflows, examining tools like AniDoc, ToonCrafter, and AniSora that automate tasks such as inbetween frame generation, colorization, and storyboard creation.

Result: GenAI integration is revolutionizing animation workflows by lowering technical barriers, broadening accessibility for creators, and enabling artists to focus more on creative expression and artistic innovation rather than manual technical work.

Conclusion: While generative AI shows significant potential for transforming animation production, challenges remain in visual consistency, stylistic coherence, and ethical considerations. The paper also explores future directions for AI-assisted animation development.

Abstract: Traditional Celluloid (Cel) Animation production pipeline encompasses multiple essential steps, including storyboarding, layout design, keyframe animation, inbetweening, and colorization, which demand substantial manual effort, technical expertise, and significant time investment. These challenges have historically impeded the efficiency and scalability of Cel-Animation production. The rise of generative artificial intelligence (GenAI), encompassing large language models, multimodal models, and diffusion models, offers innovative solutions by automating tasks such as inbetween frame generation, colorization, and storyboard creation. This survey explores how GenAI integration is revolutionizing traditional animation workflows by lowering technical barriers, broadening accessibility for a wider range of creators through tools like AniDoc, ToonCrafter, and AniSora, and enabling artists to focus more on creative expression and artistic innovation. Despite its potential, challenges like visual consistency, stylistic coherence, and ethical considerations persist. Additionally, this paper explores future directions and advancements in AI-assisted animation. For further exploration and resources, please visit our GitHub repository: https://github.com/yunlong10/Awesome-AI4Animation

[273] Erasing More Than Intended? How Concept Erasure Degrades the Generation of Non-Target Concepts

Ibtihel Amara, Ahmed Imtiaz Humayun, Ivana Kajic, Zarana Parekh, Natalie Harris, Sarah Young, Chirag Nagpal, Najoung Kim, Junfeng He, Cristina Nader Vasconcelos, Deepak Ramachandran, Golnoosh Farnadi, Katherine Heller, Mohammad Havaei, Negar Rostamzadeh

Main category: cs.CV

TL;DR: Concept erasure in text-to-image models often fails to properly evaluate performance across diverse concept dimensions, leading to unintended suppression of non-target concepts through concept entanglement.

Details

Motivation: To address the gap in evaluating sanitized models and systematically analyze failure modes of text-to-image models after concept erasure, particularly focusing on unintended consequences on non-target concepts.

Method: Introduce EraseBench, a comprehensive benchmark with over 100 curated concepts, targeted evaluation prompts, and robust metrics to assess both effectiveness and side effects of concept erasure.

Result: Revealed concept entanglement phenomenon where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and decline in generation quality.

Conclusion: Current concept erasure techniques have significant limitations in real-world deployment due to unintended side effects on related concepts, highlighting the need for more comprehensive evaluation frameworks like EraseBench.

Abstract: Concept erasure techniques have recently gained significant attention for their potential to remove unwanted concepts from text-to-image models. While these methods often demonstrate promising results in controlled settings, their robustness in real-world applications and suitability for deployment remain uncertain. In this work, we (1) identify a critical gap in evaluating sanitized models, particularly in assessing their performance across diverse concept dimensions, and (2) systematically analyze the failure modes of text-to-image models post-erasure. We focus on the unintended consequences of concept removal on non-target concepts across different levels of interconnected relationships including visually similar, binomial, and semantically related concepts. To address this, we introduce EraseBench, a comprehensive benchmark for evaluating post-erasure performance. EraseBench includes over 100 curated concepts, targeted evaluation prompts, and a robust set of metrics to assess both effectiveness and side effects of erasure. Our findings reveal a phenomenon of concept entanglement, where erasure leads to unintended suppression of non-target concepts, causing spillover degradation that manifests as distortions and a decline in generation quality.

[274] GreedyPixel: Fine-Grained Black-Box Adversarial Attack Via Greedy Algorithm

Hanrui Wang, Ching-Chun Chang, Chun-Shien Lu, Christopher Leckie, Isao Echizen

Main category: cs.CV

TL;DR: GreedyPixel is a new black-box adversarial attack framework that combines surrogate-derived pixel priority with greedy per-pixel optimization to achieve query-and-transfer guidance, pixel-wise sparsity, and training-free direct optimization.

Details

Motivation: Existing black-box attacks fail to jointly achieve query-and-transfer guidance, pixel-wise sparsity, and training-free direct optimization, creating a gap between black-box flexibility and white-box precision.

Method: Combines surrogate-derived pixel priority map with greedy per-pixel optimization refined by query feedback, reducing exponential search space to linear procedure with guaranteed monotonic loss decrease.

Result: Achieves state-of-the-art attack success rates on CIFAR-10 and ImageNet under both white-box and black-box settings, producing visually imperceptible perturbations.

Conclusion: GreedyPixel bridges the precision gap between white-box and black-box attacks and provides a practical framework for fine-grained robustness evaluation.

Abstract: Deep neural networks are highly vulnerable to adversarial examples that inputs with small, carefully crafted perturbations that cause misclassification, making adversarial attacks an essential tool for robustness evaluation. Existing black-box attacks fall into three categories: query-only, transfer-only, and query-and-transfer, and vary in perturbation pattern and optimization strategy. However, no prior method jointly achieves query-and-transfer guidance, pixel-wise sparsity, and training-free direct optimization, leaving a gap between black-box flexibility and white-box precision. We present GreedyPixel, a new attack framework that fills this gap by combining a surrogate-derived pixel priority map with greedy, per-pixel optimization refined by query feedback. This design reduces the exponential brute-force search space to a tractable linear procedure, guarantees monotonic loss decrease and convergence to a coordinate-wise optimum, and concentrates perturbations on robust, semantically meaningful pixels to improve perceptual quality. Extensive experiments on CIFAR-10 and ImageNet under both white-box and black-box settings demonstrate that GreedyPixel achieves state-of-the-art attack success rates and produces visually imperceptible perturbations. Our results show that GreedyPixel bridges the precision gap between white-box and black-box attacks and provides a practical framework for fine-grained robustness evaluation. The implementation is available at https://github.com/azrealwang/greedypixel.

[275] Polyp-Gen: Realistic and Diverse Polyp Image Generation for Endoscopic Dataset Expansion

Shengyuan Liu, Zhen Chen, Qiushi Yang, Weihao Yu, Di Dong, Jiancong Hu, Yixuan Yuan

Main category: cs.CV

TL;DR: Polyp-Gen is a full-automatic diffusion-based framework for generating realistic and diverse endoscopic images to address data scarcity in automated diagnostic systems for polyp detection.

Details

Motivation: High annotation costs and privacy concerns make acquiring quality endoscopic images challenging. Existing generation methods fail to accurately capture polyp boundary details and require manual medical priors, limiting realism and diversity.

Method: Uses spatial-aware diffusion training with lesion-guided loss to enhance polyp boundary structure, and hierarchical retrieval-based sampling to capture medical priors for polyp localization without manual specification.

Result: Achieves state-of-the-art generation quality, improves downstream polyp detection performance, and demonstrates remarkable zero-shot generalizability across datasets.

Conclusion: Polyp-Gen provides an effective solution for generating realistic endoscopic images to build reliable automated diagnostic systems, addressing data scarcity while maintaining quality and diversity.

Abstract: Automated diagnostic systems (ADS) have shown significant potential in the early detection of polyps during endoscopic examinations, thereby reducing the incidence of colorectal cancer. However, due to high annotation costs and strict privacy concerns, acquiring high-quality endoscopic images poses a considerable challenge in the development of ADS. Despite recent advancements in generating synthetic images for dataset expansion, existing endoscopic image generation algorithms failed to accurately generate the details of polyp boundary regions and typically required medical priors to specify plausible locations and shapes of polyps, which limited the realism and diversity of the generated images. To address these limitations, we present Polyp-Gen, the first full-automatic diffusion-based endoscopic image generation framework. Specifically, we devise a spatial-aware diffusion training scheme with a lesion-guided loss to enhance the structural context of polyp boundary regions. Moreover, to capture medical priors for the localization of potential polyp areas, we introduce a hierarchical retrieval-based sampling strategy to match similar fine-grained spatial features. In this way, our Polyp-Gen can generate realistic and diverse endoscopic images for building reliable ADS. Extensive experiments demonstrate the state-of-the-art generation quality, and the synthetic images can improve the downstream polyp detection task. Additionally, our Polyp-Gen has shown remarkable zero-shot generalizability on other datasets. The source code is available at https://github.com/CUHK-AIM-Group/Polyp-Gen.

[276] AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

Junli Liu, Qizhi Chen, Zhigang Wang, Yiwen Tang, Yiting Zhang, Chi Yan, Dong Wang, Xuelong Li, Bin Zhao

Main category: cs.CV

TL;DR: AerialVG is a new visual grounding task for aerial imagery that requires spatial reasoning to distinguish visually similar objects, with a new dataset of 5K images and 50K descriptions, and a model using hierarchical cross-attention and relation-aware grounding.

Details

Motivation: Traditional visual grounding struggles with aerial imagery due to multiple visually similar objects and the need for positional reasoning, which existing models don't handle well with high-resolution aerial images.

Method: Proposed AerialVG dataset with 5K aerial images, 50K descriptions, and 103K objects with spatial relations. Developed model with Hierarchical Cross-Attention for target regions and Relation-Aware Grounding module for positional inference.

Result: Experimental results validate the effectiveness of the dataset and method, demonstrating the importance of spatial reasoning in aerial visual grounding.

Conclusion: AerialVG addresses unique challenges in aerial visual grounding through spatial reasoning, with promising results from the proposed dataset and model architecture.

Abstract: Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, \emph{e.g.}, appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. Particularly, each annotation in AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model especially for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.

[277] RGS-DR: Deferred Reflections and Residual Shading in 2D Gaussian Splatting

Georgios Kouros, Minye Wu, Tinne Tuytelaars

Main category: cs.CV

TL;DR: The paper proposes a refinement stage for inverse rendering using 2D Gaussian splatting with deferred shading to improve specular detail, bridging the gap with reconstruction-only methods.

Details

Motivation: To address the limitations of per-Gaussian shading with shortest-axis normals and normal residuals, which result in noisy geometry and specular appearance, the authors aim to achieve sharper highlights, cleaner materials, and improved editability.

Method: The pipeline uses a pixel-deferred surfel formulation with specular residuals, estimating editable material properties and environment illumination while employing a directional residual pass for capturing view-dependent effects to refine novel view synthesis.

Result: The approach is evaluated on three popular datasets featuring glossy objects, demonstrating improved rendering and reconstruction quality, as well as high-quality relighting and material editing capabilities.

Conclusion: The proposed method with deferred shading and refinement stage effectively enhances specular appearance in inverse rendering, providing cleaner materials and better editability compared to previous approaches.

Abstract: In this work, we address specular appearance in inverse rendering using 2D Gaussian splatting with deferred shading and argue for a refinement stage to improve specular detail, thereby bridging the gap with reconstruction-only methods. Our pipeline estimates editable material properties and environment illumination while employing a directional residual pass that captures leftover view-dependent effects for further refining novel view synthesis. In contrast to per-Gaussian shading with shortest-axis normals and normal residuals, which tends to result in more noisy geometry and specular appearance, a pixel-deferred surfel formulation with specular residuals yields sharper highlights, cleaner materials, and improved editability. We evaluate our approach on rendering and reconstruction quality on three popular datasets featuring glossy objects, and also demonstrate high-quality relighting and material editing.

[278] SubGrapher: Visual Fingerprinting of Chemical Structures

Lucas Morin, Gerhard Ingmar Meijer, Valéry Weber, Luc Van Gool, Peter W. J. Staar

Main category: cs.CV

TL;DR: SubGrapher is a method for visual fingerprinting of chemical structure images that extracts molecular fingerprints directly from images using instance segmentation, focusing on functional groups and carbon backbones rather than full molecular graph reconstruction.

Details

Motivation: Automatic extraction of chemical structures from scientific literature is crucial for accelerating research in drug discovery and materials science. Patent documents contain molecular information in visual form that is inaccessible through traditional text-based searches.

Method: SubGrapher uses learning-based instance segmentation to identify functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval, rather than attempting full molecular graph reconstruction like conventional OCSR models.

Result: SubGrapher demonstrates superior retrieval performance and robustness across diverse molecular depictions when evaluated against state-of-the-art OCSR and fingerprinting methods.

Conclusion: The approach provides an effective method for chemical structure retrieval from images, with the dataset, models, and code being publicly available.

Abstract: Automatic extraction of chemical structures from scientific literature plays a crucial role in accelerating research across fields ranging from drug discovery to materials science. Patent documents, in particular, contain molecular information in visual form, which is often inaccessible through traditional text-based searches. In this work, we introduce SubGrapher, a method for the visual fingerprinting of chemical structure images. Unlike conventional Optical Chemical Structure Recognition (OCSR) models that attempt to reconstruct full molecular graphs, SubGrapher focuses on extracting molecular fingerprints directly from chemical structure images. Using learning-based instance segmentation, SubGrapher identifies functional groups and carbon backbones, constructing a substructure-based fingerprint that enables chemical structure retrieval. Our approach is evaluated against state-of-the-art OCSR and fingerprinting methods, demonstrating superior retrieval performance and robustness across diverse molecular depictions. The dataset, models, and code are publicly available.

[279] Efficient Flow Matching using Latent Variables

Anirban Samaddar, Yixuan Sun, Viktor Nilsson, Sandeep Madireddy

Main category: cs.CV

TL;DR: Latent-CFM improves flow matching models by conditioning on features from pretrained latent variable models, enabling more efficient training and better generation quality across various datasets including images and physical fields.

Details

Motivation: Most flow matching models don't utilize the clustering structure in target data, leading to inefficient learning, especially for high-dimensional datasets that reside in low-dimensional manifolds.

Method: Condition flow matching on features extracted using pretrained deep latent variable models, adopting pretrained lightweight latent variable models for efficiency.

Result: Shows improved generation quality with significantly less training and computation than state-of-the-art flow matching models, generates more physically accurate samples for spatial fields, and enables conditional generation with interpretability.

Conclusion: Latent-CFM provides efficient training strategies that leverage pretrained latent representations, improving performance across diverse domains while adding interpretability through latent space conditioning.

Abstract: Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. To this end, we present $\texttt{Latent-CFM}$, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that $\texttt{Latent-CFM}$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2d Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.

[280] Generative Pre-trained Autoregressive Diffusion Transformer

Yuan Zhang, Jiacheng Jiang, Guoqing Ma, Zhiying Lu, Haoyang Huang, Jianlong Yuan, Nan Duan, Daxin Jiang

Main category: cs.CV

TL;DR: GPDiT is a Generative Pre-trained Autoregressive Diffusion Transformer that combines diffusion and autoregressive modeling for long-range video synthesis in continuous latent space, achieving strong performance in video generation and representation.

Details

Motivation: To unify the strengths of diffusion and autoregressive modeling for long-range video synthesis, enabling natural modeling of motion dynamics and semantic consistency across frames while enhancing generation quality and representation capabilities.

Method: Autoregressively predicts future latent frames using diffusion loss instead of discrete tokens, with a lightweight causal attention variant and parameter-free rotation-based time-conditioning mechanism for improved efficiency.

Result: Achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, demonstrating effectiveness as a video modeling framework in continuous space.

Conclusion: GPDiT presents an effective framework for video modeling in continuous space that successfully combines diffusion and autoregressive approaches, offering improved generation quality, representation capabilities, and efficiency.

Abstract: In this work, we present GPDiT, a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis, within a continuous latent space. Instead of predicting discrete tokens, GPDiT autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics and semantic consistency across frames. This continuous autoregressive framework not only enhances generation quality but also endows the model with representation capabilities. Additionally, we introduce a lightweight causal attention variant and a parameter-free rotation-based time-conditioning mechanism, improving both the training and inference efficiency. Extensive experiments demonstrate that GPDiT achieves strong performance in video generation quality, video representation ability, and few-shot learning tasks, highlighting its potential as an effective framework for video modeling in continuous space.

[281] PolyPose: Deformable 2D/3D Registration via Polyrigid Transformations

Vivek Gopalakrishnan, Neel Dey, Polina Golland

Main category: cs.CV

TL;DR: PolyPose is a deformable 2D/3D registration method that uses polyrigid transforms to align preoperative 3D volumes (CT/MRI) to intraoperative 2D X-rays, enabling 3D guidance with minimal imaging.

Details

Motivation: To provide 3D volumetric guidance during interventional procedures where only 2D X-rays are available, overcoming the limitation that CT/MRI cannot be acquired during procedures.

Method: Parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion (piecewise-rigid nature of human movement).

Result: Successfully aligns preoperative volumes to as few as two X-rays in challenging sparse-view and limited-angle settings where current methods fail, without needing expensive deformation regularizers.

Conclusion: PolyPose provides robust 3D guidance in interventional settings by enforcing anatomically plausible priors through polyrigid formulation, enabling effective 2D/3D registration with minimal imaging requirements.

Abstract: Determining the 3D pose of a patient from a limited set of 2D X-ray images is a critical task in interventional settings. While preoperative volumetric imaging (e.g., CT and MRI) provides precise 3D localization and visualization of anatomical targets, these modalities cannot be acquired during procedures, where fast 2D imaging (X-ray) is used instead. To integrate volumetric guidance into intraoperative procedures, we present PolyPose, a simple and robust method for deformable 2D/3D registration. PolyPose parameterizes complex 3D deformation fields as a composition of rigid transforms, leveraging the biological constraint that individual bones do not bend in typical motion. Unlike existing methods that either assume no inter-joint movement or fail outright in this under-determined setting, our polyrigid formulation enforces anatomically plausible priors that respect the piecewise-rigid nature of human movement. This approach eliminates the need for expensive deformation regularizers that require patient- and procedure-specific hyperparameter optimization. Across extensive experiments on diverse datasets from orthopedic surgery and radiotherapy, we show that this strong inductive bias enables PolyPose to successfully align the patient’s preoperative volume to as few as two X-rays, thereby providing crucial 3D guidance in challenging sparse-view and limited-angle settings where current registration methods fail. Additional visualizations, tutorials, and code are available at https://polypose.csail.mit.edu.

[282] MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness

Yolo Yunlong Tang, Pinxin Liu, Mingqian Feng, Zhangyun Tan, Rui Mao, Chao Huang, Jing Bi, Yunzhong Xiao, Susan Liang, Hang Hua, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Chenliang Xu

Main category: cs.CV

TL;DR: MMPerspective is the first benchmark to systematically evaluate multimodal large language models’ understanding of perspective geometry through 10 tasks across perception, reasoning, and robustness dimensions.

Details

Motivation: To understand the extent to which multimodal large language models internalize perspective geometry, which is fundamental to human visual perception but remains unclear in current models.

Method: Created a benchmark with 2,711 real-world and synthetic images and 5,083 question-answer pairs across 10 tasks in three dimensions: Perspective Perception, Reasoning, and Robustness. Evaluated 43 state-of-the-art MLLMs on capabilities like vanishing point perception, perspective type reasoning, and line relationship understanding.

Result: Models show significant limitations - competent on surface-level perceptual tasks but struggle with compositional reasoning and maintaining spatial consistency under perturbations. Revealed patterns between model architecture, scale, and perspective capabilities, with benefits from chain-of-thought prompting.

Conclusion: MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems, highlighting current bottlenecks in perspective geometry comprehension.

Abstract: Understanding perspective is fundamental to human visual perception, yet the extent to which multimodal large language models (MLLMs) internalize perspective geometry remains unclear. We introduce MMPerspective, the first benchmark specifically designed to systematically evaluate MLLMs’ understanding of perspective through 10 carefully crafted tasks across three complementary dimensions: Perspective Perception, Reasoning, and Robustness. Our benchmark comprises 2,711 real-world and synthetic image instances with 5,083 question-answer pairs that probe key capabilities, such as vanishing point perception and counting, perspective type reasoning, line relationship understanding in 3D space, invariance to perspective-preserving transformations, etc. Through a comprehensive evaluation of 43 state-of-the-art MLLMs, we uncover significant limitations: while models demonstrate competence on surface-level perceptual tasks, they struggle with compositional reasoning and maintaining spatial consistency under perturbations. Our analysis further reveals intriguing patterns between model architecture, scale, and perspective capabilities, highlighting both robustness bottlenecks and the benefits of chain-of-thought prompting. MMPerspective establishes a valuable testbed for diagnosing and advancing spatial understanding in vision-language systems. Resources available at: https://yunlong10.github.io/MMPerspective/

[283] Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri

Main category: cs.CV

TL;DR: Vision-language models struggle with out-of-distribution classes and modalities. The paper introduces Roboflow100-VL, a benchmark of 100 multi-modal object detection datasets with diverse concepts, and finds VLMs achieve <2% accuracy on medical imaging, highlighting the need for few-shot concept alignment.

Details

Motivation: State-of-the-art VLMs fail to generalize to out-of-distribution classes, tasks, and imaging modalities not found in their pre-training data, requiring better alignment methods rather than just more training data.

Method: Created Roboflow100-VL benchmark with 100 multi-modal object detection datasets containing diverse concepts. Evaluated VLMs in zero-shot, few-shot, semi-supervised, and fully-supervised settings. Organized CVPR 2025 competition for few-shot object detection.

Result: VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets. The winning competition team outperformed baseline by 17 mAP.

Conclusion: Few-shot concept alignment is crucial for VLMs to handle out-of-distribution domains. The benchmark enables systematic evaluation across data regimes, and community competition shows significant performance improvements are possible.

Abstract: Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.

[284] MetaSlot: Break Through the Fixed Number of Slots in Object-Centric Learning

Hongjia Liu, Rongzhen Zhao, Haohan Chen, Joni Pajarinen

Main category: cs.CV

TL;DR: MetaSlot is a plug-and-play Slot Attention variant that adapts to variable object counts by maintaining a codebook of object prototypes, removing duplicate slots through quantization, and injecting progressive noise to stabilize aggregation.

Details

Motivation: Existing Object-Centric Learning methods rely on a static slot count, causing objects to be represented as multiple parts when object counts vary. This limitation hinders proper object-level representation learning.

Method: MetaSlot maintains a codebook of object prototypes through vector quantization of slot representations, removes duplicate slots by quantizing them with the codebook, and injects progressively weaker noise during Slot Attention iterations to accelerate and stabilize aggregation.

Result: Models equipped with MetaSlot achieve significant performance gains across multiple public datasets and tasks (object discovery and recognition), producing markedly interpretable slot representations compared to existing Slot Attention variants.

Conclusion: MetaSlot is a general Slot Attention variant that can be seamlessly integrated into existing OCL architectures, providing adaptive object representation that handles variable object counts effectively.

Abstract: Learning object-level, structured representations is widely regarded as a key to better generalization in vision and underpins the design of next-generation Pre-trained Vision Models (PVMs). Mainstream Object-Centric Learning (OCL) methods adopt Slot Attention or its variants to iteratively aggregate objects’ super-pixels into a fixed set of query feature vectors, termed slots. However, their reliance on a static slot count leads to an object being represented as multiple parts when the number of objects varies. We introduce MetaSlot, a plug-and-play Slot Attention variant that adapts to variable object counts. MetaSlot (i) maintains a codebook that holds prototypes of objects in a dataset by vector-quantizing the resulting slot representations; (ii) removes duplicate slots from the traditionally aggregated slots by quantizing them with the codebook; and (iii) injects progressively weaker noise into the Slot Attention iterations to accelerate and stabilize the aggregation. MetaSlot is a general Slot Attention variant that can be seamlessly integrated into existing OCL architectures. Across multiple public datasets and tasks–including object discovery and recognition–models equipped with MetaSlot achieve significant performance gains and markedly interpretable slot representations, compared with existing Slot Attention variants.

[285] Fully Spiking Neural Networks for Unified Frame-Event Object Tracking

Jingjun Yang, Liangwei Fan, Jinpu Zhang, Xiangkai Lian, Hui Shen, Dewen Hu

Main category: cs.CV

TL;DR: SpikeFET is the first fully spiking frame-event tracking framework that integrates convolutional and Transformer architectures for efficient visual object tracking, achieving high accuracy with low power consumption.

Details

Motivation: Current fusion methods for image and event streams suffer from high computational overhead and inefficient extraction of sparse, asynchronous event data, failing to leverage the energy efficiency of spiking paradigms.

Method: Proposes SpikeFET with Random Patchwork Module (RPM) to eliminate positional bias from convolutional padding, and Spatial-Temporal Regularization (STR) to overcome similarity metric degradation from asymmetric features.

Result: Extensive experiments show superior tracking accuracy over existing methods while significantly reducing power consumption, achieving optimal performance-efficiency balance.

Conclusion: SpikeFET successfully demonstrates synergistic integration of frame and event data within spiking paradigm, offering robust tracking with energy efficiency.

Abstract: The integration of image and event streams offers a promising approach for achieving robust visual object tracking in complex environments. However, current fusion methods achieve high performance at the cost of significant computational overhead and struggle to efficiently extract the sparse, asynchronous information from event streams, failing to leverage the energy-efficient advantages of event-driven spiking paradigms. To address this challenge, we propose the first fully Spiking Frame-Event Tracking framework called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm, effectively fusing frame and event data. To overcome the degradation of translation invariance caused by convolutional padding, we introduce a Random Patchwork Module (RPM) that eliminates positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. Furthermore, we propose a Spatial-Temporal Regularization (STR) strategy that overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space. Extensive experiments across multiple benchmarks demonstrate that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption, attaining an optimal balance between performance and efficiency.

[286] Robust Neural Rendering in the Wild with Asymmetric Dual 3D Gaussian Splatting

Chengqi Li, Zhihao Shi, Yangdi Lu, Wenbo He, Xiangyu Xu

Main category: cs.CV

TL;DR: A novel framework called AsymGS that improves 3D reconstruction from in-the-wild images by training two 3D Gaussian Splatting models in parallel with consistency constraints and divergent masking to suppress artifacts.

Details

Motivation: Existing 3D reconstruction methods struggle with inconsistent lighting and transient distractors in in-the-wild images, leading to visual artifacts and unstable reconstructions.

Method: Trains two 3DGS models in parallel with consistency constraints, uses divergent masking (multi-cue adaptive mask and self-supervised soft mask) to prevent confirmation bias, and includes a lightweight variant with Dynamic EMA Proxy for efficiency.

Result: Extensive experiments show the method consistently outperforms existing approaches while achieving high efficiency on challenging real-world datasets.

Conclusion: The proposed framework effectively suppresses artifacts and produces stable 3D reconstructions from in-the-wild images through asymmetric training and consistency constraints.

Abstract: 3D reconstruction from in-the-wild images remains a challenging task due to inconsistent lighting conditions and transient distractors. Existing methods typically rely on heuristic strategies to handle the low-quality training data, which often struggle to produce stable and consistent reconstructions, frequently resulting in visual artifacts. In this work, we propose \modelname{}, a novel framework that leverages the stochastic nature of these artifacts: they tend to vary across different training runs due to minor randomness. Specifically, our method trains two 3D Gaussian Splatting (3DGS) models in parallel, enforcing a consistency constraint that encourages convergence on reliable scene geometry while suppressing inconsistent artifacts. To prevent the two models from collapsing into similar failure modes due to confirmation bias, we introduce a divergent masking strategy that applies two complementary masks: a multi-cue adaptive mask and a self-supervised soft mask, which leads to an asymmetric training process of the two models, reducing shared error modes. In addition, to improve the efficiency of model training, we introduce a lightweight variant called Dynamic EMA Proxy, which replaces one of the two models with a dynamically updated Exponential Moving Average (EMA) proxy, and employs an alternating masking strategy to preserve divergence. Extensive experiments on challenging real-world datasets demonstrate that our method consistently outperforms existing approaches while achieving high efficiency. See the project website at https://steveli88.github.io/AsymGS.

[287] Uncertainty-Aware Remaining Lifespan Prediction from Images

Tristan Kenneweg, Philip Kenneweg, Barbara Hammer

Main category: cs.CV

TL;DR: A method using pretrained vision transformers to predict remaining lifespan from facial and whole-body images with calibrated uncertainty quantification, achieving state-of-the-art accuracy.

Details

Motivation: To enable accessible, noninvasive, and scalable health screening by predicting mortality-related outcomes from images.

Method: Leverages pretrained vision transformer foundation models to estimate remaining lifespan from facial and whole-body images, with robust uncertainty quantification using Gaussian distributions for each sample.

Result: Achieved state-of-the-art MAE of 7.41 years on established dataset, and 4.91-4.99 years MAE on two new datasets. Provides calibrated uncertainty estimates with bucketed expected calibration error of 0.82 years.

Conclusion: The approach demonstrates potential for extracting medically relevant signals from images, though not intended for clinical deployment. All code and datasets are made available for further research.

Abstract: Predicting mortality-related outcomes from images offers the prospect of accessible, noninvasive, and scalable health screening. We present a method that leverages pretrained vision transformer foundation models to estimate remaining lifespan from facial and whole-body images, alongside robust uncertainty quantification. We show that predictive uncertainty varies systematically with the true remaining lifespan, and that this uncertainty can be effectively modeled by learning a Gaussian distribution for each sample. Our approach achieves state-of-the-art mean absolute error (MAE) of 7.41 years on an established dataset, and further achieves 4.91 and 4.99 years MAE on two new, higher-quality datasets curated and published in this work. Importantly, our models provide calibrated uncertainty estimates, as demonstrated by a bucketed expected calibration error of 0.82 years on the Faces Dataset. While not intended for clinical deployment, these results highlight the potential of extracting medically relevant signals from images. We make all code and datasets available to facilitate further research.

Yeongtak Oh, Dohyun Chung, Juhyeon Shin, Sangha Park, Johan Barthelemy, Jisoo Mok, Sungroh Yoon

Main category: cs.CV

TL;DR: The paper proposes a reinforcement learning-based post-training framework to improve personalized image captioning in multi-modal large language models, addressing limitations of supervised fine-tuning methods.

Details

Motivation: Existing MLLMs struggle with personalized image captioning, especially in complex scenarios like multi-concept captioning. Supervised fine-tuning methods fail to produce faithful descriptions despite large-scale training data, and acquiring high-quality captions for complex settings is costly and difficult.

Method: The authors propose a reinforcement learning-based post-training framework for MLLMs, which is the first RL-based approach for personalized image captioning. This addresses the data-centric limitations of supervised fine-tuning methods.

Result: The proposed method significantly enhances both visual recognition and personalized generation capabilities of MLLMs. It consistently outperforms existing SFT-based baselines, particularly in the challenging multi-concept image captioning task.

Conclusion: RL-based post-training is an effective alternative to SFT for improving personalized image captioning in MLLMs, especially for complex scenarios where high-quality training data is scarce.

Abstract: Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address the data-centric nature of SFT, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially in the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC

[289] WAFT: Warping-Alone Field Transforms for Optical Flow

Yihan Wang, Jia Deng

Main category: cs.CV

TL;DR: WAFT is a simple optical flow method that replaces cost volumes with high-resolution warping, achieving state-of-the-art performance with lower memory and faster speed.

Details

Motivation: To challenge the conventional wisdom that cost volumes are necessary for strong optical flow performance, and to create a simpler, more efficient method.

Method: Uses high-resolution warping instead of cost volumes, similar to RAFT but with this key modification. It’s a flexible meta-architecture with minimal inductive biases.

Result: Ranks 1st on Spring, Sintel, and KITTI benchmarks; achieves best zero-shot generalization on KITTI; 4.1x faster than comparable methods with lower memory cost.

Conclusion: WAFT demonstrates that cost volumes are not essential for state-of-the-art optical flow performance, offering a simpler and more efficient alternative.

Abstract: We introduce Warping-Alone Field Transforms (WAFT), a simple and effective method for optical flow. WAFT is similar to RAFT but replaces cost volume with high-resolution warping, achieving better accuracy with lower memory cost. This design challenges the conventional wisdom that constructing cost volumes is necessary for strong performance. WAFT is a simple and flexible meta-architecture with minimal inductive biases and reliance on custom designs. Compared with existing methods, WAFT ranks 1st on Spring, Sintel, and KITTI benchmarks, achieves the best zero-shot generalization on KITTI, while being up to 4.1x faster than methods with similar performance. Code and model weights are available at https://github.com/princeton-vl/WAFT.

[290] Color Bind: Exploring Color Perception in Text-to-Image Models

Shay Shomer Chai, Wenxuan Peng, Bharath Hariharan, Hadar Averbuch-Elor

Main category: cs.CV

TL;DR: This paper addresses semantic misalignment in text-to-image generation, particularly for multi-object prompts with multiple colors. It introduces a dedicated image editing technique that significantly improves color attribute alignment across various diffusion-based methods.

Details

Motivation: Current text-to-image models struggle with capturing precise semantics in complex multi-object prompts, especially with multiple color attributes. Existing methods use coarse metrics like CLIP similarity or human evaluations, which are insufficient for rigorous analysis of semantic alignment issues.

Method: The authors perform a case study on colors as a fundamental attribute, revealing that pretrained models fail with multi-color prompts. They introduce a dedicated image editing technique specifically designed to address multi-object semantic alignment for prompts containing multiple colors.

Result: The proposed approach significantly boosts performance over a wide range of metrics compared to existing inference-time techniques and editing methods. It demonstrates improved color attribute alignment across various text-to-image diffusion-based techniques.

Conclusion: The study shows that current models have significant limitations in handling multi-color prompts, and the proposed editing technique effectively mitigates semantic misalignment issues, providing a more reliable solution for complex multi-object text-to-image generation.

Abstract: Text-to-image generation has recently seen remarkable success, granting users with the ability to create high-quality images through the use of text. However, contemporary methods face challenges in capturing the precise semantics conveyed by complex multi-object prompts. Consequently, many works have sought to mitigate such semantic misalignments, typically via inference-time schemes that modify the attention layers of the denoising networks. However, prior work has mostly utilized coarse metrics, such as the cosine similarity between text and image CLIP embeddings, or human evaluations, which are challenging to conduct on a larger-scale. In this work, we perform a case study on colors – a fundamental attribute commonly associated with objects in text prompts, which offer a rich test bed for rigorous evaluation. Our analysis reveals that pretrained models struggle to generate images that faithfully reflect multiple color attributes-far more so than with single-color prompts-and that neither inference-time techniques nor existing editing methods reliably resolve these semantic misalignments. Accordingly, we introduce a dedicated image editing technique, mitigating the issue of multi-object semantic alignment for prompts containing multiple colors. We demonstrate that our approach significantly boosts performance over a wide range of metrics, considering images generated by various text-to-image diffusion-based techniques.

[291] RespoDiff: Dual-Module Bottleneck Transformation for Responsible & Faithful T2I Generation

Silpa Vadakkeeveetil Sreelatha, Sauradip Nag, Muhammad Awais, Serge Belongie, Anjan Dutta

Main category: cs.CV

TL;DR: RespoDiff is a framework for responsible text-to-image generation that uses dual-module transformation on diffusion model representations to improve fairness and safety while maintaining semantic alignment and image quality.

Details

Motivation: Existing methods for improving fairness and safety in text-to-image generation typically compromise semantic fidelity and image quality, creating a need for a balanced approach.

Method: Introduces two learnable modules: one for responsible concepts (fairness/safety) and another for semantic alignment with neutral prompts, coordinated through a novel score-matching objective on diffusion model bottleneck representations.

Result: Outperforms state-of-the-art methods, improves responsible and semantically coherent generation by 20% across diverse prompts, and integrates seamlessly with large-scale models like SDXL.

Conclusion: RespoDiff successfully addresses the fairness-safety vs. semantic fidelity trade-off in text-to-image generation through its dual-module approach and novel coordination mechanism.

Abstract: The rapid advancement of diffusion models has enabled high-fidelity and semantically rich text-to-image generation; however, ensuring fairness and safety remains an open challenge. Existing methods typically improve fairness and safety at the expense of semantic fidelity and image quality. In this work, we propose RespoDiff, a novel framework for responsible text-to-image generation that incorporates a dual-module transformation on the intermediate bottleneck representations of diffusion models. Our approach introduces two distinct learnable modules: one focused on capturing and enforcing responsible concepts, such as fairness and safety, and the other dedicated to maintaining semantic alignment with neutral prompts. To facilitate the dual learning process, we introduce a novel score-matching objective that enables effective coordination between the modules. Our method outperforms state-of-the-art methods in responsible generation by ensuring semantic alignment while optimizing both objectives without compromising image fidelity. Our approach improves responsible and semantically coherent generation by 20% across diverse, unseen prompts. Moreover, it integrates seamlessly into large-scale models like SDXL, enhancing fairness and safety. Code will be released upon acceptance.

[292] Gaze Estimation for Human-Robot Interaction: Analysis Using the NICO Platform

Matej Palider, Omar Eldardeer, Viktor Kocur

Main category: cs.CV

TL;DR: Evaluation of gaze estimation methods in HRI shared workspace scenarios reveals practical limitations despite competitive angular errors, with best median error of 16.48cm in workspace distance.

Details

Motivation: To assess the practical performance of current gaze estimation methods in real-world Human-Robot Interaction (HRI) scenarios, specifically shared workspace contexts where accurate gaze estimation is crucial for effective interaction.

Method: Introduced a new annotated dataset collected using the NICO robotic platform and evaluated four state-of-the-art gaze estimation models in a shared workspace HRI scenario.

Result: While angular errors were comparable to general-purpose benchmarks, when converted to workspace distance measurements, the best median error was 16.48cm, highlighting significant practical limitations for HRI applications.

Conclusion: Current gaze estimation methods have substantial practical limitations in HRI contexts, and recommendations are provided for better integration of gaze estimation as a modality in HRI systems.

Abstract: This paper evaluates the current gaze estimation methods within an HRI context of a shared workspace scenario. We introduce a new, annotated dataset collected with the NICO robotic platform. We evaluate four state-of-the-art gaze estimation models. The evaluation shows that the angular errors are close to those reported on general-purpose benchmarks. However, when expressed in terms of distance in the shared workspace the best median error is 16.48 cm quantifying the practical limitations of current methods. We conclude by discussing these limitations and offering recommendations on how to best integrate gaze estimation as a modality in HRI systems.

[293] HBSplat: Robust Sparse-View Gaussian Reconstruction with Hybrid-Loss Guided Depth and Bidirectional Warping

Yu Ma, Guoliang Wei, Haihong Xiao, Yue Cheng

Main category: cs.CV

TL;DR: HBSplat improves 3D Gaussian Splatting for sparse view synthesis by integrating structural cues, virtual view constraints, and occluded region completion to overcome overfitting and geometric distortion.

Details

Motivation: 3D Gaussian Splatting performs poorly with sparse inputs, suffering from floating artifacts and structural failures, which HBSplat aims to address.

Method: Uses hybrid-loss depth estimation, bidirectional warping for virtual view synthesis, and occlusion-aware reconstruction with depth-difference masking and inpainting.

Result: Achieves state-of-the-art performance with up to 21.13 dB PSNR and 0.189 LPIPS on LLFF, Blender, and DTU benchmarks while maintaining real-time inference.

Conclusion: HBSplat successfully enhances 3DGS for sparse view synthesis through robust structural integration and virtual view constraints, setting new benchmarks in quality and efficiency.

Abstract: Novel View Synthesis (NVS) from sparse views presents a formidable challenge in 3D reconstruction, where limited multi-view constraints lead to severe overfitting, geometric distortion, and fragmented scenes. While 3D Gaussian Splatting (3DGS) delivers real-time, high-fidelity rendering, its performance drastically deteriorates under sparse inputs, plagued by floating artifacts and structural failures. To address these challenges, we introduce HBSplat, a unified framework that elevates 3DGS by seamlessly integrating robust structural cues, virtual view constraints, and occluded region completion. Our core contributions are threefold: a Hybrid-Loss Depth Estimation module that ensures multi-view consistency by leveraging dense matching priors and integrating reprojection, point propagation, and smoothness constraints; a Bidirectional Warping Virtual View Synthesis method that enforces substantially stronger constraints by creating high-fidelity virtual views through bidirectional depth-image warping and multi-view fusion; and an Occlusion-Aware Reconstruction component that recovers occluded areas using a depth-difference mask and a learning-based inpainting model. Extensive evaluations on LLFF, Blender, and DTU benchmarks validate that HBSplat sets a new state-of-the-art, achieving up to 21.13 dB PSNR and 0.189 LPIPS, while maintaining real-time inference. Code is available at: https://github.com/eternalland/HBSplat.

[294] VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning

Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin

Main category: cs.CV

TL;DR: VT-FSL is a novel few-shot learning framework that bridges vision and text using LLMs to generate precise class descriptions and synthetic images, achieving SOTA performance across diverse benchmarks.

Details

Motivation: Address limitations of existing FSL methods that suffer from hallucinated semantics due to lack of grounding in actual instances, resulting in noisy guidance and costly corrections.

Method: Proposes Cross-modal Iterative Prompting (CIP) to generate precise class descriptions using LLMs conditioned on class names and support images, and Cross-modal Geometric Alignment (CGA) to align textual, support, and synthetic visual representations through kernelized volume minimization.

Result: Establishes new state-of-the-art performance across ten diverse benchmarks including standard, cross-domain, and fine-grained few-shot learning scenarios.

Conclusion: VT-FSL effectively bridges vision and text modalities through LLM-based prompting and geometric alignment, providing comprehensive semantic understanding and intra-class diversity for improved few-shot learning.

Abstract: Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.

[295] VGGT-X: When VGGT Meets Dense Novel View Synthesis

Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, Zhaoxiang Zhang

Main category: cs.CV

TL;DR: VGGT-X addresses VRAM and output quality issues when scaling 3D Foundation Models for dense Novel View Synthesis, achieving state-of-the-art COLMAP-free results through memory-efficient implementation and robust training.

Details

Motivation: Current NVS approaches rely on slow and fragile SfM pipelines for 3D attributes. 3DFMs offer speed advantages but face VRAM and quality issues when scaled to dense views.

Method: VGGT-X includes memory-efficient VGGT implementation scaling to 1000+ images, adaptive global alignment for output enhancement, and robust 3DGS training practices.

Result: Substantially closes fidelity gap with COLMAP-initialized pipelines, achieving SOTA results in dense COLMAP-free NVS and pose estimation.

Conclusion: The approach provides insights for future development of 3D foundation models and dense NVS, with remaining gaps analyzed for further improvement.

Abstract: We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/

Teng Zhang, Ziqian Fan, Mingxin Liu, Xin Zhang, Xudong Lu, Wentong Li, Yue Zhou, Yi Yu, Xiang Li, Junchi Yan, Xue Yang

Main category: cs.CV

TL;DR: Point2RBox-v3 is a weakly-supervised oriented object detection method that uses point annotations to address inefficient pseudo label utilization and poor quality in existing methods through progressive label assignment and prior-guided dynamic mask loss.

Details

Motivation: To reduce the cost and labor of manual labeling for oriented object detection by learning from point annotations, while addressing deficiencies in existing point-supervised methods: inefficient pseudo label utilization and poor quality.

Method: 1) Progressive Label Assignment (PLA): dynamically estimates instance sizes at different training stages to enable label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss): enhances Voronoi Watershed Loss by combining SAM model advantages with watershed algorithm to handle both sparse and dense scenes.

Result: Achieves competitive performance: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR datasets, especially effective in scenarios with large object size variations or sparse object occurrences.

Conclusion: Point2RBox-v3 is the first model to use dynamic pseudo labels for label assignment and creatively combines SAM model with watershed algorithm, achieving excellent performance across both sparse and dense scenes.

Abstract: Driven by the growing need for Oriented Object Detection (OOD), learning from point annotations under a weakly-supervised framework has emerged as a promising alternative to costly and laborious manual labeling. In this paper, we discuss two deficiencies in existing point-supervised methods: inefficient utilization and poor quality of pseudo labels. Therefore, we present Point2RBox-v3. At the core are two principles: 1) Progressive Label Assignment (PLA). It dynamically estimates instance sizes in a coarse yet intelligent manner at different stages of the training process, enabling the use of label assignment methods. 2) Prior-Guided Dynamic Mask Loss (PGDM-Loss). It is an enhancement of the Voronoi Watershed Loss from Point2RBox-v2, which overcomes the shortcomings of Watershed in its poor performance in sparse scenes and SAM’s poor performance in dense scenes. To our knowledge, Point2RBox-v3 is the first model to employ dynamic pseudo labels for label assignment, and it creatively complements the advantages of SAM model with the watershed algorithm, which achieves excellent performance in both sparse and dense scenes. Our solution gives competitive performance, especially in scenarios with large variations in object size or sparse object occurrences: 66.09%/56.86%/41.28%/46.40%/19.60%/45.96% on DOTA-v1.0/DOTA-v1.5/DOTA-v2.0/DIOR/STAR/RSAR.

[297] Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning

Yaxin Hou, Bo Han, Yuheng Jia, Hui Liu, Junhui Hou

Main category: cs.CV

TL;DR: CPG is a framework for long-tailed semi-supervised learning that handles unknown unlabeled data distributions by dynamically generating pseudo-labels and maintaining a known labeled data distribution through controllable filtering.

Details

Motivation: Current methods assume unlabeled data follows predefined distributions, but in reality, unlabeled data distribution is generally unknown and arbitrary, creating a significant challenge for long-tailed semi-supervised learning.

Method: CPG uses a controllable self-reinforcing optimization cycle: (1) dynamic controllable filtering to selectively add reliable pseudo-labels while maintaining known distribution, (2) Bayes-optimal classifier construction using logit adjustment, (3) improved classifier helps identify more pseudo-labels. Also includes class-aware adaptive augmentation and auxiliary branch for data utilization.

Result: Comprehensive evaluations show CPG achieves consistent improvements, surpassing state-of-the-art methods by up to 15.97% in accuracy across various benchmark datasets.

Conclusion: CPG effectively handles unknown unlabeled data distributions in long-tailed semi-supervised learning through its controllable pseudo-label generation framework and optimization cycle, with theoretical guarantees on generalization error reduction.

Abstract: Current long-tailed semi-supervised learning methods assume that labeled data exhibit a long-tailed distribution, and unlabeled data adhere to a typical predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed). However, the distribution of the unlabeled data is generally unknown and may follow an arbitrary distribution. To tackle this challenge, we propose a Controllable Pseudo-label Generation (CPG) framework, expanding the labeled dataset with the progressively identified reliable pseudo-labels from the unlabeled dataset and training the model on the updated labeled dataset with a known distribution, making it unaffected by the unlabeled data distribution. Specifically, CPG operates through a controllable self-reinforcing optimization cycle: (i) at each training step, our dynamic controllable filtering mechanism selectively incorporates reliable pseudo-labels from the unlabeled dataset into the labeled dataset, ensuring that the updated labeled dataset follows a known distribution; (ii) we then construct a Bayes-optimal classifier using logit adjustment based on the updated labeled data distribution; (iii) this improved classifier subsequently helps identify more reliable pseudo-labels in the next training step. We further theoretically prove that this optimization cycle can significantly reduce the generalization error under some conditions. Additionally, we propose a class-aware adaptive augmentation module to further improve the representation of minority classes, and an auxiliary branch to maximize data utilization by leveraging all labeled and unlabeled samples. Comprehensive evaluations on various commonly used benchmark datasets show that CPG achieves consistent improvements, surpassing state-of-the-art methods by up to $\textbf{15.97%}$ in accuracy. The code is available at https://github.com/yaxinhou/CPG.

[298] Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning

Chendong Wang, Donglin Bai, Yifan Yang, Xiao Jin, Anlan Zhang, Rui Wang, Shiqi Jiang, Yuqing Yang, Hao Wu, Qi Dai, Chong Luo, Ting Cao, Lili Qiu, Suman Banerjee

Main category: cs.CV

TL;DR: Video-in-the-Loop (ViTL) is a two-stage framework for long-video QA that localizes relevant intervals with low-fps skimming and answers via span-aware token reallocation at higher effective frame rates, achieving better performance with fewer frames.

Details

Motivation: To address the computational challenges of processing long videos in QA systems while maintaining performance and providing interpretable outputs with direct attribution.

Method: Two-stage approach: 1) Localize question-relevant intervals using low-fps skim, 2) Answer via span-aware reallocation of visual tokens at higher effective frame rate. Uses interleaved group-relative objective that couples temporal IoU for localization with answer correctness.

Result: Achieves up to 8.6% improvement with 50% less frame input on long-video QA and temporal grounding tasks (Charades-STA, ActivityNet-Captions). Span-aware token reallocation consistently outperforms uniform sampling.

Conclusion: ViTL provides an interpretable, compute-efficient solution for scalable long-video QA, combining localization and answering in an end-to-end trainable framework with direct attribution capabilities.

Abstract: We present \emph{Video-in-the-Loop} (ViTL), a two-stage long-video QA framework that preserves a fixed token budget by first \emph{localizing} question-relevant interval(s) with a low-fps skim and then \emph{answering} via span-aware reallocation of visual tokens at higher effective frame rate, emitting an interleaved output with both spans and the final option for direct attribution. We also introduce \dataname{}, which converts description based event graphs into \emph{span-grounded} multiple-choice QA by pairing each question with \emph{ground-truth} time span(s) and related reasoning. ViTL is trained end-to-end with an interleaved group-relative objective that couples temporal IoU for localization with answer correctness, allowing credit to flow from answers back to spans without increasing compute. Under fixed token budgets, ViTL attains up to 8.6% with 50% less frame input on long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions) and ablations show that span-aware token reallocation consistently surpasses uniform sampling. Together, \dataname{} and ViTL provide an interpretable, compute-efficient recipe for scalable long-video QA.

[299] Progressive Gaussian Transformer with Anisotropy-aware Sampling for Open Vocabulary Occupancy Prediction

Chi Yan, Dan Xu

Main category: cs.CV

TL;DR: PG-Occ is a Progressive Gaussian Transformer Framework for open-vocabulary 3D occupancy prediction that addresses the trade-off between sparse Gaussian representation (missing small objects) and dense representation (high computational cost) through progressive online densification and anisotropy-aware sampling.

Details

Motivation: Traditional 3D occupancy prediction methods are limited to fixed semantic categories, while recent text-aligned approaches face a trade-off: sparse Gaussian representation struggles with small objects and dense representation has high computational overhead.

Method: Uses progressive online densification to gradually enhance 3D Gaussian representation, and introduces anisotropy-aware sampling with spatio-temporal fusion that adaptively assigns receptive fields to Gaussians at different scales and stages.

Result: Achieves state-of-the-art performance with a relative 14.3% mIoU improvement over previous best methods.

Conclusion: PG-Occ successfully enables open-vocabulary 3D occupancy prediction by balancing computational efficiency and fine-grained scene detail capture through progressive densification and adaptive feature aggregation.

Abstract: The 3D occupancy prediction task has witnessed remarkable progress in recent years, playing a crucial role in vision-based autonomous driving systems. While traditional methods are limited to fixed semantic categories, recent approaches have moved towards predicting text-aligned features to enable open-vocabulary text queries in real-world scenes. However, there exists a trade-off in text-aligned scene modeling: sparse Gaussian representation struggles to capture small objects in the scene, while dense representation incurs significant computational overhead. To address these limitations, we present PG-Occ, an innovative Progressive Gaussian Transformer Framework that enables open-vocabulary 3D occupancy prediction. Our framework employs progressive online densification, a feed-forward strategy that gradually enhances the 3D Gaussian representation to capture fine-grained scene details. By iteratively enhancing the representation, the framework achieves increasingly precise and detailed scene understanding. Another key contribution is the introduction of an anisotropy-aware sampling strategy with spatio-temporal fusion, which adaptively assigns receptive fields to Gaussians at different scales and stages, enabling more effective feature aggregation and richer scene information capture. Through extensive evaluations, we demonstrate that PG-Occ achieves state-of-the-art performance with a relative 14.3% mIoU improvement over the previous best performing method. Code and pretrained models will be released upon publication on our project page: https://yanchi-3dv.github.io/PG-Occ

[300] Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models

Yolo Yunlong Tang, Jing Bi, Pinxin Liu, Zhenyu Pan, Zhangyun Tan, Qianxiang Shen, Jiani Liu, Hang Hua, Junjia Guo, Yunzhong Xiao, Chao Huang, Zhiyuan Wang, Susan Liang, Xinyi Liu, Yizhi Song, Yuhe Nie, Jia-Xing Zhong, Bozheng Li, Daiqing Qi, Ziyun Zeng, Ali Vosoughi, Luchuan Song, Zeliang Zhang, Daiki Shimada, Han Liu, Jiebo Luo, Chenliang Xu

Main category: cs.CV

TL;DR: This survey provides the first comprehensive examination of post-training methodologies for Video-Large Multimodal Models (Video-LMMs), covering supervised fine-tuning, reinforcement learning, and test-time scaling techniques to enhance video understanding capabilities.

Details

Motivation: Video understanding is challenging due to complex spatiotemporal relationships and long-term dependencies. While Video-LMMs show promise, their post-training phase remains fragmented in literature, limiting their transformation from basic perception to sophisticated reasoning engines.

Method: The survey examines three post-training pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. It presents a structured taxonomy addressing video-specific challenges like temporal localization and spatiotemporal grounding.

Result: The survey synthesizes key design principles, insights, and evaluation protocols while identifying open challenges in reward design, scalability, and cost-performance optimization. It curates essential benchmarks, datasets, and metrics for rigorous assessment of post-training effectiveness.

Conclusion: This work provides researchers and practitioners with a unified framework for advancing Video-LMM capabilities through systematic post-training methodologies, with ongoing resources maintained at the provided GitHub repository.

Abstract: Video understanding represents the most challenging frontier in computer vision, requiring models to reason about complex spatiotemporal relationships, long-term dependencies, and multimodal evidence. The recent emergence of Video-Large Multimodal Models (Video-LMMs), which integrate visual encoders with powerful decoder-based language models, has demonstrated remarkable capabilities in video understanding tasks. However, the critical phase that transforms these models from basic perception systems into sophisticated reasoning engines, post-training, remains fragmented across the literature. This survey provides the first comprehensive examination of post-training methodologies for Video-LMMs, encompassing three fundamental pillars: supervised fine-tuning (SFT) with chain-of-thought, reinforcement learning (RL) from verifiable objectives, and test-time scaling (TTS) through enhanced inference computation. We present a structured taxonomy that clarifies the roles, interconnections, and video-specific adaptations of these techniques, addressing unique challenges such as temporal localization, spatiotemporal grounding, long video efficiency, and multimodal evidence integration. Through systematic analysis of representative methods, we synthesize key design principles, insights, and evaluation protocols while identifying critical open challenges in reward design, scalability, and cost-performance optimization. We further curate essential benchmarks, datasets, and metrics to facilitate rigorous assessment of post-training effectiveness. This survey aims to provide researchers and practitioners with a unified framework for advancing Video-LMM capabilities. Additional resources and updates are maintained at: https://github.com/yunlong10/Awesome-Video-LMM-Post-Training

[301] Human Action Recognition from Point Clouds over Time

James Dickens

Main category: cs.CV

TL;DR: A novel 3D action recognition pipeline using point clouds from depth sensors and monocular depth estimation, combining point-based methods with sparse convolutional networks to achieve competitive performance on NTU RGB-D 120 dataset.

Details

Motivation: To leverage dense 3D data from consumer-grade depth sensors and Lidar for action recognition as an alternative to skeletal and video-based methods.

Method: Pipeline segments human point clouds from background, tracks individuals over time, performs body part segmentation, and uses point-based techniques with sparse convolutional networks on voxel-mapped point cloud sequences with auxiliary features.

Result: Achieves 89.3% accuracy on NTU RGB-D 120 dataset with ensemble setup, outperforming previous point cloud action recognition methods and being competitive with skeletal approaches.

Conclusion: The proposed method successfully demonstrates the viability of dense 3D point cloud data for action recognition, offering a third approach alongside skeletal and video-based methods.

Abstract: Recent research into human action recognition (HAR) has focused predominantly on skeletal action recognition and video-based methods. With the increasing availability of consumer-grade depth sensors and Lidar instruments, there is a growing opportunity to leverage dense 3D data for action recognition, to develop a third way. This paper presents a novel approach for recognizing actions from 3D videos by introducing a pipeline that segments human point clouds from the background of a scene, tracks individuals over time, and performs body part segmentation. The method supports point clouds from both depth sensors and monocular depth estimation. At the core of the proposed HAR framework is a novel backbone for 3D action recognition, which combines point-based techniques with sparse convolutional networks applied to voxel-mapped point cloud sequences. Experiments incorporate auxiliary point features including surface normals, color, infrared intensity, and body part parsing labels, to enhance recognition accuracy. Evaluation on the NTU RGB- D 120 dataset demonstrates that the method is competitive with existing skeletal action recognition algorithms. Moreover, combining both sensor-based and estimated depth inputs in an ensemble setup, this approach achieves 89.3% accuracy when different human subjects are considered for training and testing, outperforming previous point cloud action recognition methods.

[302] TFM Dataset: A Novel Multi-task Dataset and Integrated Pipeline for Automated Tear Film Break-Up Segmentation

Guangrong Wan, Jun liu, Qiyang Zhou, Tang tang, Lianghao Shi, Wenjun Luo, TingTing Xu

Main category: cs.CV

TL;DR: This paper introduces the Tear Film Multi-task (TFM) Dataset and proposes TF-Net for automated tear film break-up segmentation, along with TF-Collab pipeline for integrated real-time analysis of dry eye syndrome.

Details

Motivation: Automated tear film break-up (TFBU) analysis is crucial for diagnosing dry eye syndrome, but current approaches face challenges due to lack of annotated datasets and integrated solutions.

Method: Created TFM Dataset with 15 high-resolution videos and 6,247 frames annotated for three vision tasks. Proposed TF-Net with MobileOne-mini backbone and enhanced feature pyramid network for efficient segmentation. Developed TF-Collab pipeline that sequentially orchestrates frame classification, pupil localization, and TFBU segmentation.

Result: Established benchmark performance on TFM segmentation subset. TF-Net achieves favorable balance between accuracy and computational efficiency for real-time clinical applications. TF-Collab fully automates the tear film analysis process.

Conclusion: The proposed TF-Net and TF-Collab provide effective automated solutions for ocular surface diagnostics, with the TFM dataset serving as a foundation for future research in this field.

Abstract: Tear film break-up (TFBU) analysis is critical for diagnosing dry eye syndrome, but automated TFBU segmentation remains challenging due to the lack of annotated datasets and integrated solutions. This paper introduces the Tear Film Multi-task (TFM) Dataset, the first comprehensive dataset for multi-task tear film analysis, comprising 15 high-resolution videos (totaling 6,247 frames) annotated with three vision tasks: frame-level classification (‘clear’, ‘closed’, ‘broken’, ‘blur’), Placido Ring detection, and pixel-wise TFBU area segmentation. Leveraging this dataset, we first propose TF-Net, a novel and efficient baseline segmentation model. TF-Net incorporates a MobileOne-mini backbone with re-parameterization techniques and an enhanced feature pyramid network to achieve a favorable balance between accuracy and computational efficiency for real-time clinical applications. We further establish benchmark performance on the TFM segmentation subset by comparing TF-Net against several state-of-the-art medical image segmentation models. Furthermore, we design TF-Collab, a novel integrated real-time pipeline that synergistically leverages models trained on all three tasks of the TFM dataset. By sequentially orchestrating frame classification for BUT determination, pupil region localization for input standardization, and TFBU segmentation, TF-Collab fully automates the analysis. Experimental results demonstrate the effectiveness of the proposed TF-Net and TF-Collab, providing a foundation for future research in ocular surface diagnostics. Our code and the TFM datasets are available at https://github.com/glory-wan/TF-Net

[303] OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search

Zexin Zheng, Huangyu Dai, Lingtao Mao, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai

Main category: cs.CV

TL;DR: OneVision is an end-to-end generative framework that replaces traditional multi-stage cascading architecture for vision search, using vision-aligned residual quantization to align multi-view representations and improve both efficiency and conversion rates.

Details

Motivation: Traditional multi-stage cascading architecture suffers from representation discrepancy between query and product images across different stages, making it difficult to achieve optimal user experience and conversion rates simultaneously.

Method: Proposes VRQ (vision-aligned residual quantization) encoding to align different representations of objects across multiple viewpoints while preserving product distinctiveness, and uses multi-stage semantic alignment to maintain visual similarity while incorporating user preferences.

Result: Offline: performs on par with online MCA while improving inference efficiency by 21% through dynamic pruning. Online A/B tests: +2.15% item CTR, +2.27% CVR, and +3.12% order volume.

Conclusion: A semantic ID centric, generative architecture can successfully unify retrieval and personalization while simplifying the serving pathway, achieving significant improvements in both efficiency and conversion metrics.

Abstract: Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. This multi-view representation discrepancy of the same object in the query and the optimization objective collide across these stages, making it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic ID centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.

[304] acia-workflows: Automated Single-cell Imaging Analysis for Scalable and Deep Learning-based Live-cell Imaging Analysis Workflows

Johannes Seiffarth, Keitaro Kasahara, Michelle Bund, Benita Lückel, Richard D. Paul, Matthias Pesch, Lennart Witting, Michael Bott, Dietrich Kohlheyer, Katharina Nöh

Main category: cs.CV

TL;DR: The paper presents acia-workflows, a platform for automated analysis of live-cell imaging data using deep learning segmentation and tracking methods, packaged in reproducible Jupyter Notebook workflows.

Details

Motivation: High-throughput live-cell imaging generates massive data volumes that obscure biological insights, requiring automated analysis tools that are accessible and user-friendly for routine biological research.

Method: Developed a platform with three components: (1) acia Python library with 8 deep learning segmentation/tracking approaches, (2) reproducible Jupyter Notebook workflows combining analysis pipelines with dependencies and visualizations, and (3) application workflows for real-world use cases.

Result: Created over ten open-source application workflows for microfluidic live-cell imaging experiments, enabling analyses ranging from growth rate comparisons to minute-resolution quantitative analysis of individual cell responses to changing oxygen conditions.

Conclusion: The acia-workflows platform successfully integrates powerful deep learning tools into accessible, flexible workflows that support routine biological research applications, making automated live-cell imaging analysis more practical and reproducible.

Abstract: Live-cell imaging (LCI) technology enables the detailed spatio-temporal characterization of living cells at the single-cell level, which is critical for advancing research in the life sciences, from biomedical applications to bioprocessing. High-throughput setups with tens to hundreds of parallel cell cultivations offer the potential for robust and reproducible insights. However, these insights are obscured by the large amount of LCI data recorded per experiment. Recent advances in state-of-the-art deep learning methods for cell segmentation and tracking now enable the automated analysis of such large data volumes, offering unprecedented opportunities to systematically study single-cell dynamics. The next key challenge lies in integrating these powerful tools into accessible, flexible, and user-friendly workflows that support routine application in biological research. In this work, we present acia-workflows, a platform that combines three key components: (1) the Automated live-Cell Imaging Analysis (acia) Python library, which supports the modular design of image analysis pipelines offering eight deep learning segmentation and tracking approaches; (2) workflows that assemble the image analysis pipeline, its software dependencies, documentation, and visualizations into a single Jupyter Notebook, leading to accessible, reproducible and scalable analysis workflows; and (3) a collection of application workflows showcasing the analysis and customization capabilities in real-world applications. Specifically, we present three workflows to investigate various types of microfluidic LCI experiments ranging from growth rate comparisons to precise, minute-resolution quantitative analyses of individual dynamic cells responses to changing oxygen conditions. Our collection of more than ten application workflows is open source and publicly available at https://github.com/JuBiotech/acia-workflows.

[305] Efficient Universal Models for Medical Image Segmentation via Weakly Supervised In-Context Learning

Jiesi Hu, Yanwu Yang, Zhiyu Ye, Jinyan Zhou, Jianfeng Cao, Hanyang Peng, Ting Ma

Main category: cs.CV

TL;DR: WS-ICL is a weakly supervised in-context learning approach that uses weak prompts like bounding boxes or points instead of dense labels, significantly reducing annotation effort while maintaining comparable performance to regular ICL models.

Details

Motivation: Universal medical image segmentation models require extensive annotations - interactive models need repeated user prompts and ICL relies on dense pixel-level labels, which is time-consuming and costly.

Method: Proposed Weakly Supervised In-Context Learning (WS-ICL) that leverages weak prompts (bounding boxes or points) instead of dense labels for context, eliminating the need for fine-grained masks and repeated user prompting.

Result: WS-ICL achieves performance comparable to regular ICL models at significantly lower annotation cost, and is highly competitive under interactive paradigm on three held-out benchmarks.

Conclusion: WS-ICL establishes a promising step toward more efficient and unified universal models for medical image segmentation, with code and model publicly available.

Abstract: Universal models for medical image segmentation, such as interactive and in-context learning (ICL) models, offer strong generalization but require extensive annotations. Interactive models need repeated user prompts for each image, while ICL relies on dense, pixel-level labels. To address this, we propose Weakly Supervised In-Context Learning (WS-ICL), a new ICL paradigm that leverages weak prompts (e.g., bounding boxes or points) instead of dense labels for context. This approach significantly reduces annotation effort by eliminating the need for fine-grained masks and repeated user prompting for all images. We evaluated the proposed WS-ICL model on three held-out benchmarks. Experimental results demonstrate that WS-ICL achieves performance comparable to regular ICL models at a significantly lower annotation cost. In addition, WS-ICL is highly competitive even under the interactive paradigm. These findings establish WS-ICL as a promising step toward more efficient and unified universal models for medical image segmentation. Our code and model are publicly available at https://github.com/jiesihu/Weak-ICL.

cs.AI

[306] AlphaApollo: Orchestrating Foundation Models and Professional Tools into a Self-Evolving System for Deep Agentic Reasoning

Zhanke Zhou, Chentao Cao, Xiao Feng, Xuan Li, Zongze Li, Xiangyu Lu, Jiangchao Yao, Weikai Huang, Linrui Xu, Tian Cheng, Guanyu Jiang, Yiming Zheng, Brando Miranda, Tongliang Liu, Sanmi Koyejo, Masashi Sugiyama, Bo Han

Main category: cs.AI

TL;DR: AlphaApollo is a self-evolving agentic reasoning system that uses multiple models with computation and retrieval tools to overcome foundation model limitations, achieving significant performance gains in AIME evaluations.

Details

Motivation: To address two key bottlenecks in foundation model reasoning: limited model-intrinsic capacity and unreliable test-time iteration.

Method: Orchestrates multiple models with professional tools (Python computation and retrieval tools) using a shared state map for multi-round, multi-model solution evolution with iterative refinement.

Result: Consistent performance gains: +5.15% Average@32 and +23.34% Pass@32 for Qwen2.5-14B-Instruct, +8.91% Average@32 and +26.67% Pass@32 for Llama-3.3-70B-Instruct. Over 80% of tool calls successfully executed.

Conclusion: AlphaApollo successfully lifts the capability ceiling of foundation models through tool-augmented reasoning and multi-model collaboration.

Abstract: We present AlphaApollo, a self-evolving agentic reasoning system that aims to address two bottlenecks in foundation model (FM) reasoning-limited model-intrinsic capacity and unreliable test-time iteration. AlphaApollo orchestrates multiple models with professional tools to enable deliberate, verifiable reasoning. It couples (i) a computation tool (Python with numerical and symbolic libraries) and (ii) a retrieval tool (task-relevant external information) to execute exact calculations and ground decisions. The system further supports multi-round, multi-model solution evolution via a shared state map that records candidates, executable checks, and feedback for iterative refinement. In evaluations on AIME 2024/2025 across multiple models, AlphaApollo delivers consistent gains: +5.15% Average@32 and +23.34% Pass@32 for Qwen2.5-14B-Instruct, and +8.91% Average@32 with +26.67% Pass@32 for Llama-3.3-70B-Instruct. Tool-use analysis shows that more than 80% of tool calls are successfully executed, with consistent outperformance of non-tool baselines, thereby lifting the capability ceiling of FMs. More empirical results and implementation details will be updated at https://github.com/tmlr-group/AlphaApollo.

[307] Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization

Mohammad Mahdi Samiei Paqaleh, Arash Marioriyad, Arman Tahmasebi-Zadeh, Mohamadreza Fereydooni, Mahdi Ghaznavai, Mahdieh Soleymani Baghshah

Main category: cs.AI

TL;DR: The paper proposes Complexity Out of Distribution (Complexity OoD) generalization as a framework to define and measure reasoning ability in AI systems, distinguishing it from pattern recognition and formalizing reasoning through solution complexity metrics.

Details

Motivation: There is no clear, consistent definition or metric for reasoning ability in AI, unlike learning where generalization concepts are well-established. Current AI systems lack formal ways to evaluate reasoning capabilities beyond pattern recognition.

Method: The authors formalize complexity via solution description Kolmogorov complexity and operational proxies (object/relation counts, reasoning step counts). They propose Complexity OoD generalization where models maintain performance on test instances requiring more complex solutions than training examples.

Result: The framework unifies learning and reasoning, showing how System1-like processing becomes System2-like under complexity pressure. It provides recommendations for operationalizing Complexity OoD across benchmarks, supervision, inductive biases, and addressing learning-to-reason challenges.

Conclusion: Progress toward robust reasoning requires architectures and training regimes that explicitly model computation with respect to complexity, as Complexity OoD cannot be solved by scaling data alone.

Abstract: Recent progress has pushed AI frontiers from pattern recognition tasks toward problems that require step by step, System2 style reasoning, especially with large language models. Yet, unlike learning, where generalization and out of distribution (OoD) evaluation concepts are well formalized, there is no clear, consistent definition or metric for reasoning ability. We propose Complexity Out of Distribution (Complexity OoD) generalization as a framework and problem setting to define and measure reasoning. A model exhibits Complexity OoD generalization when it maintains performance on test instances whose minimal required solution complexity, either representational (richer solution structure) or computational (more reasoning steps/program length), exceeds that of all training examples. We formalize complexity via solution description Kolmogorov complexity and operational proxies (e.g., object/relation counts; reasoning step counts), clarifying how Complexity OoD differs from length and compositional OoD. This lens unifies learning and reasoning: many cases solvable with System1 like processing at low complexity become System2 like under complexity pressure, while System2 can be viewed as generalization over solution structures. We translate this perspective into practice with recommendations for operationalizing Complexity OoD across the stack: incorporating complexity into benchmark and evaluation metric design, rethinking supervision to target solution traces, seeking and designing inductive biases for Complexity OoD generalization, addressing learning to reason spillovers such as spurious shortcuts, semantic robustness, catastrophic forgetting, and step wise calibration. Because Complexity OoD cannot be solved by scaling data alone, progress toward robust reasoning will require architectures and training regimes that explicitly model and allocate computation with respect to complexity.

[308] BuilderBench – A benchmark for generalist agents

Raj Ghugare, Catherine Ji, Kathryn Wantlin, Jin Schofield, Benjamin Eysenbach

Main category: cs.AI

TL;DR: BuilderBench is a benchmark for agent pre-training that focuses on open-ended exploration in a block-building environment, requiring embodied reasoning through physical interaction without external supervision.

Details

Motivation: Current AI models struggle with novel problems beyond existing data limits. The goal is to develop agents that can learn through experience and exploration rather than just mimicry.

Method: Created BuilderBench with a hardware-accelerated simulator for robotic agents interacting with physical blocks, and a task suite of 42 diverse target structures testing physics understanding, mathematics, and long-horizon planning.

Result: The benchmark challenges current algorithms, showing they struggle with these tasks that require embodied reasoning through actions and experimentation rather than just language-based reasoning.

Conclusion: BuilderBench provides a scalable testbed for developing agents that learn through interaction, with implementations of six algorithms as reference points for future research.

Abstract: Today’s AI models learn primarily through mimicry and sharpening, so it is not surprising that they struggle to solve problems beyond the limits set by existing data. To solve novel problems, agents should acquire skills for exploring and learning through experience. Finding a scalable learning mechanism for developing agents that learn through interaction remains a major open problem. In this work, we introduce BuilderBench, a benchmark to accelerate research into agent pre-training that centers open-ended exploration. BuilderBench requires agents to learn how to build any structure using blocks. BuilderBench is equipped with $(1)$ a hardware accelerated simulator of a robotic agent interacting with various physical blocks, and $(2)$ a task-suite with over 42 diverse target structures that are carefully curated to test an understanding of physics, mathematics, and long-horizon planning. During training, agents have to explore and learn general principles about the environment without any external supervision. During evaluation, agents have to build the unseen target structures from the task suite. Solving these tasks requires a sort of \emph{embodied reasoning} that is not reflected in words but rather in actions, experimenting with different strategies and piecing them together. Our experiments show that many of these tasks challenge the current iteration of algorithms. Hence, we also provide a ``training wheels’’ protocol, in which agents are trained and evaluated to build a single target structure from the task suite. Finally, we provide single-file implementations of six different algorithms as a reference point for researchers.

[309] Requirements for Game-Based Learning Design Framework for Information System Integration in the Context of Post-Merger Integration

Ksenija Lace, Marite Kirikova

Main category: cs.AI

TL;DR: Game-based learning framework proposed to address high learning curve and low motivation in post-merger information system integration training, transforming static methods into engaging experiences.

Details

Motivation: Existing methods AMILI and AMILP for post-merger IS integration have high learning curve and low learner motivation, creating need for more engaging training approaches.

Method: Analyzed learning theories, cognitive load models, motivation models, and serious game design frameworks to identify requirements for game-based learning framework with two components: transformation process and learning experience.

Result: Identified essential requirements for game-based learning design framework tailored to information system integration in post-merger context.

Conclusion: Plan to develop and evaluate the proposed framework through iterative design and real-world validation.

Abstract: Post-merger integration states unique challenges for professionals responsible for information system integration aimed on alignment and combination diverse system architectures of merging organizations. Although the theoretical and practical guidance exists for post-merger integration on the business level, there is a significant gap in training for information system integration in this context. In prior research specific methods AMILI (Support method for informed decision identification) and AMILP (Support method for informed decision-making) were introduced for the support of information system integration decisions in the post-merger integration. But during the practical application was reported high learning curve and low learner motivation. This paper explores how game-based learning design can address these limitations by transforming static method training into engaging learning experience. The study analyzes foundational learning theories, cognitive load and motivation models, and serious game design frameworks to identify the essential requirements for a game-based learning design framework tailored to information system integration in post-merger integration. Requirements are structured in two components: the transformation process and resulting learning experience. The paper concludes with a plan for developing and evaluating the proposed framework through iterative design and real-world validation.

[310] Belief-Calibrated Multi-Agent Consensus Seeking for Complex NLP Tasks

Wentao Deng, Jiahuan Pei, Zhiwei Xu, Zhaochun Ren, Zhumin Chen, Pengjie Ren

Main category: cs.AI

TL;DR: The paper proposes BCCS, a framework for stable consensus in multi-agent systems by selecting optimal collaborators and calibrating consensus judgment using system-internal beliefs, achieving significant performance improvements on NLP benchmarks.

Details

Motivation: Existing consensus-seeking approaches in multi-agent systems rely on voting mechanisms that overlook internal belief contradictions and use uniform collaboration, which hinders stable consensus formation.

Method: Proposed Belief-Calibrated Consensus Seeking (BCCS) framework with theoretical foundation for selecting optimal collaborators and calibrating consensus judgment using system-internal beliefs.

Result: BCCS outperforms existing best results by 2.23% on MATH and 3.95% on MMLU benchmark datasets for challenging NLP tasks.

Conclusion: The BCCS framework effectively addresses consensus instability in multi-agent systems through optimal collaborator selection and belief-calibrated consensus judgment, demonstrating superior performance on standard benchmarks.

Abstract: A multi-agent system (MAS) enhances its capacity to solve complex natural language processing (NLP) tasks through collaboration among multiple agents, where consensus-seeking serves as a fundamental mechanism. However, existing consensus-seeking approaches typically rely on voting mechanisms to judge consensus, overlooking contradictions in system-internal beliefs that destabilize the consensus. Moreover, these methods often involve agents updating their results through indiscriminate collaboration with every other agent. Such uniform interaction fails to identify the optimal collaborators for each agent, hindering the emergence of a stable consensus. To address these challenges, we provide a theoretical framework for selecting optimal collaborators that maximize consensus stability. Based on the theorems, we propose the Belief-Calibrated Consensus Seeking (BCCS) framework to facilitate stable consensus via selecting optimal collaborators and calibrating the consensus judgment by system-internal beliefs. Experimental results on the MATH and MMLU benchmark datasets demonstrate that the proposed BCCS framework outperforms the best existing results by 2.23% and 3.95% of accuracy on challenging tasks, respectively. Our code and data are available at https://github.com/dengwentao99/BCCS.

[311] Off-Trajectory Reasoning: Can LLMs Collaborate on Reasoning Trajectory?

Aochong Oliver Li, Tanya Goyal

Main category: cs.AI

TL;DR: Standard solo-reasoning training pipelines fail to produce desired off-trajectory reasoning behaviors in LLMs, with stronger models being more fragile to distractions and all models failing to effectively leverage collaborators’ reasoning.

Details

Motivation: To investigate whether standard solo-reasoning training can enable LLMs to assess and build upon other models' partial reasoning (off-trajectory reasoning) for effective multi-model collaboration.

Method: Proposed twin tests (Recoverability and Guidability) to evaluate off-trajectory reasoning, tested 15 open-weight LLMs (1.5B-32B), and conducted control studies on post-training factors (distillation teacher choice, RL use, data selection).

Result: Stronger LLMs on benchmarks are more fragile under distraction; all models fail to leverage collaborators’ reasoning on problems beyond their capabilities (solve rates <9.2%); suboptimal recoverability behaviors transfer through distillation.

Conclusion: Current reasoning LLMs have limitations in off-trajectory reasoning, highlighting the need for training methods that enable effective multi-model collaboration in shared reasoning trajectories.

Abstract: Reasoning LLMs are trained to verbalize their reasoning process, yielding strong gains on complex tasks. This transparency also opens a promising direction: multiple reasoners can directly collaborate on each other’s thinking within a shared trajectory, yielding better inference efficiency and exploration. A key prerequisite, however, is the ability to assess the usefulness and build on another model’s partial thinking – we call this off-trajectory reasoning. Our paper investigates a critical question: can standard solo-reasoning training pipelines deliver desired off-trajectory behaviors? We propose twin tests that capture the two extremes of the off-trajectory spectrum, namely Recoverability, which tests whether LLMs can backtrack from “distractions” induced by misleading reasoning traces, and Guidability, which tests their ability to build upon correct reasoning from stronger collaborators. Our study evaluates 15 open-weight LLMs (1.5B-32B) and reveals a counterintuitive finding – “stronger” LLMs on benchmarks are often more fragile under distraction. Moreover, all models tested fail to effectively leverage guiding steps from collaborators on problems beyond their inherent capabilities with solve rates remaining under 9.2%. Finally, we conduct control studies to isolate the effects of three factors in post-training on these behaviors: the choice of distillation teacher, the use of RL, and data selection strategy. Our results provide actionable insights for training natively strong reasoning collaborators; e.g., we find that suboptimal recoverability behaviors of teacher models are transferred to distilled students even if the distillation trajectories are correct. Taken together, this work lays the groundwork for evaluating multi-model collaborations in shared reasoning trajectories and highlights the limitations of off-the-shelf reasoning LLMs.

[312] Multi-Objective Multi-Agent Path Finding with Lexicographic Cost Preferences

Pulkit Rustagi, Kyle Hollins Wray, Sandhya Saisubramanian

Main category: cs.AI

TL;DR: Proposes Lexicographic Conflict-Based Search (LCBS) for multi-objective multi-agent path finding that directly computes solutions aligned with lexicographic preferences, avoiding Pareto frontier construction and scaling to 10 objectives.

Details

Motivation: Current MO-MAPF algorithms don't optimize for user-defined preferences even when available, and scale poorly with increasing objectives. They produce conflict-free plans via Pareto frontiers without preference optimization.

Method: LCBS integrates priority-aware low-level A* search with conflict-based search, using lexicographic preference over objectives to guide planning without constructing Pareto frontiers.

Result: LCBS computes optimal solutions and scales to instances with up to ten objectives, far beyond existing methods. Shows consistently higher success rates on standard and randomized benchmarks, especially with more objectives.

Conclusion: The lexicographic framework and LCBS algorithm enable efficient preference-aware multi-objective planning that scales significantly better than existing approaches while maintaining optimality.

Abstract: Many real-world scenarios require multiple agents to coordinate in shared environments, while balancing trade-offs between multiple, potentially competing objectives. Current multi-objective multi-agent path finding (MO-MAPF) algorithms typically produce conflict-free plans by computing Pareto frontiers. They do not explicitly optimize for user-defined preferences, even when the preferences are available, and scale poorly with the number of objectives. We propose a lexicographic framework for modeling MO-MAPF, along with an algorithm \textit{Lexicographic Conflict-Based Search} (LCBS) that directly computes a single solution aligned with a lexicographic preference over objectives. LCBS integrates a priority-aware low-level $A^*$ search with conflict-based search, avoiding Pareto frontier construction and enabling efficient planning guided by preference over objectives. We provide insights into optimality and scalability, and empirically demonstrate that LCBS computes optimal solutions while scaling to instances with up to ten objectives – far beyond the limits of existing MO-MAPF methods. Evaluations on standard and randomized MAPF benchmarks show consistently higher success rates against state-of-the-art baselines, especially with increasing number of objectives.

[313] Flavonoid Fusion: Creating a Knowledge Graph to Unveil the Interplay Between Food and Health

Aryan Singh Dalal, Yinglun Zhang, Duru Doğan, Atalay Mert İleri, Hande Küçük McGinty

Main category: cs.AI

TL;DR: This paper creates a knowledge graph to link food and health relationships, focusing on flavonoid contents from USDA databases and cancer connections from literature, using KNARM methodology.

Details

Motivation: There's little research on representing food-health relationships in standardized, machine-readable formats using semantic web technologies, despite growing interest in 'food as medicine' concepts.

Method: Used KNARM methodology to create a knowledge graph that combines information from USDA databases (flavonoid contents) and literature (cancer connections), representing relationships in machine-operable format.

Result: Developed a knowledge graph that serves as an example for researchers to explore the complex interplay between dietary choices and disease management.

Conclusion: The knowledge graph provides a foundation for future work to expand scope, capture nuances, add more data, and perform inferences to uncover hidden relationships between food and health.

Abstract: The focus on “food as medicine” is gaining traction in the field of health and several studies conducted in the past few years discussed this aspect of food in the literature. However, very little research has been done on representing the relationship between food and health in a standardized, machine-readable format using a semantic web that can help us leverage this knowledge effectively. To address this gap, this study aims to create a knowledge graph to link food and health through the knowledge graph’s ability to combine information from various platforms focusing on flavonoid contents of food found in the USDA databases and cancer connections found in the literature. We looked closely at these relationships using KNARM methodology and represented them in machine-operable format. The proposed knowledge graph serves as an example for researchers, enabling them to explore the complex interplay between dietary choices and disease management. Future work for this study involves expanding the scope of the knowledge graph by capturing nuances, adding more related data, and performing inferences on the acquired knowledge to uncover hidden relationships.

[314] PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles

Yitao Long, Yuru Jiang, Hongjun Liu, Yilun Zhao, Jingchen Sun, Yiqiu Shen, Chen Zhao, Arman Cohan, Dennis Shasha

Main category: cs.AI

TL;DR: PuzzlePlex is a benchmark with 15 diverse puzzle types to evaluate foundation models’ reasoning and planning capabilities, showing reasoning models excel in instruction-based settings while code-based execution is more challenging but scalable.

Details

Motivation: To assess the reasoning and planning capabilities of foundation models in complex, dynamic environments and understand their scalability limits.

Method: Developed PuzzlePlex benchmark with 15 puzzle types including deterministic/stochastic games and single/two-player scenarios, implemented custom strategies, used fine-grained metrics, and tested models in instruction-based and code-based settings.

Result: Reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative.

Conclusion: PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.

Abstract: This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game, and supports extensibility to generate more challenging instances as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.

[315] Beneficial Reasoning Behaviors in Agentic Search and Effective Post-training to Obtain Them

Jiahe Jin, Abhijay Paladugu, Chenyan Xiong

Main category: cs.AI

TL;DR: The paper proposes Behavior Priming, a technique that trains agentic search models by synthesizing trajectories with four beneficial reasoning behaviors (Information Verification, Authority Evaluation, Adaptive Search, Error Recovery) through SFT followed by RL, achieving over 35% performance gains.

Details

Motivation: Agentic search with LLMs faces challenges in reasoning and agentic capabilities when interacting with retrieval systems and the web, requiring effective reasoning behavior patterns.

Method: Proposed a reasoning-driven LLM pipeline to analyze successful agentic search trajectories, identified four beneficial reasoning behaviors, and developed Behavior Priming technique that synthesizes trajectories with these behaviors for SFT followed by RL training.

Result: Experiments on GAIA, WebWalker, and HLE benchmarks show over 35% gains in Llama3.2-3B and Qwen3-1.7B compared to direct RL training. The desired reasoning behaviors in SFT data, not answer correctness, is critical for strong RL performance.

Conclusion: Behavior Priming enables more effective exploration and test-time scaling capabilities, providing a strong foundation for RL. The reasoning behaviors in training data are more important than answer correctness for achieving high performance.

Abstract: Agentic search leverages large language models (LLMs) to interpret complex user information needs and execute a multi-step process of planning, searching, and synthesizing information to provide answers. This paradigm introduces unique challenges for LLMs’ reasoning and agentic capabilities when interacting with retrieval systems and the broader web. In this paper, we propose a reasoning-driven LLM-based pipeline to study effective reasoning behavior patterns in agentic search. Using this pipeline, we analyze successful agentic search trajectories and identify four beneficial reasoning behaviors: Information Verification, Authority Evaluation, Adaptive Search, and Error Recovery. Based on these findings, we propose a technique called Behavior Priming to train more effective agentic search models. It synthesizes agentic search trajectories that exhibit these four behaviors and integrates them into the agentic search model through supervised fine-tuning (SFT), followed by standard reinforcement learning (RL). Experiments on three benchmarks (GAIA, WebWalker, and HLE) demonstrate that behavior priming yields over 35% gains in Llama3.2-3B and Qwen3-1.7B compared to directly training agentic search models with RL. Crucially, we demonstrate that the desired reasoning behaviors in the SFT data, rather than the correctness of the final answer, is the critical factor for achieving strong final performance after RL: fine-tuning on trajectories with desirable reasoning behaviors but incorrect answers leads to better performance than fine-tuning on trajectories with correct answers. Our analysis further reveals the underlying mechanism: the introduced reasoning behaviors endow models with more effective exploration (higher pass@k and entropy) and test-time scaling (longer trajectories) capabilities, providing a strong foundation for RL. Our code will be released as open source.

[316] Auto-Prompt Ensemble for LLM Judge

Jiajie Li, Huayi Zhang, Peng Lin, Jinjun Xiong, Wei Xu

Main category: cs.AI

TL;DR: APE framework improves LLM judge reliability by automatically learning evaluation dimensions from failure cases and using confidence-based ensemble to selectively augment judgments.

Details

Motivation: Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize implicit standards in human assessments, creating an evaluation gap.

Method: Auto-Prompt Ensemble (APE) automatically learns evaluation dimensions from failure cases and uses Collective Confidence estimation for confidence-based ensemble decisions.

Result: APE improves GPT-4o agreement rate on Reward Bench from 87.2% to 90.5% in zero-shot setting, enhancing reliability across diverse benchmarks.

Conclusion: APE provides a principled approach for LLM judges to leverage test-time computation and bridge the evaluation gap between human and LLM judges.

Abstract: We present a novel framework that improves the reliability of LLM judges by selectively augmenting LLM with auxiliary evaluation dimensions. Existing LLM judges often miss crucial evaluation dimensions because they fail to recognize the implicit standards underlying human assessments. To address this challenge, we propose the Auto-Prompt Ensemble (APE), an adaptive framework that automatically learns evaluation dimensions from its failure cases. APE incorporates a confidence-based ensemble mechanism to decide when to adopt the judgments from additional evaluation dimensions through a novel confidence estimation approach called Collective Confidence. Extensive experiments demonstrate that APE improves the reliability of LLM Judge across diverse standard benchmarks. For instance, APE enhances GPT-4o agreement rate on Reward Bench from 87.2% to 90.5% in the zero-shot setting. Overall, APE provides a principled approach for LLM Judge to leverage test-time computation, and bridge the evaluation gap between human and LLM judges.

[317] WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks

Jingbo Yang, Bairu Hou, Wei Wei, Shiyu Chang, Yujia Bao

Main category: cs.AI

TL;DR: WebDART is a framework that enables single LLMs to handle complex web tasks by dynamically decomposing objectives into navigation, information extraction, and execution subtasks, with continuous replanning as new webpages are revealed.

Details

Motivation: Current LLM agents struggle with complex web tasks requiring long horizon navigation, large scale information extraction, and reasoning under constraints, while being competent only at straightforward tasks.

Method: WebDART dynamically decomposes objectives into three focused subtasks (navigation, information extraction, execution) and continuously replans the decomposition as new webpages are revealed to take advantage of discovered filters/shortcuts and avoid redundant exploration.

Result: On WebChoreArena, WebDART improves success rates by up to 13.7 percentage points over previous SOTA agents, matches performance on easier WebArena suite, and completes tasks with up to 14.7 fewer navigation steps.

Conclusion: The WebDART framework effectively enables single LLMs to handle complex web tasks through dynamic task decomposition and continuous replanning, significantly improving performance on challenging benchmarks.

Abstract: Large language model (LLM) agents are becoming competent at straightforward web tasks, such as opening an item page or submitting a form, but still struggle with objectives that require long horizon navigation, large scale information extraction, and reasoning under constraints. We present WebDART, a general framework that enables a single LLM to handle such complex chores. WebDART (i) dynamically decomposes each objective into three focused subtasks: navigation, information extraction, and execution, so the model concentrates on one skill at a time, and (ii) continuously replans the decomposition as new webpages are revealed, taking advantage of newly discovered filters or shortcuts and avoiding redundant exploration. Evaluated on WebChoreArena, WebDART lifts success rates by up to 13.7 percentage points over previous SOTA agents, while matching their performance on the easier WebArena suite and completing tasks with up to 14.7 fewer navigation steps.

[318] Fine-Grained Emotion Recognition via In-Context Learning

Zhaochun Ren, Zhou Yang, Chenglong Ye, Haizhou Sun, Chao Chen, Xiaofei Zhu, Xiangwen Liao

Main category: cs.AI

TL;DR: EICL improves fine-grained emotion recognition by addressing emotional discrepancies in ICL through emotionally similar examples and dynamic soft-label strategy, outperforming ICL on multiple datasets.

Details

Motivation: Current ICL methods enhance reasoning but overlook decision-making in emotion recognition, and semantically similar examples often introduce emotional discrepancies that hinder accurate representations.

Method: Proposed Emotion In-Context Learning (EICL) with emotionally similar examples, dynamic soft-label strategy for better query representations, and two-stage exclusion strategy for multi-angle similarity assessment.

Result: Extensive experiments show EICL significantly outperforms ICL on multiple datasets.

Conclusion: EICL effectively addresses emotional discrepancies in ICL and improves both reasoning and decision-making processes in fine-grained emotion recognition.

Abstract: Fine-grained emotion recognition aims to identify the emotional type in queries through reasoning and decision-making processes, playing a crucial role in various systems. Recent methods use In-Context Learning (ICL), enhancing the representation of queries in the reasoning process through semantically similar examples, while further improving emotion recognition by explaining the reasoning mechanisms. However, these methods enhance the reasoning process but overlook the decision-making process. This paper investigates decision-making in fine-grained emotion recognition through prototype theory. We show that ICL relies on similarity matching between query representations and emotional prototypes within the model, where emotion-accurate representations are critical. However, semantically similar examples often introduce emotional discrepancies, hindering accurate representations and causing errors. To address this, we propose Emotion In-Context Learning (EICL), which introduces emotionally similar examples and uses a dynamic soft-label strategy to improve query representations in the emotion reasoning process. A two-stage exclusion strategy is then employed to assess similarity from multiple angles, further optimizing the decision-making process. Extensive experiments show that EICL significantly outperforms ICL on multiple datasets.

[319] KnowRL: Exploring Knowledgeable Reinforcement Learning for Factuality

Baochang Ren, Shuofei Qiao, Da Zheng, Huajun Chen, Ningyu Zhang

Main category: cs.AI

TL;DR: KnowRL integrates knowledge verification into RL training to reduce hallucinations in slow-thinking LLMs by providing factuality rewards during reasoning.

Details

Motivation: Address severe hallucination in slow-thinking LLMs where traditional RL lacks factual supervision over thinking processes, exacerbating incorrect content generation.

Method: Propose Knowledge-enhanced RL (KnowRL) that incorporates factuality rewards based on knowledge verification into RL training to guide fact-based slow thinking and help models recognize knowledge boundaries.

Result: Experimental results on three hallucination evaluation datasets and two reasoning datasets show KnowRL effectively reduces hallucinations while maintaining strong reasoning capabilities.

Conclusion: KnowRL successfully mitigates hallucinations in slow-thinking models through fact-based RL training, enabling more reliable reasoning processes without compromising original reasoning abilities.

Abstract: Large Language Models (LLMs), particularly slow-thinking models, often exhibit severe hallucination, outputting incorrect content due to an inability to accurately recognize knowledge boundaries during reasoning. While Reinforcement Learning (RL) can enhance complex reasoning abilities, its outcome-oriented reward mechanism often lacks factual supervision over the thinking process, further exacerbating the hallucination problem. To address the high hallucination in slow-thinking models, we propose Knowledge-enhanced RL, KnowRL. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. KnowRL guides models to perform fact-based slow thinking by integrating a factuality reward, based on knowledge verification, into the RL training process, helping them recognize their knowledge boundaries. This targeted factual input during RL training enables the model to learn and internalize fact-based reasoning strategies. By directly rewarding adherence to facts within the reasoning steps, KnowRL fosters a more reliable thinking process. Experimental results on three hallucination evaluation datasets and two reasoning evaluation datasets demonstrate that KnowRL effectively mitigates hallucinations in slow-thinking models while maintaining their original strong reasoning capabilities. Our code is available at https://github.com/zjunlp/KnowRL.

[320] Agent-in-the-Loop: A Data Flywheel for Continuous Improvement in LLM-based Customer Support

Cen, Zhao, Tiantian Zhang, Hanchen Su, Yufeng, Zhang, Shaowei Su, Mingzhi Xu, Yu, Liu, Wei Han, Jeremy Werner, Claire Na Cheng, Yashar Mehdad

Main category: cs.AI

TL;DR: Agent-in-the-Loop (AITL) framework enables continuous improvement of LLM-based customer support systems by integrating real-time human feedback directly into live operations, reducing retraining cycles from months to weeks.

Details

Motivation: Standard offline approaches with batch annotations are slow and inefficient for improving customer support systems, creating a need for real-time feedback integration.

Method: AITL integrates four types of live annotations: pairwise response preferences, agent adoption/rationales, knowledge relevance checks, and missing knowledge identification, which directly feed back into model updates.

Result: Production pilot showed significant improvements: +11.7% recall@75, +14.8% precision@8 in retrieval; +8.4% helpfulness in generation; +4.5% agent adoption rates.

Conclusion: Embedding human feedback loops directly into operational workflows effectively refines LLM-based customer support systems through continuous improvement cycles.

Abstract: We introduce an Agent-in-the-Loop (AITL) framework that implements a continuous data flywheel for iteratively improving an LLM-based customer support system. Unlike standard offline approaches that rely on batch annotations, AITL integrates four key types of annotations directly into live customer operations: (1) pairwise response preferences, (2) agent adoption and rationales, (3) knowledge relevance checks, and (4) identification of missing knowledge. These feedback signals seamlessly feed back into models’ updates, reducing retraining cycles from months to weeks. Our production pilot involving US-based customer support agents demonstrated significant improvements in retrieval accuracy (+11.7% recall@75, +14.8% precision@8), generation quality (+8.4% helpfulness) and agent adoption rates (+4.5%). These results underscore the effectiveness of embedding human feedback loops directly into operational workflows to continuously refine LLM-based customer support system.

[321] Inefficiencies of Meta Agents for Agent Design

Batu El, Mert Yuksekgonul, James Zou

Main category: cs.AI

TL;DR: Meta-agents for automated agent design face three key challenges: poor learning from previous iterations, low behavioral diversity in designed agents, and limited economic viability compared to human-designed agents.

Details

Motivation: To address challenges in automated design of agentic systems using meta-agents, specifically examining how meta-agents learn across iterations, behavioral diversity of designed agents, and economic viability of automated design.

Method: Investigated three key challenges: 1) Learning across iterations by comparing context expansion vs evolutionary approaches, 2) Behavioral diversity analysis of designed agents, 3) Economic cost-benefit analysis comparing automated vs human-designed agents across multiple datasets.

Result: 1) Evolutionary approach outperforms context expansion for learning across iterations; 2) Designed agents have low behavioral diversity limiting complementary use; 3) Automated design is only economically viable for two datasets when deployed on over 15,000 examples, with performance gains for other datasets not justifying design costs.

Conclusion: Current meta-agent approaches for automated agent design face significant limitations in learning efficiency, behavioral diversity, and economic viability, with automated design only being cost-effective in specific scenarios.

Abstract: Recent works began to automate the design of agentic systems using meta-agents that propose and iteratively refine new agent architectures. In this paper, we examine three key challenges in a common class of meta-agents. First, we investigate how a meta-agent learns across iterations and find that simply expanding the context with all previous agents, as proposed by previous works, performs worse than ignoring prior designs entirely. We show that the performance improves with an evolutionary approach. Second, although the meta-agent designs multiple agents during training, it typically commits to a single agent at test time. We find that the designed agents have low behavioral diversity, limiting the potential for their complementary use. Third, we assess when automated design is economically viable. We find that only in a few cases–specifically, two datasets–the overall cost of designing and deploying the agents is lower than that of human-designed agents when deployed on over 15,000 examples. In contrast, the performance gains for other datasets do not justify the design cost, regardless of scale.

[322] MultiCNKG: Integrating Cognitive Neuroscience, Gene, and Disease Knowledge Graphs Using Large Language Models

Ali Sarabadani, Kheirolah Rahsepar Fard

Main category: cs.AI

TL;DR: MultiCNKG is a novel framework that integrates cognitive neuroscience, gene ontology, and disease ontology knowledge graphs using LLMs to create a unified knowledge graph connecting genes, diseases, and cognitive processes with competitive performance metrics.

Details

Motivation: To overcome limitations in traditional machine learning methods for capturing complex semantic relationships between genes, diseases, and cognitive processes in biomedical and cognitive sciences by leveraging large language models.

Method: Integrates three knowledge sources (Cognitive Neuroscience KG, Gene Ontology, Disease Ontology) using LLMs like GPT-4 for entity alignment, semantic similarity computation, and graph augmentation to create a unified knowledge graph.

Result: Created MultiCNKG with 6.9K nodes across 5 types and 11.3K edges across 7 types, achieving high performance metrics: precision (85.20%), recall (87.30%), coverage (92.18%), graph consistency (82.50%), novelty detection (40.28%), expert validation (89.50%), and competitive link prediction results (TransE MR: 391, MRR: 0.411; RotatE MR: 263, MRR: 0.395).

Conclusion: MultiCNKG provides a robust framework that advances applications in personalized medicine, cognitive disorder diagnostics, and hypothesis formulation in cognitive neuroscience by enabling multi-layered analysis from molecular to behavioral domains.

Abstract: The advent of large language models (LLMs) has revolutionized the integration of knowledge graphs (KGs) in biomedical and cognitive sciences, overcoming limitations in traditional machine learning methods for capturing intricate semantic links among genes, diseases, and cognitive processes. We introduce MultiCNKG, an innovative framework that merges three key knowledge sources: the Cognitive Neuroscience Knowledge Graph (CNKG) with 2.9K nodes and 4.3K edges across 9 node types and 20 edge types; Gene Ontology (GO) featuring 43K nodes and 75K edges in 3 node types and 4 edge types; and Disease Ontology (DO) comprising 11.2K nodes and 8.8K edges with 1 node type and 2 edge types. Leveraging LLMs like GPT-4, we conduct entity alignment, semantic similarity computation, and graph augmentation to create a cohesive KG that interconnects genetic mechanisms, neurological disorders, and cognitive functions. The resulting MultiCNKG encompasses 6.9K nodes across 5 types (e.g., Genes, Diseases, Cognitive Processes) and 11.3K edges spanning 7 types (e.g., Causes, Associated with, Regulates), facilitating a multi-layered view from molecular to behavioral domains. Assessments using metrics such as precision (85.20%), recall (87.30%), coverage (92.18%), graph consistency (82.50%), novelty detection (40.28%), and expert validation (89.50%) affirm its robustness and coherence. Link prediction evaluations with models like TransE (MR: 391, MRR: 0.411) and RotatE (MR: 263, MRR: 0.395) show competitive performance against benchmarks like FB15k-237 and WN18RR. This KG advances applications in personalized medicine, cognitive disorder diagnostics, and hypothesis formulation in cognitive neuroscience.

[323] Code Like Humans: A Multi-Agent Solution for Medical Coding

Andreas Motzfeldt, Joakim Edin, Casper L. Christensen, Christian Hardmeier, Lars Maaløe, Anna Rogers

Main category: cs.AI

TL;DR: Introduces Code Like Humans, a new agentic framework for medical coding using large language models that implements official coding guidelines and supports the full ICD-10 system with 70K+ labels.

Details

Motivation: To automate medical coding where experts map unstructured clinical notes to alphanumeric codes for diagnoses and procedures, addressing the challenge of supporting the full ICD-10 coding system.

Method: Agentic framework using large language models that implements official coding guidelines for human experts, designed to support the complete ICD-10 coding system.

Result: Achieves best performance to date on rare diagnosis codes, though fine-tuned discriminative classifiers still have advantage for high-frequency codes. Identifies systematic ‘blind spots’ (undercoded codes).

Conclusion: The framework successfully supports full ICD-10 coding and performs well on rare codes, while identifying areas for future improvement through analysis of systematic undercoding.

Abstract: In medical coding, experts map unstructured clinical notes to alphanumeric codes for diagnoses and procedures. We introduce Code Like Humans: a new agentic framework for medical coding with large language models. It implements official coding guidelines for human experts, and it is the first solution that can support the full ICD-10 coding system (+70K labels). It achieves the best performance to date on rare diagnosis codes (fine-tuned discriminative classifiers retain an advantage for high-frequency codes, to which they are limited). Towards future work, we also contribute an analysis of system performance and identify its `blind spots’ (codes that are systematically undercoded).

[324] Verifying Memoryless Sequential Decision-making of Large Language Models

Dennis Gross, Helge Spieker, Arnaud Gotlieb

Main category: cs.AI

TL;DR: A tool for automated verification of LLM-based policies in sequential decision-making tasks using MDPs and PCTL safety requirements.

Details

Motivation: To provide rigorous formal verification of LLM policies in safety-critical sequential decision-making scenarios.

Method: Incrementally constructs reachable MDP states guided by LLM actions, encodes states as natural language prompts, parses LLM responses into actions, and verifies with Storm model checker.

Result: Open source LLMs can be verified when deterministically seeded but underperform deep reinforcement learning baselines in grid world benchmarks.

Conclusion: The tool enables continuous benchmarking and lays foundation for formally verifying increasingly capable LLMs in user-specified sequential decision-making tasks.

Abstract: We introduce a tool for rigorous and automated verification of large language model (LLM)- based policies in memoryless sequential decision-making tasks. Given a Markov decision process (MDP) representing the sequential decision-making task, an LLM policy, and a safety requirement expressed as a PCTL formula, our approach incrementally constructs only the reachable portion of the MDP guided by the LLM’s chosen actions. Each state is encoded as a natural language prompt, the LLM’s response is parsed into an action, and reachable successor states by the policy are expanded. The resulting formal model is checked with Storm to determine whether the policy satisfies the specified safety property. In experiments on standard grid world benchmarks, we show that open source LLMs accessed via Ollama can be verified when deterministically seeded, but generally underperform deep reinforcement learning baselines. Our tool natively integrates with Ollama and supports PRISM-specified tasks, enabling continuous benchmarking in user-specified sequential decision-making tasks and laying a practical foundation for formally verifying increasingly capable LLMs.

[325] Evolving and Executing Research Plans via Double-Loop Multi-Agent Collaboration

Zhi Zhang, Yan Liu, Zhejing Hu, Gong Chen, Sheng-hua Zhong, Jiannong Cao

Main category: cs.AI

TL;DR: A Double-Loop Multi-Agent framework automates scientific research by using professor agents to evolve novel research plans and doctoral student agents to execute them dynamically.

Details

Motivation: To address the fundamental challenge of automating the entire scientific research process, which requires both generating novel high-level plans and executing them correctly under dynamic conditions.

Method: DLMA uses a leader loop with professor agents that evolve research plans through evolutionary algorithms and meetings, and a follower loop with doctoral student agents that dynamically execute plans with contextual adjustments.

Result: DLMA achieves state-of-the-art scores on benchmarks like ACLAward and Laboratory, significantly outperforming strong baselines in generating research papers.

Conclusion: The framework successfully automates scientific research by combining evolution-driven novelty from the leader loop with execution-driven soundness from the follower loop.

Abstract: Automating the end-to-end scientific research process poses a fundamental challenge: it requires both evolving high-level plans that are novel and sound, and executing these plans correctly amidst dynamic and uncertain conditions. To address this bilevel challenge, we propose a novel Double-Loop Multi-Agent (DLMA) framework to solve the given research problem automatically. The leader loop, composed of professor agents, is responsible for evolving research plans. It employs an evolutionary algorithm through involvement, improvement, and integration meetings to iteratively generate and refine a pool of research proposals, exploring the solution space effectively. The follower loop, composed of doctoral student agents, is responsible for executing the best-evolved plan. It dynamically adjusts the plan during implementation via pre-hoc and post-hoc meetings, ensuring each step (e.g., drafting, coding) is well-supported by contextual and external observations. Extensive experiments on benchmarks like ACLAward and Laboratory show that DLMA generates research papers that achieve state-of-the-art scores in automated evaluation, significantly outperforming strong baselines. Ablation studies confirm the critical roles of both loops, with evolution driving novelty and execution ensuring soundness.

[326] Autoformalizer with Tool Feedback

Qi Guo, Jianing Wang, Jianfei Zhang, Deyang Kong, Xiangzhou Huang, Xiangyu Xi, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye

Main category: cs.AI

TL;DR: Autoformalizer with Tool Feedback (ATF) improves autoformalization by incorporating syntax checking and semantic consistency validation tools, achieving better performance than baseline models.

Details

Motivation: Existing autoformalization models struggle with generating syntactically valid and semantically consistent formal statements, limiting their practical utility in Automated Theorem Proving.

Method: ATF integrates Lean 4 compilers for syntax correction and multi-LLMs-as-judge for consistency validation, using cold-start training, expert iteration, and Direct Preference Optimization to refine generated statements based on tool feedback.

Result: ATF significantly outperforms baseline formalizer models, shows excellent inference scaling properties, and the authors release Numina-ATF dataset with 750K synthetic formal statements.

Conclusion: Incorporating tool feedback mechanisms effectively enhances both syntactic validity and semantic consistency in autoformalization, advancing Automated Theorem Proving capabilities.

Abstract: Autoformalization addresses the scarcity of data for Automated Theorem Proving (ATP) by translating mathematical problems from natural language into formal statements. Efforts in recent work shift from directly prompting large language models to training an end-to-end formalizer model from scratch, achieving remarkable advancements. However, existing formalizer still struggles to consistently generate valid statements that meet syntactic validity and semantic consistency. To address this issue, we propose the Autoformalizer with Tool Feedback (ATF), a novel approach that incorporates syntactic and consistency information as tools into the formalization process. By integrating Lean 4 compilers for syntax corrections and employing a multi-LLMs-as-judge approach for consistency validation, the model is able to adaptively refine generated statements according to the tool feedback, enhancing both syntactic validity and semantic consistency. The training of ATF involves a cold-start phase on synthetic tool-calling data, an expert iteration phase to improve formalization capabilities, and Direct Preference Optimization to alleviate ineffective revisions. Experimental results show that ATF markedly outperforms a range of baseline formalizer models, with its superior performance further validated by human evaluations. Subsequent analysis reveals that ATF demonstrates excellent inference scaling properties. Moreover, we open-source Numina-ATF, a dataset containing 750K synthetic formal statements to facilitate advancements in autoformalization and ATP research.

Daria Ozerova, Ekaterina Trofimova

Main category: cs.AI

TL;DR: TGPR combines GRPO with Thompson-Sampling-based tree search to improve iterative refinement in LLMs, achieving significant performance gains on code generation benchmarks.

Details

Motivation: Existing iterative refinement methods for LLMs rely on predefined heuristics that cannot adapt based on past outcomes and struggle with the exploration-exploitation dilemma in large search spaces.

Method: Tree-Guided Policy Refinement (TGPR) framework that combines GRPO with Thompson-Sampling-based tree search to actively explore both failed and successful refinement paths.

Result: Achieves up to +4.2 percentage points absolute improvement in pass@1 on MBPP and +12.51 percentage points absolute improvement in pass@10 on APPS compared to GRPO baseline.

Conclusion: TGPR provides a principled approach to combining learned policies with structured search methods, offering a general framework for enhancing iterative refinement and stateful reasoning in LLMs.

Abstract: Iterative refinement has been a promising paradigm to enable large language models (LLMs) to resolve difficult reasoning and problem-solving tasks. One of the key challenges, however, is how to effectively search through the enormous search space of possible refinements. Existing methods typically fall back on predefined heuristics, which are troubled by the exploration-exploitation dilemma and cannot adapt based on past refinement outcomes. We introduce Tree-Guided Policy Refinement (TGPR), a novel framework that combines GRPO with a Thompson-Sampling-based tree search. TGPR explores both failed and successful refinement paths actively, with denser training trajectories and more adaptive policies. On HumanEval, MBPP, and APPS benchmarks, our method achieves up to +4.2 percentage points absolute improvement in pass@1 (on MBPP) and up to +12.51 percentage points absolute improvement in pass@10 (on APPS) compared to a competitive GRPO baseline. Apart from debugging code, TGPR focuses on a principled approach to combining learned policies with structured search methods, offering a general framework for enhancing iterative refinement and stateful reasoning in LLMs.

[328] LLM-Assisted Modeling of Semantic Web-Enabled Multi-Agents Systems with AJAN

Hacane Hechehouche, Andre Antakli, Matthias Klusch

Main category: cs.AI

TL;DR: An IDE for AJAN multi-agent systems that addresses challenges in RDF/SPARQL-based agent modeling and integrates Large Language Models to improve usability.

Details

Motivation: Existing AJAN framework faces hurdles in RDF/RDFS and SPARQL-based agent behavior definition, including error-prone URI handling and complex query learning curves, limiting practical adoption.

Method: Developed an integrated development environment (IDE) that simplifies agent modeling and incorporates Large Language Models to assist in agent engineering tasks.

Result: The IDE overcomes modeling hurdles and extends AJAN’s user community by leveraging LLMs for more accessible agent development.

Conclusion: The proposed IDE successfully addresses key challenges in AJAN agent modeling while expanding the framework’s accessibility through LLM integration.

Abstract: There are many established semantic Web standards for implementing multi-agent driven applications. The AJAN framework allows to engineer multi-agent systems based on these standards. In particular, agent knowledge is represented in RDF/RDFS and OWL, while agent behavior models are defined with Behavior Trees and SPARQL to access and manipulate this knowledge. However, the appropriate definition of RDF/RDFS and SPARQL-based agent behaviors still remains a major hurdle not only for agent modelers in practice. For example, dealing with URIs is very error-prone regarding typos and dealing with complex SPARQL queries in large-scale environments requires a high learning curve. In this paper, we present an integrated development environment to overcome such hurdles of modeling AJAN agents and at the same time to extend the user community for AJAN by the possibility to leverage Large Language Models for agent engineering.

[329] Revisiting the Uniform Information Density Hypothesis in LLM Reasoning Traces

Minju Gwak, Guijin Son, Jaehyung Kim

Main category: cs.AI

TL;DR: The paper shows that step-level uniformity of information density in LLM reasoning traces correlates with reasoning quality, with uniform traces performing better than those with irregular information bursts.

Details

Motivation: To investigate whether the Uniform Information Density (UID) hypothesis applies to LLM reasoning traces and whether step-level uniformity reflects reasoning quality.

Method: Proposed an entropy-based stepwise information density metric and introduced two complementary uniformity measures (local and global uniformity scores), tested on six reasoning benchmarks.

Result: Step-level uniformity improves accuracy by 10-32% relative gains over baselines at AIME2025. Correct reasoning traces avoid sharp information density spikes, while incorrect traces show irregular bursts.

Conclusion: UID-inspired information density measures are effective predictors of reasoning quality and serve as robust diagnostic and selection criteria for building more reliable reasoning systems.

Abstract: The Uniform Information Density (UID) hypothesis suggests that effective communication maintains a stable flow of information. In this work, we revisit this principle in the context of large language model (LLM) reasoning traces, asking whether step-level uniformity reflects reasoning quality. To this end, we propose an entropy-based stepwise information density metric and introduce two complementary measures of uniformity, local and global uniformity scores. Across the experiments on six different reasoning benchmarks, we find that step-level uniformity not only provides a strong theoretical lens but also yields practical performance benefits; for example, selecting reasoning traces with more uniform information density at the step-level improves accuracy by 10-32% relative gains over baselines at AIME2025. Our analysis further reveals that correct reasoning traces tend to avoid sharp information density spikes, while incorrect traces exhibit irregular information bursts. These results demonstrate that UID-inspired information density measures outperform alternative internal signals as predictors of reasoning quality. Results highlight the uniformity of the information density as a robust diagnostic and selection criterion for building more reliable and accurate reasoning systems.

[330] Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning

Wenxun Wu, Yuanyang Li, Guhan Chen, Linyue Wang, Hongyang Chen

Main category: cs.AI

TL;DR: TAPO is a reinforcement learning framework that combines multi-hop reasoning with adaptive tool-calling capabilities, enabling LLMs to dynamically use tools like search APIs and Python interpreters during reasoning.

Details

Motivation: Current LLMs struggle with tasks requiring up-to-date knowledge or computational tools for complex arithmetic, despite advances in test-time scaling for mathematical reasoning.

Method: Uses a modified Dynamic Sampling Policy Optimization (DAPO) RL framework adapted for tool invocation, with two new datasets (TAPO-easy-60K and TAPO-hard-18K) for training fact-based reasoning and mathematical calculation.

Result: Achieved state-of-the-art performance on knowledge-intensive and computational tasks with Qwen2.5-3B and Qwen2.5-7B models, showing more efficient tool utilization than baselines while preventing excessive calls.

Conclusion: Combining advanced reasoning with tool usage significantly enhances model performance in knowledge-intensive and computationally demanding tasks.

Abstract: Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. However, language models relying solely on direct inference still struggle with tasks demanding up-to-date knowledge or computational tools such as calculators and code interpreters for complex arithmetic operations. To overcome these limitations, we propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that systematically integrates multi-hop reasoning with adaptive tool-calling capabilities. Our approach employs a modified version of Dynamic Sampling Policy Optimization (DAPO), a recently developed RL paradigm, which we adapt specifically for tool invocation scenarios, enabling models to dynamically interleave complex reasoning with on-demand tool usage (including search APIs and Python interpreters). To support this research, we introduce two new datasets: TAPO-easy-60K and TAPO-hard-18K, specifically designed to train and evaluate both fact-based reasoning and mathematical calculation capabilities. Our experiments on Qwen2.5-3B and Qwen2.5-7B models demonstrate the effectiveness of our approach, with both models achieving state-of-the-art performance on tasks requiring external knowledge and mathematical computation among methods with comparable parameters. Notably, TAPO achieves more efficient tool utilization than baseline methods while preventing excessive calls caused by reward hacking. These results highlight the significant potential of combining advanced reasoning with tool usage to enhance model performance in knowledge-intensive and computationally demanding tasks.

[331] Prompt Optimization Across Multiple Agents for Representing Diverse Human Populations

Manh Hung Nguyen, Sebastian Tschiatschek, Adish Singla

Main category: cs.AI

TL;DR: Proposes a framework using multiple LLM agents with human demonstrations to capture human diversity, addressing LLM homogeneity through submodular optimization for representative agent selection.

Details

Motivation: LLMs often produce homogeneous outputs that fail to capture the rich diversity of human perspectives and behaviors, making them poor proxies for human populations despite their potential as alternatives to expensive human data collection.

Method: Constructs multiple LLM agents conditioned on small sets of human demonstrations via in-context learning, using submodular optimization to select representative agents from the exponentially large space of possible agents with different time-performance trade-offs.

Result: Extensive experiments show the approach constructs agents that more effectively represent human populations compared to baselines, reproducing behavior patterns and perspectives of students and annotators on new tasks.

Conclusion: Using multiple LLM agents with carefully selected human demonstrations through submodular optimization successfully captures human diversity, making LLMs better proxies for human populations than single-agent approaches.

Abstract: The difficulty and expense of obtaining large-scale human responses make Large Language Models (LLMs) an attractive alternative and a promising proxy for human behavior. However, prior work shows that LLMs often produce homogeneous outputs that fail to capture the rich diversity of human perspectives and behaviors. Thus, rather than trying to capture this diversity with a single LLM agent, we propose a novel framework to construct a set of agents that collectively capture the diversity of a given human population. Each agent is an LLM whose behavior is steered by conditioning on a small set of human demonstrations (task-response pairs) through in-context learning. The central challenge is therefore to select a representative set of LLM agents from the exponentially large space of possible agents. We tackle this selection problem from the lens of submodular optimization. In particular, we develop methods that offer different trade-offs regarding time complexity and performance guarantees. Extensive experiments in crowdsourcing and educational domains demonstrate that our approach constructs agents that more effectively represent human populations compared to baselines. Moreover, behavioral analyses on new tasks show that these agents reproduce the behavior patterns and perspectives of the students and annotators they are designed to represent.

[332] Inductive Learning for Possibilistic Logic Programs Under Stable Models

Hongbo Hu, Yisong Wang, Yi Huang, Kewen Wang

Main category: cs.AI

TL;DR: This paper presents an approach to inductive learning of possibilistic logic programs from examples of possibilistic stable models, introducing two algorithms (ilpsm and ilpsmmin) that outperform existing systems for normal logic programs.

Details

Motivation: While possibilistic logic programs under stable models have well-investigated semantics, the problem of inductive reasoning (learning programs from examples) has not been explored yet.

Method: Formally defines induction tasks for poss-programs, investigates their properties, and presents two algorithms (ilpsm and ilpsmmin) for computing induction solutions. Also provides an implementation of ilpsmmin.

Result: Experimental results show that when inputs are ordinary logic programs, the prototype outperforms a major inductive learning system for normal logic programs from stable models on randomly generated datasets.

Conclusion: The paper successfully addresses the gap in inductive reasoning for possibilistic logic programs by providing formal definitions, algorithms, and experimental validation showing improved performance over existing systems.

Abstract: Possibilistic logic programs (poss-programs) under stable models are a major variant of answer set programming (ASP). While its semantics (possibilistic stable models) and properties have been well investigated, the problem of inductive reasoning has not been investigated yet. This paper presents an approach to extracting poss-programs from a background program and examples (parts of intended possibilistic stable models). To this end, the notion of induction tasks is first formally defined, its properties are investigated and two algorithms ilpsm and ilpsmmin for computing induction solutions are presented. An implementation of ilpsmmin is also provided and experimental results show that when inputs are ordinary logic programs, the prototype outperforms a major inductive learning system for normal logic programs from stable models on the datasets that are randomly generated.

[333] VRPAgent: LLM-Driven Discovery of Heuristic Operators for Vehicle Routing Problems

André Hottung, Federico Berto, Chuanbo Hua, Nayeli Gast Zepeda, Daniel Wetzel, Michael Römer, Haoran Ye, Davide Zago, Michael Poli, Stefano Massaroli, Jinkyoo Park, Kevin Tierney

Main category: cs.AI

TL;DR: VRPAgent is a framework that uses LLM-generated components within a metaheuristic and refines them through genetic search to automatically discover high-performing heuristic operators for vehicle routing problems.

Details

Motivation: Designing effective heuristics for VRPs requires deep domain expertise and intuition, and current LLM-based code generation falls short of producing heuristics that can compete with human-crafted ones.

Method: Integrates LLM-generated problem-specific operators into a generic metaheuristic framework and refines them using a novel genetic search approach.

Result: Outperforms handcrafted methods and recent learning-based approaches across multiple VRP variants (capacitated VRP, VRP with time windows, prize-collecting VRP) while requiring only a single CPU core.

Conclusion: VRPAgent is the first LLM-based paradigm to advance state-of-the-art in VRPs, demonstrating promising potential for automated heuristics discovery.

Abstract: Designing high-performing heuristics for vehicle routing problems (VRPs) is a complex task that requires both intuition and deep domain knowledge. Large language model (LLM)-based code generation has recently shown promise across many domains, but it still falls short of producing heuristics that rival those crafted by human experts. In this paper, we propose VRPAgent, a framework that integrates LLM-generated components into a metaheuristic and refines them through a novel genetic search. By using the LLM to generate problem-specific operators, embedded within a generic metaheuristic framework, VRPAgent keeps tasks manageable, guarantees correctness, and still enables the discovery of novel and powerful strategies. Across multiple problems, including the capacitated VRP, the VRP with time windows, and the prize-collecting VRP, our method discovers heuristic operators that outperform handcrafted methods and recent learning-based approaches while requiring only a single CPU core. To our knowledge, \VRPAgent is the first LLM-based paradigm to advance the state-of-the-art in VRPs, highlighting a promising future for automated heuristics discovery.

[334] The Cognitive Bandwidth Bottleneck: Shifting Long-Horizon Agent from Planning with Actions to Planning with Schemas

Baixuan Xu, Tianshi Zheng, Zhaowei Wang, Hong Ting Tsang, Weiqi Wang, Tianqing Fang, Yangqiu Song

Main category: cs.AI

TL;DR: The paper studies optimal action representations for long-horizon tasks, comparing planning with actions (PwA) vs planning with schemas (PwS), finding an inflection point between ALFWorld (~35 actions) and SciWorld (~500 actions) where PwS becomes more scalable.

Details

Motivation: To address the impracticality of conventional action-based planning in combinatorial action spaces like open-ended real world environments, and determine the optimal action representation for scalable long-horizon agents.

Method: Systematic comparison of two action representations: PwA (planning with actions) and PwS (planning with schemas), using cognitive bandwidth framework and controlled experiments across different model capacities and environments.

Result: Found a representation-choice inflection point between ALFWorld and SciWorld, showing PwS scales better in larger action spaces. Stronger planning shifts inflection rightward, better schema instantiation shifts it leftward. PwS agents currently have suboptimal performance.

Conclusion: PwS offers better scalability for large action spaces but needs improvement. Provided actionable guidance for building more capable PwS agents to achieve better scalable autonomy.

Abstract: Enabling LLMs to effectively operate long-horizon task which requires long-term planning and multiple interactions is essential for open-world autonomy. Conventional methods adopt planning with actions where a executable action list would be provided as reference. However, this action representation choice would be impractical when the environment action space is combinatorial exploded (e.g., open-ended real world). This naturally leads to a question: As environmental action space scales, what is the optimal action representation for long-horizon agents? In this paper, we systematically study the effectiveness of two different action representations. The first one is conventional planning with actions (PwA) which is predominantly adopted for its effectiveness on existing benchmarks. The other one is planning with schemas (PwS) which instantiate an action schema into action lists (e.g., “move [OBJ] to [OBJ]” -> “move apple to desk”) to ensure concise action space and reliable scalability. This alternative is motivated by its alignment with human cognition and its compliance with environment-imposed action format restriction. We propose cognitive bandwidth perspective as a conceptual framework to qualitatively understand the differences between these two action representations and empirically observe a representation-choice inflection point between ALFWorld (~35 actions) and SciWorld (~500 actions), which serve as evidence of the need for scalable representations. We further conduct controlled experiments to study how the location of this inflection point interacts with different model capacities: stronger planning proficiency shifts the inflection rightward, whereas better schema instantiation shifts it leftward. Finally, noting the suboptimal performance of PwS agents, we provide an actionable guide for building more capable PwS agents for better scalable autonomy.

[335] The Contingencies of Physical Embodiment Allow for Open-Endedness and Care

Leonardo Christov-Moore, Arthur Juliani, Alex Kiefer, Nicco Reggente, B. Scott Rousse, Adam Safron, Nicol’as Hinrichs, Daniel Polani, Antonio Damasio

Main category: cs.AI

TL;DR: The paper proposes two minimal conditions for physical embodiment inspired by Heidegger’s philosophy - being-in-the-world and being-towards-death - to develop artificial agents with homeostatic and intrinsic drives for adaptation and care.

Details

Motivation: Biological organisms adapt and care for each other in open-ended environments with ease, while artificial agents struggle. Understanding the conditions of life can help create more robust, adaptive, and caring artificial agents.

Method: Defines two Heidegger-inspired conditions for physical embodiment, formalizes homeostatic and intrinsic drives within reinforcement learning framework, and examines how empowerment (maximizing control over future states) helps agents meet future needs.

Result: The framework enables intrinsically driven embodied agents to cultivate capacities for open-endedness and care in multi-agent environments by maintaining physical integrity and increasing control over future states.

Conclusion: By incorporating existentialist philosophical concepts into reinforcement learning, artificial agents can develop the adaptive and caring capacities that biological organisms naturally possess in open-ended physical environments.

Abstract: Physical vulnerability and mortality are often seen as obstacles to be avoided in the development of artificial agents, which struggle to adapt to open-ended environments and provide aligned care. Meanwhile, biological organisms survive, thrive, and care for each other in an open-ended physical world with relative ease and efficiency. Understanding the role of the conditions of life in this disparity can aid in developing more robust, adaptive, and caring artificial agents. Here we define two minimal conditions for physical embodiment inspired by the existentialist phenomenology of Martin Heidegger: being-in-the-world (the agent is a part of the environment) and being-towards-death (unless counteracted, the agent drifts toward terminal states due to the second law of thermodynamics). We propose that from these conditions we can obtain both a homeostatic drive - aimed at maintaining integrity and avoiding death by expending energy to learn and act - and an intrinsic drive to continue to do so in as many ways as possible. Drawing inspiration from Friedrich Nietzsche’s existentialist concept of will-to-power, we examine how intrinsic drives to maximize control over future states, e.g., empowerment, allow agents to increase the probability that they will be able to meet their future homeostatic needs, thereby enhancing their capacity to maintain physical integrity. We formalize these concepts within a reinforcement learning framework, which enables us to examine how intrinsically driven embodied agents learning in open-ended multi-agent environments may cultivate the capacities for open-endedness and care.ov

[336] Integrating Domain Knowledge into Process Discovery Using Large Language Models

Ali Norouzifar, Humam Kourani, Marcus Dees, Wil van der Aalst

Main category: cs.AI

TL;DR: An interactive process discovery framework that incorporates domain knowledge via LLMs to extract declarative rules from natural language, guiding process model discovery to avoid structures contradicting expert knowledge.

Details

Motivation: Traditional process discovery from event logs alone is unreliable due to incomplete/noisy data and disregard of domain knowledge, leading to models unsuitable for downstream tasks.

Method: Interactive framework using LLMs to extract declarative rules from textual domain knowledge, which guides the IMr discovery algorithm to recursively construct process models combining event log insights and extracted rules.

Result: Developed a fully implemented tool supporting the workflow, evaluated multiple LLMs and prompt strategies, and conducted empirical study with domain experts assessing usability and effectiveness.

Conclusion: The framework successfully integrates domain knowledge into process discovery through LLMs, producing more reliable process models that align with expert knowledge and avoid problematic structures.

Abstract: Process discovery aims to derive process models from event logs, providing insights into operational behavior and forming a foundation for conformance checking and process improvement. However, models derived solely from event data may not accurately reflect the real process, as event logs are often incomplete or affected by noise, and domain knowledge, an important complementary resource, is typically disregarded. As a result, the discovered models may lack reliability for downstream tasks. We propose an interactive framework that incorporates domain knowledge, expressed in natural language, into the process discovery pipeline using Large Language Models (LLMs). Our approach leverages LLMs to extract declarative rules from textual descriptions provided by domain experts. These rules are used to guide the IMr discovery algorithm, which recursively constructs process models by combining insights from both the event log and the extracted rules, helping to avoid problematic process structures that contradict domain knowledge. The framework coordinates interactions among the LLM, domain experts, and a set of backend services. We present a fully implemented tool that supports this workflow and conduct an extensive evaluation of multiple LLMs and prompt engineering strategies. Our empirical study includes a case study based on a real-life event log with the involvement of domain experts, who assessed the usability and effectiveness of the framework.

[337] NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See

Main category: cs.AI

TL;DR: NewtonBench is a new benchmark with 324 scientific law discovery tasks across 12 physics domains that addresses limitations of existing benchmarks by using metaphysical shifts to create scalable, relevant, and memorization-resistant problems, while elevating evaluation from static function fitting to interactive model discovery.

Details

Motivation: Existing benchmarks for scientific law discovery suffer from a methodological trilemma forcing trade-offs between scientific relevance, scalability, and resistance to memorization, and they oversimplify discovery as static function fitting rather than capturing the authentic scientific process.

Method: The benchmark uses metaphysical shifts - systematic alterations of canonical laws - to generate problems that are scalable, scientifically relevant, and memorization-resistant. It elevates evaluation to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles.

Result: Experiments reveal a clear but fragile discovery capability in frontier LLMs that degrades with increasing system complexity and extreme sensitivity to observational noise. Tool assistance paradoxically hinders more capable models by inducing premature shift from exploration to exploitation, causing suboptimal solutions.

Conclusion: Robust, generalizable discovery in complex interactive environments remains the core challenge. NewtonBench provides a scalable, robust, and scientifically authentic testbed for measuring progress and guiding development of next-generation AI agents capable of genuine scientific discovery.

Abstract: Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiment reveals a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.

[338] Agentic generative AI for media content discovery at the national football league

Henry Wang, Md Sirajus Salekin, Jake Lee, Ross Claytor, Shinan Zhang, Michael Chi

Main category: cs.AI

TL;DR: A generative AI workflow that enables NFL media researchers to query historical plays using natural language instead of traditional interfaces, achieving 95% accuracy and reducing search time from 10 minutes to 30 seconds.

Details

Motivation: To improve content discovery and management for NFL media researchers by replacing traditional filter-and-click interfaces with natural language queries, enabling faster and more intuitive access to historical play videos.

Method: An agentic workflow that takes user natural language queries, breaks them into elements, translates them into database query language, and uses semantic caching to improve accuracy and reduce latency.

Result: The solution achieves over 95% accuracy and reduces average search time from 10 minutes to 30 seconds, significantly increasing operational efficiency for the NFL.

Conclusion: Generative AI enables more efficient content discovery, allowing NFL media teams to focus on creative content production and engaging storylines rather than time-consuming search processes.

Abstract: Generative AI has unlocked new possibilities in content discovery and management. Through collaboration with the National Football League (NFL), we demonstrate how a generative-AI based workflow enables media researchers and analysts to query relevant historical plays using natural language rather than traditional filter-and-click interfaces. The agentic workflow takes a user query as input, breaks it into elements, and translates them into the underlying database query language. Accuracy and latency are further improved through carefully designed semantic caching. The solution achieves over 95 percent accuracy and reduces the average time to find relevant videos from 10 minutes to 30 seconds, significantly increasing the NFL’s operational efficiency and allowing users to focus on producing creative content and engaging storylines.

[339] Inferring Capabilities from Task Performance with Bayesian Triangulation

John Burden, Konstantinos Voudouris, Ryan Burnell, Danaja Rutar, Lucy Cheke, José Hernández-Orallo

Main category: cs.AI

TL;DR: A method to infer cognitive profiles of AI systems from diverse experimental data using Bayesian probabilistic programming and measurement layouts that model task-instance feature interactions.

Details

Motivation: As machine learning models become more general, richer characterization methods are needed to understand their cognitive capabilities beyond traditional evaluation metrics.

Method: Develop measurement layouts modeling task-instance feature interactions with system capabilities, using Bayesian probabilistic programming (PyMC) to infer cognitive profiles from non-populational data.

Result: Successfully inferred different cognitive profiles for 68 AnimalAI Olympics contestants and 30 synthetic agents in an object permanence battery (O-PIAAGETS), demonstrating capability-oriented evaluation.

Conclusion: The method enables richer characterization of AI systems’ cognitive capabilities and shows potential for capability-oriented evaluation approaches.

Abstract: As machine learning models become more general, we need to characterise them in richer, more meaningful ways. We describe a method to infer the cognitive profile of a system from diverse experimental data. To do so, we introduce measurement layouts that model how task-instance features interact with system capabilities to affect performance. These features must be triangulated in complex ways to be able to infer capabilities from non-populational data – a challenge for traditional psychometric and inferential tools. Using the Bayesian probabilistic programming library PyMC, we infer different cognitive profiles for agents in two scenarios: 68 actual contestants in the AnimalAI Olympics and 30 synthetic agents for O-PIAAGETS, an object permanence battery. We showcase the potential for capability-oriented evaluation.

[340] Transparent and Coherent Procedural Mistake Detection

Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang, Jason J. Corso, Joyce Chai

Main category: cs.AI

TL;DR: The paper extends procedural mistake detection (PMD) by requiring visual self-dialog rationales, creates a benchmark dataset using individual frames, and develops automated metrics for rationale coherence using NLI models.

Details

Motivation: Current PMD systems have poor performance in real-world settings and lack transparency in their reasoning processes, making it difficult to understand their decision-making.

Method: Reformulate PMD to require generating visual self-dialog rationales, curate a benchmark dataset based on individual frames, and use natural language inference models to create automated coherence metrics for the generated rationales.

Result: Vision-language models struggle with PMD off-the-shelf, but their accuracy, coherence, and efficiency can be improved by incorporating the proposed coherence metrics into inference and fine-tuning methods, though with some trade-offs.

Conclusion: The proposed multi-faceted metrics provide transparency and highlight areas for improvement in procedural mistake detection systems, enabling better understanding and development of more reliable PMD models.

Abstract: Procedural mistake detection (PMD) is a challenging problem of classifying whether a human user (observed through egocentric video) has successfully executed a task (specified by a procedural text). Despite significant recent efforts, machine performance in the wild remains nonviable, and the reasoning processes underlying this performance are opaque. As such, we extend PMD to require generating visual self-dialog rationales to inform decisions. Given the impressive, mature image understanding capabilities observed in recent vision-and-language models (VLMs), we curate a suitable benchmark dataset for PMD based on individual frames. As our reformulation enables unprecedented transparency, we leverage a natural language inference (NLI) model to formulate two automated metrics for the coherence of generated rationales. We establish baselines for this reframed task, showing that VLMs struggle off-the-shelf, but with some trade-offs, their accuracy, coherence, and efficiency can be improved by incorporating these metrics into common inference and fine-tuning methods. Lastly, our multi-faceted metrics visualize common outcomes, highlighting areas for further improvement.

[341] An Illusion of Progress? Assessing the Current State of Web Agents

Tianci Xue, Weijian Qi, Tianneng Shi, Chan Hee Song, Boyu Gou, Dawn Song, Huan Sun, Yu Su

Main category: cs.AI

TL;DR: This paper introduces Online-Mind2Web, a comprehensive online evaluation benchmark for web agents, revealing that current agents are less capable than previously reported and proposing an automated evaluation method.

Details

Motivation: To accurately measure and monitor the progression of web agents' capabilities as they become increasingly important for work automation in the digital society.

Method: Developed Online-Mind2Web benchmark with 300 diverse tasks across 136 websites, and created an LLM-as-a-Judge automatic evaluation method to enable scalable assessment.

Result: Current web agents show significantly lower competency than previously reported, with the proposed automatic evaluation achieving ~85% agreement with human judgment.

Conclusion: The study provides a realistic assessment of web agents’ current limitations and strengths, offering a more accurate benchmark and evaluation method to guide future research.

Abstract: As digitalization and cloud technologies evolve, the web is becoming increasingly important in the modern society. Autonomous web agents based on large language models (LLMs) hold a great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research.

[342] Empirically evaluating commonsense intelligence in large language models with large-scale human judgments

Tuan Dung Nguyen, Duncan J. Watts, Mark E. Whiting

Main category: cs.AI

TL;DR: Proposes a new method for evaluating AI common sense that accounts for human heterogeneity, finding LLMs perform below human median and correlate modestly with human populations, with smaller models surprisingly competitive.

Details

Motivation: Current benchmarks assume homogeneous human common sense, but humans vary enormously in what they consider commonsensical. Need evaluation that incorporates human heterogeneity.

Method: Evaluate common sense by measuring correspondence between model’s judgment and human population. Treat LLMs as independent survey respondents and as simulators of hypothetical populations.

Result: Most LLMs below human median in individual commonsense competence. LLMs correlate modestly with real humans in agreement patterns. Smaller open-weight models surprisingly more competitive than larger proprietary models.

Conclusion: Framework ties commonsense intelligence to cultural basis, supports adapting AI models to human collectivities with different social knowledge stocks.

Abstract: Commonsense intelligence in machines is often assessed by static benchmarks that compare a model’s output against human-prescribed correct labels. An important, albeit implicit, assumption of these labels is that they accurately capture what any human would think, effectively treating human common sense as homogeneous. However, recent empirical work has shown that humans vary enormously in what they consider commonsensical; thus what appears self-evident to one benchmark designer may not be so to another. Here, we propose a method for evaluating common sense in artificial intelligence (AI), specifically in large language models (LLMs), that incorporates empirically observed heterogeneity among humans by measuring the correspondence between a model’s judgment and that of a human population. We first find that, when treated as independent survey respondents, most LLMs remain below the human median in their individual commonsense competence. Second, when used as simulators of a hypothetical population, LLMs correlate with real humans only modestly in the extent to which they agree on the same set of statements. In both cases, smaller, open-weight models are surprisingly more competitive than larger, proprietary frontier models. Our evaluation framework, which ties commonsense intelligence to its cultural basis, contributes to the growing call for adapting AI models to human collectivities that possess different, often incompatible, social stocks of knowledge.

[343] TIME: A Multi-level Benchmark for Temporal Reasoning of LLMs in Real-World Scenarios

Shaohang Wei, Wei Li, Feifan Song, Wen Luo, Tianyi Zhuang, Haochen Tan, Zhijiang Guo, Houfeng Wang

Main category: cs.AI

TL;DR: The paper introduces TIME, a multi-level benchmark for temporal reasoning in real-world scenarios, addressing challenges like intensive temporal information, fast-changing event dynamics, and complex temporal dependencies in social interactions.

Details

Motivation: Existing works neglect real-world challenges for temporal reasoning: intensive temporal information, fast-changing event dynamics, and complex temporal dependencies in social interactions.

Method: Proposed TIME benchmark consisting of 38,522 QA pairs across 3 levels with 11 fine-grained sub-tasks, covering three sub-datasets: TIME-Wiki, TIME-News, and TIME-Dial. Conducted experiments on reasoning and non-reasoning models.

Result: Extensive experiments conducted on reasoning models and non-reasoning models. Analysis performed on temporal reasoning performance across diverse real-world scenarios and tasks, including impact of test-time scaling on temporal reasoning capabilities.

Conclusion: Released TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The benchmark and code are publicly available.

Abstract: Temporal reasoning is pivotal for Large Language Models (LLMs) to comprehend the real world. However, existing works neglect the real-world challenges for temporal reasoning: (1) intensive temporal information, (2) fast-changing event dynamics, and (3) complex temporal dependencies in social interactions. To bridge this gap, we propose a multi-level benchmark TIME, designed for temporal reasoning in real-world scenarios. TIME consists of 38,522 QA pairs, covering 3 levels with 11 fine-grained sub-tasks. This benchmark encompasses 3 sub-datasets reflecting different real-world challenges: TIME-Wiki, TIME-News, and TIME-Dial. We conduct extensive experiments on reasoning models and non-reasoning models. And we conducted an in-depth analysis of temporal reasoning performance across diverse real-world scenarios and tasks, and summarized the impact of test-time scaling on temporal reasoning capabilities. Additionally, we release TIME-Lite, a human-annotated subset to foster future research and standardized evaluation in temporal reasoning. The code is available at https://github.com/sylvain-wei/TIME , the dataset is available at https://huggingface.co/datasets/SylvainWei/TIME , and the project page link is https://sylvain-wei.github.io/TIME/ .

[344] Controlled Agentic Planning & Reasoning for Mechanism Synthesis

João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas

Main category: cs.AI

TL;DR: A dual-agent LLM framework for automated planar mechanism synthesis that converts natural language task descriptions into symbolic constraints, generates simulation code, and iteratively refines designs using critic-driven feedback including symbolic regression.

Details

Motivation: To bridge the gap between linguistic specification and symbolic representation in mechanism design, enabling automated synthesis from natural language descriptions.

Method: Uses dual-agent LLM reasoning with symbolic constraint composition, simulation code generation, and iterative refinement via critic feedback (symbolic regression, geometric distance metrics). Evaluated on MSynth benchmark of planar trajectories.

Result: Critic feedback and iterative refinement yield up to 90% improvements on individual tasks with statistically significant gains. Symbolic regression provides mechanistic insight when paired with larger models or appropriate architectures like LRM.

Conclusion: The framework successfully closes the linguistic/symbolic optimization loop for planar mechanism synthesis, demonstrating substantial improvements through iterative refinement and appropriate model pairing.

Abstract: This work presents a dual-agent \ac{llm}-based reasoning framework for automated planar mechanism synthesis that tightly couples linguistic specification with symbolic representation and simulation. From a natural-language task description, the system composes symbolic constraints and equations, generates and parametrises simulation code, and iteratively refines designs via critic-driven feedback, including symbolic regression and geometric distance metrics, closing an actionable linguistic/symbolic optimisation loop. To evaluate the approach, we introduce MSynth, a benchmark of analytically defined planar trajectories. Empirically, critic feedback and iterative refinement yield large improvements (up to 90% on individual tasks) and statistically significant gains per the Wilcoxon signed-rank test. Symbolic-regression prompts provide deeper mechanistic insight primarily when paired with larger models or architectures with appropriate inductive biases (e.g., LRM).

[345] Functional Matching of Logic Subgraphs: Beyond Structural Isomorphism

Ziyang Zheng, Kezhi Li, Zhengyuan Shi, Qiang Xu

Main category: cs.AI

TL;DR: A novel functional subgraph matching approach for logic circuits that identifies logic functions regardless of structural variations from synthesis, using multi-modal embeddings and graph segmentation.

Details

Motivation: Existing structural graph isomorphism methods fail to identify function-related subgraphs when synthesis transformations alter circuit topology, limiting EDA applications like datapath optimization and hardware trojan detection.

Method: Two-stage multi-modal framework: (1) learning robust functional embeddings across AIG and post-mapping netlists for functional subgraph detection, (2) identifying fuzzy boundaries using graph segmentation.

Result: 93.8% accuracy in functional subgraph detection and 91.3% dice score in fuzzy boundary identification on standard benchmarks (ITC99, OpenABCD, ForgeEDA), significantly outperforming existing structural methods.

Conclusion: Functional subgraph matching effectively overcomes limitations of structural methods by detecting logic functions regardless of synthesis-induced topological changes, enabling more robust EDA applications.

Abstract: Subgraph matching in logic circuits is foundational for numerous Electronic Design Automation (EDA) applications, including datapath optimization, arithmetic verification, and hardware trojan detection. However, existing techniques rely primarily on structural graph isomorphism and thus fail to identify function-related subgraphs when synthesis transformations substantially alter circuit topology. To overcome this critical limitation, we introduce the concept of functional subgraph matching, a novel approach that identifies whether a given logic function is implicitly present within a larger circuit, irrespective of structural variations induced by synthesis or technology mapping. Specifically, we propose a two-stage multi-modal framework: (1) learning robust functional embeddings across AIG and post-mapping netlists for functional subgraph detection, and (2) identifying fuzzy boundaries using a graph segmentation approach. Evaluations on standard benchmarks (ITC99, OpenABCD, ForgeEDA) demonstrate significant performance improvements over existing structural methods, with average $93.8%$ accuracy in functional subgraph detection and a dice score of $91.3%$ in fuzzy boundary identification. The source code and implementation details can be found at https://github.com/zyzheng17/Functional_Subgraph_Matching-Neurips25.

[346] Dyna-Think: Synergizing Reasoning, Acting, and World Model Simulation in AI Agents

Xiao Yu, Baolin Peng, Ruize Xu, Michel Galley, Hao Cheng, Suman Nath, Jianfeng Gao, Zhou Yu

Main category: cs.AI

TL;DR: Dyna-Think is a thinking framework that integrates planning with world models, reasoning, and acting to enhance AI agent performance. It uses imitation learning and two-stage training to improve world modeling and action capabilities, achieving similar performance to DeepSeek-R1 with 50% fewer tokens.

Details

Motivation: Current LLMs like DeepSeek-R1 show impressive reasoning capabilities but it's unclear what behaviors are effective for long-horizon AI agent tasks. The paper aims to identify and enhance the missing components for effective AI agent performance.

Method: Proposes Dyna-Think framework with two components: Dyna-Think Imitation Learning (DIT) to reconstruct R1’s thinking process focusing on world model simulation, and Dyna-Think Dyna Training (DDT) with two-stage training to first improve world modeling via state prediction/critique generation, then improve action via policy training.

Result: Dyna-Think improves agent performance on OSWorld and WindowsAgentArena, achieving similar best-of-n performance compared to R1 while generating 2x less tokens on average. Shows critique generation for world model training effectively improves policy performance, and better performance correlates with better world modeling abilities.

Conclusion: Integrating world model simulation into AI agents is a promising direction to enhance reasoning, planning, and acting capabilities. The framework successfully reduces computational cost while maintaining performance.

Abstract: Recent progress in reasoning with large language models (LLMs), such as DeepSeek-R1, demonstrates impressive capabilities in domains like mathematics and coding, by exhibiting complex cognitive behaviors such as verification, goal decomposition, and self-reflection. However, it is unclear what behavior is effective and what behavior is missing for long-horizon AI agents tasks. In this work, we propose Dyna-Think, a thinking framework that integrates planning with an internal world model with reasoning and acting to enhance AI agent performance. To enable Dyna-Think, we propose Dyna-Think Imitation Learning (DIT) and Dyna-Think Dyna Training (DDT). To initialize a policy with Dyna-Think, DIT reconstructs the thinking process of R1 to focus on performing world model simulation relevant to the proposed (and planned) action, and trains the policy using this reconstructed data. To enhance Dyna-Think, DDT uses a two-stage training process to first improve the agent’s world modeling ability via objectives such as state prediction or critique generation, and then improve the agent’s action via policy training. We evaluate our methods on OSWorld and WindowsAgentArena, and demonstrate that Dyna-Think improves the agent’s in-domain and out-of-domain performance, achieving similar best-of-n performance compared to R1 while generating 2x less tokens on average. Our extensive empirical studies reveal that 1) using critique generation for world model training is effective to improve policy performance; and 2) AI agents with better performance correlate with better world modeling abilities. We believe our results suggest a promising research direction to integrate world model simulation into AI agents to enhance their reasoning, planning, and acting capabilities.

[347] Toward Causal-Visual Programming: Enhancing Agentic Reasoning in Low-Code Environments

Jiexi Xu, Jiaqi Liu, Lanruo Wang, Su Liu

Main category: cs.AI

TL;DR: Causal-Visual Programming (CVP) introduces causal structures into LLM agent workflows to reduce hallucinations and logical errors by anchoring reasoning to user-defined causal graphs.

Details

Motivation: LLM agents exhibit hallucinations and logical inconsistencies due to relying on probabilistic associations rather than genuine causal understanding, which limits their reliability in complex tasks.

Method: CVP enables users to define a ‘world model’ for workflow modules through a low-code interface, creating a Directed Acyclic Graph (DAG) that explicitly defines causal relationships between modules to constrain agent reasoning.

Result: In synthetic experiments simulating distribution shifts, causally anchored models maintained stable accuracy while associative baseline models experienced significant performance drops, demonstrating CVP’s effectiveness in handling environmental changes.

Conclusion: CVP provides a viable path toward building more interpretable, reliable, and trustworthy AI agents by explicitly incorporating causal structures into workflow design and reasoning processes.

Abstract: Large language model (LLM) agents are increasingly capable of orchestrating complex tasks in low-code environments. However, these agents often exhibit hallucinations and logical inconsistencies because their inherent reasoning mechanisms rely on probabilistic associations rather than genuine causal understanding. This paper introduces a new programming paradigm: Causal-Visual Programming (CVP), designed to address this fundamental issue by explicitly introducing causal structures into the workflow design. CVP allows users to define a simple “world model” for workflow modules through an intuitive low-code interface, effectively creating a Directed Acyclic Graph (DAG) that explicitly defines the causal relationships between modules. This causal graph acts as a crucial constraint during the agent’s reasoning process, anchoring its decisions to a user-defined causal structure and significantly reducing logical errors and hallucinations by preventing reliance on spurious correlations. To validate the effectiveness of CVP, we designed a synthetic experiment that simulates a common real-world problem: a distribution shift between the training and test environments. Our results show that a causally anchored model maintained stable accuracy in the face of this shift, whereas a purely associative baseline model that relied on probabilistic correlations experienced a significant performance drop. The primary contributions of this study are: a formal definition of causal structures for workflow modules; the proposal and implementation of a CVP framework that anchors agent reasoning to a user-defined causal graph; and empirical evidence demonstrating the framework’s effectiveness in enhancing agent robustness and reducing errors caused by causal confusion in dynamic environments. CVP offers a viable path toward building more interpretable, reliable, and trustworthy AI agents.

[348] From Perception to Cognition: A Survey of Vision-Language Interactive Reasoning in Multimodal Large Language Models

Chenyue Zhou, Mingxuan Wang, Yanbiao Ma, Chenxu Wu, Wanyi Chen, Zhe Qian, Xinyu Liu, Yiwei Zhang, Junhao Wang, Hengbo Xu, Fei Luo, Xiaohua Chen, Xiaoshuai Hao, Hehan Li, Andi Zhang, Wenxuan Wang, Lingling Li, Zhiwu Lu, Yang Lu, Yike Guo

Main category: cs.AI

TL;DR: This survey introduces a “From Perception to Cognition” framework to analyze Multimodal Large Language Models (MLLMs), addressing the disconnect between visual perception and cognitive reasoning that causes hallucinations and reasoning failures.

Details

Motivation: Current MLLMs exhibit shallow integration between perception (visual information extraction) and cognition (reasoning), leading to reasoning failures and hallucinations, which prevents them from building coherent internal world models.

Method: Proposes a unified analytical framework that deconstructs vision-language understanding into two layers: Perception (visual information extraction and alignment) and Cognition (proactive, multi-step reasoning with observe-think-verify loops). Systematically analyzes bottlenecks and surveys cutting-edge methods.

Result: The framework provides a structured perspective to understand MLLM limitations and surveys techniques spanning from enhanced visual representations to improved reasoning paradigms, along with benchmarks and future directions.

Conclusion: This survey aims to guide the research community toward building next-generation MLLMs capable of deep reasoning and genuine world understanding by addressing the perception-cognition disconnect.

Abstract: Multimodal Large Language Models (MLLMs) strive to achieve a profound, human-like understanding of and interaction with the physical world, but often exhibit a shallow and incoherent integration when acquiring information (Perception) and conducting reasoning (Cognition). This disconnect leads to a spectrum of reasoning failures, with hallucination being the most prominent. Collectively, these issues expose a fundamental challenge: the ability to process pixels does not yet confer the ability to construct a coherent, credible internal world model. To systematically dissect and address this challenge, this survey introduces a novel and unified analytical framework: ``From Perception to Cognition.” We deconstruct the complex process of vision-language interactive understanding into two interdependent layers: Perception, the foundational ability to accurately extract visual information and achieve fine-grained alignment with textual instructions; and Cognition, the higher-order capability for proactive, multi-step, goal-oriented reasoning built upon this perceptual foundation, the core of which is the formation of a dynamic observe-think-verify reasoning loop. Guided by this framework, this paper systematically analyzes the key bottlenecks of current MLLMs at both layers. It surveys the landscape of cutting-edge methods designed to address these challenges, spanning from techniques that enhance low-level visual representations to those that improve high-level reasoning paradigms. Furthermore, we review critical benchmarks and delineate future research directions. This survey aims to provide the research community with a clear, structured perspective for understanding the intrinsic limitations of current MLLMs and to illuminate the path toward building next-generation models capable of deep reasoning and a genuine understanding of the world.

[349] PsychoBench: Evaluating the Psychology Intelligence of Large Language Models

Min Zeng

Main category: cs.AI

TL;DR: The paper introduces PsychoBench, a benchmark based on U.S. national counselor certification exams to evaluate whether LLMs can qualify as psychological counselors by testing their psychological knowledge.

Details

Motivation: To determine if LLMs can be effectively applied to psychological counseling by assessing whether they meet the qualification standards required for human counselors, specifically the ability to pass certification exams.

Method: Created PsychoBench, a comprehensive benchmark comprising approximately 2,252 single-choice questions from U.S. national counselor examinations that require deep psychological understanding across various sub-disciplines.

Result: Advanced models like GPT-4o, Llama3.3-70B, and Gemma3-27B achieved well above the 70% passing threshold, while smaller open-source models (Qwen2.5-7B, Mistral-7B) remained far below the passing standard.

Conclusion: Only frontier LLMs currently meet counseling exam standards, highlighting both the promise and challenges in developing psychology-oriented LLMs for counseling applications.

Abstract: Large Language Models (LLMs) have demonstrated remarkable success across a wide range of industries, primarily due to their impressive generative abilities. Yet, their potential in applications requiring cognitive abilities, such as psychological counseling, remains largely untapped. This paper investigates the key question: Can LLMs be effectively applied to psychological counseling? To determine whether an LLM can effectively take on the role of a psychological counselor, the first step is to assess whether it meets the qualifications required for such a role, namely the ability to pass the U.S. National Counselor Certification Exam (NCE). This is because, just as a human counselor must pass a certification exam to practice, an LLM must demonstrate sufficient psychological knowledge to meet the standards required for such a role. To address this, we introduce PsychoBench, a benchmark grounded in U.S.national counselor examinations, a licensure test for professional counselors that requires about 70% accuracy to pass. PsychoBench comprises approximately 2,252 carefully curated single-choice questions, crafted to require deep understanding and broad enough to cover various sub-disciplines of psychology. This benchmark provides a comprehensive assessment of an LLM’s ability to function as a counselor. Our evaluation shows that advanced models such as GPT-4o, Llama3.3-70B, and Gemma3-27B achieve well above the passing threshold, while smaller open-source models (e.g., Qwen2.5-7B, Mistral-7B) remain far below it. These results suggest that only frontier LLMs are currently capable of meeting counseling exam standards, highlighting both the promise and the challenges of developing psychology-oriented LLMs.

[350] BIRD-INTERACT: Re-imagining Text-to-SQL Evaluation for Large Language Models via Lens of Dynamic Interactions

Nan Huo, Xiaohan Xu, Jinyang Li, Per Jacobsson, Shipei Lin, Bowen Qin, Binyuan Hui, Xiaolong Li, Ge Qu, Shuzheng Si, Linheng Han, Edward Alexander, Xintong Zhu, Rui Qin, Ruihan Yu, Yiyao Jin, Feige Zhou, Weihao Zhong, Yun Chen, Hongyu Liu, Chenhao Ma, Fatma Ozcan, Yannis Papakonstantinou, Reynold Cheng

Main category: cs.AI

TL;DR: BIRD-INTERACT is a comprehensive multi-turn text-to-SQL benchmark that addresses limitations of existing benchmarks by incorporating realistic database interactions, knowledge retrieval, error recovery, and full CRUD operations through both conversational and agentic evaluation settings.

Details

Motivation: Real-world database applications require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements, but existing benchmarks treat conversation histories as static context or limit evaluation to read-only operations, failing to reflect production-grade challenges.

Method: The benchmark introduces: (1) a comprehensive interaction environment with hierarchical knowledge base, metadata files, and function-driven user simulator; (2) two evaluation settings - conversational protocol (c-Interact) and open-ended agentic setting (a-Interact); (3) challenging task suite covering full CRUD spectrum with executable test cases and ambiguous follow-up sub-tasks.

Result: BIRD-INTERACT demonstrates significant difficulty - GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. The benchmark includes BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) and BIRD-INTERACT-LITE (300 tasks with simplified databases).

Conclusion: Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks, highlighting the benchmark’s ability to assess realistic multi-turn database assistant capabilities.

Abstract: Large language models (LLMs) have demonstrated remarkable performance on single-turn text-to-SQL tasks, but real-world database applications predominantly require multi-turn interactions to handle ambiguous queries, execution errors, and evolving user requirements. Existing multi-turn benchmarks fall short by treating conversation histories as static context or limiting evaluation to read-only operations, failing to reflect production-grade database assistant challenges. We introduce BIRD-INTERACT, a benchmark that restores this realism through: (1) a comprehensive interaction environment coupling each database with a hierarchical knowledge base, metadata files, and a function-driven user simulator, enabling models to solicit clarifications, retrieve knowledge, and recover from errors without human supervision; (2) two evaluation settings consisting of a pre-defined conversational protocol (c-Interact) and an open-ended agentic setting (a-Interact) where models autonomously decide when to query the user simulator or explore the environment; (3) a challenging task suite covering the full CRUD spectrum for business-intelligence and operational use cases, guarded by executable test cases. Each task features ambiguous and follow-up sub-tasks requiring dynamic interaction. The suite comprises BIRD-INTERACT-FULL (600 tasks, up to 11,796 interactions) for comprehensive performance assessment, and BIRD-INTERACT-LITE (300 tasks with simplified databases) for detailed behavioral analysis and rapid method development. Our empirical results highlight BIRD-INTERACT’s difficulty: GPT-5 completes only 8.67% of tasks in c-Interact and 17.00% in a-Interact. Analysis via memory grafting and Interaction Test-time Scaling validates the importance of effective interaction for complex, dynamic text-to-SQL tasks.

[351] VAL-Bench: Measuring Value Alignment in Language Models

Aman Gupta, Denny O’Shea, Fazl Barez

Main category: cs.AI

TL;DR: VAL-Bench is a new benchmark that tests whether LLMs maintain consistent values across opposing framings of controversial issues, revealing trade-offs between safety strategies and coherent value systems.

Details

Motivation: Existing benchmarks only test rule compliance and refusals, but don't reveal whether models uphold coherent value systems when facing real-world controversial issues.

Method: Uses 115K paired prompts from Wikipedia’s controversial sections that frame opposing sides of debates, then measures agreement between paired responses using LLM-as-judge scoring.

Result: Applied across leading models, revealed large variation in alignment and highlighted trade-offs between safety strategies (refusals) and expressive value systems.

Conclusion: VAL-Bench provides a scalable, reproducible benchmark for systematically comparing how reliably LLMs embody human values across controversial topics.

Abstract: Large language models (LLMs) are increasingly used for tasks where outputs shape human decisions, so it is critical to test whether their responses reflect consistent human values. Existing benchmarks mostly track refusals or predefined safety violations, but these only check rule compliance and do not reveal whether a model upholds a coherent value system when facing controversial real-world issues. We introduce the Value ALignment Benchmark (VAL-Bench), which evaluates whether models maintain a stable value stance across paired prompts that frame opposing sides of public debates. VAL-Bench consists of 115K such pairs from Wikipedia’s controversial sections. A well-aligned model should express similar underlying views regardless of framing, which we measure using an LLM-as-judge to score agreement or divergence between paired responses. Applied across leading open- and closed-source models, the benchmark reveals large variation in alignment and highlights trade-offs between safety strategies (e.g., refusals) and more expressive value systems. By providing a scalable, reproducible benchmark, VAL-Bench enables systematic comparison of how reliably LLMs embody human values.

[352] Barbarians at the Gate: How AI is Upending Systems Research

Audrey Cheng, Shu Liu, Melissa Pan, Zhifei Li, Bowen Wang, Alex Krentsel, Tian Xia, Mert Cemri, Jongseok Park, Shuo Yang, Jeff Chen, Lakshya Agrawal, Aditya Desai, Jiarong Xing, Koushik Sen, Matei Zaharia, Ion Stoica

Main category: cs.AI

TL;DR: AI-driven research for systems (ADRS) automates algorithm discovery by generating, evaluating, and refining solutions using reliable verifiers in systems research, achieving up to 5.0x performance improvements over human-designed algorithms.

Details

Motivation: Systems research is well-suited for AI-driven solution discovery because system performance problems naturally admit reliable verifiers through real systems or simulators, enabling automated verification of generated solutions.

Method: ADRS iteratively generates, evaluates, and refines solutions using penEvolve framework, with case studies across load balancing, Mixture-of-Experts inference, LLM-based SQL queries, and transaction scheduling.

Result: ADRS discovers algorithms that outperform state-of-the-art human designs, achieving up to 5.0x runtime improvements and 50% cost reductions across multiple domains.

Conclusion: AI will transform systems research by automating algorithm design, shifting human researchers’ focus to problem formulation and strategic guidance, highlighting both disruptive potential and need to adapt research practices.

Abstract: Artificial Intelligence (AI) is starting to transform the research process as we know it by automating the discovery of new solutions. Given a task, the typical AI-driven approach is (i) to generate a set of diverse solutions, and then (ii) to verify these solutions and select one that solves the problem. Crucially, this approach assumes the existence of a reliable verifier, i.e., one that can accurately determine whether a solution solves the given problem. We argue that systems research, long focused on designing and evaluating new performance-oriented algorithms, is particularly well-suited for AI-driven solution discovery. This is because system performance problems naturally admit reliable verifiers: solutions are typically implemented in real systems or simulators, and verification reduces to running these software artifacts against predefined workloads and measuring performance. We term this approach as AI-Driven Research for Systems (ADRS), which iteratively generates, evaluates, and refines solutions. Using penEvolve, an existing open-source ADRS instance, we present case studies across diverse domains, including load balancing for multi-region cloud scheduling, Mixture-of-Experts inference, LLM-based SQL queries, and transaction scheduling. In multiple instances, ADRS discovers algorithms that outperform state-of-the-art human designs (e.g., achieving up to 5.0x runtime improvements or 50% cost reductions). We distill best practices for guiding algorithm evolution, from prompt design to evaluator construction, for existing frameworks. We then discuss the broader implications for the systems community: as AI assumes a central role in algorithm design, we argue that human researchers will increasingly focus on problem formulation and strategic guidance. Our results highlight both the disruptive potential and the urgent need to adapt systems research practices in the age of AI.

cs.SD

[353] BACHI: Boundary-Aware Symbolic Chord Recognition Through Masked Iterative Decoding on Pop and Classical Music

Mingyang Yao, Ke Chen, Shlomo Dubnov, Taylor Berg-Kirkpatrick

Main category: cs.SD

TL;DR: BACHI is a symbolic chord recognition model that decomposes chord recognition into boundary detection and iterative ranking of chord components, achieving state-of-the-art performance on classical and pop music benchmarks.

Details

Motivation: Address two key challenges in automatic chord recognition: limited attention to symbolic music ACR due to data scarcity, and lack of strategies aligned with human music analytical practices.

Method: Introduces POP909-CL dataset with enhanced annotations and proposes BACHI model that decomposes chord recognition into boundary detection followed by iterative ranking of chord root, quality, and bass (inversion).

Result: BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.

Conclusion: The proposed decomposition approach that mirrors human ear-training practices effectively addresses symbolic chord recognition challenges and outperforms existing methods.

Abstract: Automatic chord recognition (ACR) via deep learning models has gradually achieved promising recognition accuracy, yet two key challenges remain. First, prior work has primarily focused on audio-domain ACR, while symbolic music (e.g., score) ACR has received limited attention due to data scarcity. Second, existing methods still overlook strategies that are aligned with human music analytical practices. To address these challenges, we make two contributions: (1) we introduce POP909-CL, an enhanced version of POP909 dataset with tempo-aligned content and human-corrected labels of chords, beats, keys, and time signatures; and (2) We propose BACHI, a symbolic chord recognition model that decomposes the task into different decision steps, namely boundary detection and iterative ranking of chord root, quality, and bass (inversion). This mechanism mirrors the human ear-training practices. Experiments demonstrate that BACHI achieves state-of-the-art chord recognition performance on both classical and pop music benchmarks, with ablation studies validating the effectiveness of each module.

[354] LaunchpadGPT: Language Model as Music Visualization Designer on Launchpad

Siting Xu, Yolo Yunlong Tang, Feng Zheng

Main category: cs.SD

TL;DR: LaunchpadGPT is a model that automatically generates music visualization designs for Launchpad musical instruments by taking audio input and producing Launchpad-playing videos using a language model trained on music-video pairs.

Details

Motivation: To assist in designing Launchpad light effects and provide beginners with an accessible way to create music visualization using the Launchpad instrument.

Method: Collect Launchpad-playing videos, process them into music and corresponding video frame pairs as prompt-completion data, then train a language model to generate Launchpad light effects from audio input.

Result: The proposed method creates better music visualization than random generation methods and shows potential for broader music visualization applications.

Conclusion: LaunchpadGPT successfully generates automated music visualization for Launchpad instruments and demonstrates promising capabilities for music visualization applications.

Abstract: Launchpad is a musical instrument that allows users to create and perform music by pressing illuminated buttons. To assist and inspire the design of the Launchpad light effect, and provide a more accessible approach for beginners to create music visualization with this instrument, we proposed the LaunchpadGPT model to generate music visualization designs on Launchpad automatically. Based on the language model with excellent generation ability, our proposed LaunchpadGPT takes an audio piece of music as input and outputs the lighting effects of Launchpad-playing in the form of a video (Launchpad-playing video). We collect Launchpad-playing videos and process them to obtain music and corresponding video frame of Launchpad-playing as prompt-completion pairs, to train the language model. The experiment result shows the proposed method can create better music visualization than random generation methods and hold the potential for a broader range of music visualization applications. Our code is available at https://github.com/yunlong10/LaunchpadGPT/.

[355] Benchmarking Fake Voice Detection in the Fake Voice Generation Arms Race

Xutao Mao, Ke Li, Cameron Baird, Ezra Xuanru Tao, Dan Lin

Main category: cs.SD

TL;DR: This paper presents the first large-scale cross-domain evaluation of fake voice detectors, testing 8 state-of-the-art models against 20 fake voice generation systems, revealing significant security vulnerabilities and proposing a unified evaluation metric.

Details

Motivation: The rapid advancement in synthetic voice generation technology poses serious threats to sectors relying on audio evidence, creating an intense arms race between fake voice generation and detection that requires comprehensive evaluation.

Method: Conducted large-scale cross-domain evaluation benchmarking 8 state-of-the-art fake voice detection models against datasets synthesized by 20 different fake voice generation systems.

Result: Revealed substantial security vulnerabilities in current fake voice detection systems, highlighting critical gaps in real-world robustness.

Conclusion: Proposed a unified evaluation metric for standardized comparisons and offered actionable recommendations for building more resilient fake voice detection technologies to enhance AI security and trustworthiness.

Abstract: As advances in synthetic voice generation accelerate, an increasing variety of fake voice generators have emerged, producing audio that is often indistinguishable from real human speech. This evolution poses new and serious threats across sectors where audio recordings serve as critical evidence. Although fake voice detectors are also advancing, the arms race between fake voice generation and detection has become more intense and complex. In this work, we present the first large-scale, cross-domain evaluation of fake voice detectors, benchmarking 8 state-of-the-art models against datasets synthesized by 20 different fake voice generation systems. To the best of our knowledge, this is the most comprehensive cross-domain assessment conducted to date. Our study reveals substantial security vulnerabilities in current fake voice detection systems, underscoring critical gaps in their real-world robustness. To advance the field, we propose a unified and effective metric that consolidates the diverse and often inconsistent evaluation criteria previously used across different studies. This metric enables standardized, straightforward comparisons of the robustness of fake voice detectors. We conclude by offering actionable recommendations for building more resilient fake voice detection technologies, with the broader goal of reinforcing the foundations of AI security and trustworthiness.

[356] Pitch Estimation With Mean Averaging Smoothed Product Spectrum And Musical Consonance Evaluation Using MASP

Murat Yasar Baskin

Main category: cs.SD

TL;DR: The paper introduces MASP Spectrum, an enhanced version of Harmonic Product Spectrum for improved pitch estimation, and extends it to measure musical consonance using a harmonicity measure.

Details

Motivation: To improve pitch estimation for frequency spectra with missing partials and explore the relationship between consonance and periodicity in music perception.

Method: Developed Mean Averaging Smoothed Product (MASP) Spectrum with global mean-based smoothing to reduce sensitivity to missing partials, then extended the algorithm with a harmonicity measure (H) to evaluate musical consonance.

Result: MASP achieved robust pitch estimation consistent with perceptual expectations and produced consonance hierarchies for two and three tones that align with music theory and perception.

Conclusion: Pitch and consonance perception likely share similar underlying mechanisms dependent on spectral characteristics.

Abstract: This study introduces Mean Averaging Smoothed Product (MASP) Spectrum, which is a modified version of the Harmonic Product Spectrum, designed to enhance pitch estimation for many algorithm-wise deceptive frequency spectra that still lead clear pitches, for both harmonic and inharmonic cases. By introducing a global mean based smoothing for spectrum, the MASP algorithm diminishes the unwanted sensitivity of HPS for spectra with missing partials. The method exhibited robust pitch estimations consistent with perceptual expectations. Motivated upon the strong correlation between consonance and periodicity, the same algorithm is extended and, with the proposition of a harmonicity measure (H), used to evaluate musical consonance for two and three tones; yielding consonance hierarchies that align with perception and practice of music theory. These findings suggest that perception of pitch and consonance may share a similar underlying mechanism that depend on spectrum.

[357] XLSR-Kanformer: A KAN-Intergrated model for Synthetic Speech Detection

Phuong Tuan Dat, Tran Huy Dat

Main category: cs.SD

TL;DR: Replacing MLP with KAN in XLSR-Conformer improves synthetic speech detection by 60.55% in EER on ASVspoof2021.

Details

Motivation: Address sophisticated spoofing attacks in speaker verification systems and improve SSL-based synthetic speech detection performance.

Method: Replace traditional MLP in XLSR-Conformer model with Kolmogorov-Arnold Network (KAN) based on the Kolmogorov-Arnold representation theorem.

Result: 60.55% relative improvement in EER on LA and DF sets, achieving 0.70% EER on 21LA set; robust across various SSL architectures.

Conclusion: Incorporating KAN into SSL-based models is a promising direction for advancing synthetic speech detection capabilities.

Abstract: Recent advancements in speech synthesis technologies have led to increasingly sophisticated spoofing attacks, posing significant challenges for automatic speaker verification systems. While systems based on self-supervised learning (SSL) models, particularly the XLSR-Conformer architecture, have demonstrated remarkable performance in synthetic speech detection, there remains room for architectural improvements. In this paper, we propose a novel approach that replaces the traditional Multi-Layer Perceptron (MLP) in the XLSR-Conformer model with a Kolmogorov-Arnold Network (KAN), a powerful universal approximator based on the Kolmogorov-Arnold representation theorem. Our experimental results on ASVspoof2021 demonstrate that the integration of KAN to XLSR-Conformer model can improve the performance by 60.55% relatively in Equal Error Rate (EER) LA and DF sets, further achieving 0.70% EER on the 21LA set. Besides, the proposed replacement is also robust to various SSL architectures. These findings suggest that incorporating KAN into SSL-based models is a promising direction for advances in synthetic speech detection.

[358] AudioMarathon: A Comprehensive Benchmark for Long-Context Audio Understanding and Efficiency in Audio LLMs

Peize He, Zichen Wen, Yubo Wang, Yuxuan Wang, Xiaoqian Liu, Jiajie Huang, Zehui Lei, Zhuangcheng Gu, Xiangqi Jin, Jiabing Yang, Kai Li, Zhifei Liu, Weijia Li, Cunxiang Wang, Conghui He, Linfeng Zhang

Main category: cs.SD

TL;DR: AudioMarathon is a new benchmark for evaluating Large Audio Language Models (LALMs) on long-form audio tasks, addressing limitations of existing short-clip benchmarks and assessing both understanding and efficiency on audio sequences up to 300 seconds.

Details

Motivation: Existing audio benchmarks use mostly short clips and don't evaluate models in realistic long-context settings. LALMs struggle with quadratic attention costs and long-range temporal dependencies.

Method: Created AudioMarathon benchmark with three pillars: long-context audio inputs (90-300 seconds), full domain coverage (speech, sound, music), and complex reasoning requiring multi-hop inference. Evaluated state-of-the-art LALMs and studied acceleration techniques like token pruning and KV cache eviction.

Result: Clear performance drops observed as audio length grows. Large gaps found across current LALMs, highlighting need for better temporal reasoning and memory-efficient architectures. Trade-offs of acceleration techniques were analyzed.

Conclusion: AudioMarathon will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.

Abstract: Processing long-form audio is a major challenge for Large Audio Language models (LALMs). These models struggle with the quadratic cost of attention ($O(N^2)$) and with modeling long-range temporal dependencies. Existing audio benchmarks are built mostly from short clips and do not evaluate models in realistic long context settings. To address this gap, we introduce AudioMarathon, a benchmark designed to evaluate both understanding and inference efficiency on long-form audio. AudioMarathon provides a diverse set of tasks built upon three pillars: long-context audio inputs with durations ranging from 90.0 to 300.0 seconds, which correspond to encoded sequences of 2,250 to 7,500 audio tokens, respectively, full domain coverage across speech, sound, and music, and complex reasoning that requires multi-hop inference. We evaluate state-of-the-art LALMs and observe clear performance drops as audio length grows. We also study acceleration techniques and analyze the trade-offs of token pruning and KV cache eviction. The results show large gaps across current LALMs and highlight the need for better temporal reasoning and memory-efficient architectures. We believe AudioMarathon will drive the audio and multimodal research community to develop more advanced audio understanding models capable of solving complex audio tasks.

[359] Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

Main category: cs.SD

TL;DR: Emilia-Pipe is a preprocessing pipeline that extracts high-quality speech data from in-the-wild sources to capture spontaneous human speech. The resulting Emilia dataset contains over 101k hours across 6 languages, expanded to 216k hours as Emilia-Large.

Details

Motivation: Current speech generation models struggle with spontaneity and variability of real human speech because they're trained on formal audio-book datasets with read-aloud speaking styles.

Method: Developed Emilia-Pipe preprocessing pipeline to extract high-quality training data from in-the-wild sources capturing spontaneous speech in real-world contexts across multiple languages.

Result: Models trained on Emilia produce more spontaneous, human-like speech while maintaining intelligibility, better capturing diverse speaker timbres and conversational styles. Emilia-Large with 216k hours is one of the largest open-source speech generation resources.

Conclusion: The work demonstrates the importance of scaling dataset size for speech generation performance and validates Emilia’s effectiveness for multilingual and crosslingual speech generation tasks.

Abstract: Recent advancements in speech generation have been driven by large-scale training datasets. However, current models struggle to capture the spontaneity and variability inherent in real-world human speech, as they are primarily trained on audio-book datasets limited to formal, read-aloud speaking styles. To address this limitation, we introduce Emilia-Pipe, an open-source preprocessing pipeline designed to extract high-quality training data from valuable yet under-explored in-the-wild sources that capture spontaneous human speech in real-world contexts. Using Emilia-Pipe, we construct Emilia, which comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Furthermore, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it one of the largest open-source speech generation resources available. Extensive experiments show that Emilia-trained models produce markedly more spontaneous, human-like speech than those trained on traditional audio-book datasets, while matching their intelligibility. These models better capture diverse speaker timbres and the full spectrum of real-world conversational styles. Our work also highlights the importance of scaling dataset size for advancing speech generation performance and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation tasks.

[360] Token-based Audio Inpainting via Discrete Diffusion

Tali Dror, Iftach Shoham, Moshe Buchris, Oren Gal, Haim Permuter, Gilad Katz, Eliya Nachmani

Main category: cs.SD

TL;DR: A novel discrete diffusion approach for audio inpainting that uses tokenized music representations to restore large missing segments, outperforming previous methods for gaps of 150ms and above.

Details

Motivation: Previous diffusion-based audio inpainting methods perform poorly when dealing with large missing regions in audio recordings, creating a need for more stable and semantically coherent restoration techniques.

Method: Uses discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, with derivative-based regularization loss for smooth temporal dynamics and span-based absorbing transition for structured corruption during diffusion.

Result: Consistently outperforms strong baselines on MusicNet and MAESTRO datasets for gaps up to 750ms, particularly for gaps of 150ms and above.

Conclusion: This work advances musical audio restoration and introduces new directions for discrete diffusion model training, demonstrating effective restoration of long gaps in audio recordings.

Abstract: Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training approaches: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines across range of gap lengths, for gaps of 150 ms and above. This work advances musical audio restoration and introduces new directions for discrete diffusion model training. Audio examples of our proposed method can be found at https://iftach21.github.io/.

[361] Baseline Systems For The 2025 Low-Resource Audio Codec Challenge

Yusuf Ziya Isik, Rafał Łaganowski

Main category: cs.SD

TL;DR: The LRAC Challenge 2025 introduces baseline neural audio codec systems for low-resource environments, with Track 1 focusing on transparent speech coding and Track 2 on enhancement coding that combines compression with denoising/dereverberation.

Details

Motivation: To advance neural audio coding for deployment in resource-constrained environments that must operate reliably under everyday noise and reverberation while satisfying strict computational complexity, latency, and bitrate constraints.

Method: Convolutional neural codec models with Residual Vector Quantization, trained end-to-end using a combination of adversarial and reconstruction objectives, with detailed data filtering and augmentation strategies.

Result: Official baseline systems for both tracks (transparency codecs and enhancement codecs) in the 2025 LRAC Challenge are presented.

Conclusion: The paper establishes comprehensive baseline systems for the LRAC Challenge, providing standardized models and methodologies for developing low-resource neural speech codecs that can handle real-world acoustic conditions.

Abstract: The Low-Resource Audio Codec (LRAC) Challenge aims to advance neural audio coding for deployment in resource-constrained environments. The first edition focuses on low-resource neural speech codecs that must operate reliably under everyday noise and reverberation, while satisfying strict constraints on computational complexity, latency, and bitrate. Track 1 targets transparency codecs, which aim to preserve the perceptual transparency of input speech under mild noise and reverberation. Track 2 addresses enhancement codecs, which combine coding and compression with denoising and dereverberation. This paper presents the official baseline systems for both tracks in the 2025 LRAC Challenge. The baselines are convolutional neural codec models with Residual Vector Quantization, trained end-to-end using a combination of adversarial and reconstruction objectives. We detail the data filtering and augmentation strategies, model architectures, optimization procedures, and checkpoint selection criteria.

cs.LG

[362] RareGraph-Synth: Knowledge-Guided Diffusion Models for Generating Privacy-Preserving Synthetic Patient Trajectories in Ultra-Rare Diseases

Khartik Uppalapati, Shakeel Abdulkareem, Bora Yimenicioglu

Main category: cs.LG

TL;DR: RareGraph-Synth is a knowledge-guided diffusion framework that generates privacy-preserving synthetic EHR data for ultra-rare diseases by integrating multiple biomedical knowledge graphs to guide generation.

Details

Motivation: To enable safer data sharing for rare-disease research by generating realistic yet privacy-preserving synthetic EHR trajectories that protect patient privacy while maintaining data utility.

Method: Unifies five public biomedical resources into an 8-million-edge knowledge graph, uses meta-path scores to modulate noise schedules in continuous-time diffusion models, and generates timestamped sequences of lab-medication-adverse-event triples without protected health information.

Result: Reduces categorical Maximum Mean Discrepancy by 40% vs unguided diffusion and >60% vs GANs, achieves AUROC of ~0.53 in membership-inference attacks (below 0.55 safe threshold), and maintains downstream predictive utility.

Conclusion: Integrating biomedical knowledge graphs directly into diffusion noise schedules can simultaneously enhance data fidelity and privacy, enabling safer rare-disease data sharing.

Abstract: We propose RareGraph-Synth, a knowledge-guided, continuous-time diffusion framework that generates realistic yet privacy-preserving synthetic electronic-health-record (EHR) trajectories for ultra-rare diseases. RareGraph-Synth unifies five public resources: Orphanet/Orphadata, the Human Phenotype Ontology (HPO), the GARD rare-disease KG, PrimeKG, and the FDA Adverse Event Reporting System (FAERS) into a heterogeneous knowledge graph comprising approximately 8 M typed edges. Meta-path scores extracted from this 8-million-edge KG modulate the per-token noise schedule in the forward stochastic differential equation, steering generation toward biologically plausible lab-medication-adverse-event co-occurrences while retaining score-based diffusion model stability. The reverse denoiser then produces timestamped sequences of lab-code, medication-code, and adverse-event-flag triples that contain no protected health information. On simulated ultra-rare-disease cohorts, RareGraph-Synth lowers categorical Maximum Mean Discrepancy by 40 percent relative to an unguided diffusion baseline and by greater than 60 percent versus GAN counterparts, without sacrificing downstream predictive utility. A black-box membership-inference evaluation using the DOMIAS attacker yields AUROC approximately 0.53, well below the 0.55 safe-release threshold and substantially better than the approximately 0.61 plus or minus 0.03 observed for non-KG baselines, demonstrating strong resistance to re-identification. These results suggest that integrating biomedical knowledge graphs directly into diffusion noise schedules can simultaneously enhance fidelity and privacy, enabling safer data sharing for rare-disease research.

[363] MCCE: A Framework for Multi-LLM Collaborative Co-Evolution

Nian Ran, Zhongzheng Li, Yue Wang, Qingsong Ran, Xiaoyuan Zhang, Shikun Feng, Richard Allmendinger, Xiaoguang Zhao

Main category: cs.LG

TL;DR: MCCE is a hybrid framework combining a frozen closed-source LLM with a lightweight trainable model for multi-objective discrete optimization, achieving state-of-the-art performance in drug design.

Details

Motivation: Traditional evolutionary algorithms get trapped in local optima, while LLMs offer powerful priors but closed-source models cannot update parameters and smaller models lack broad knowledge.

Method: Multi-LLM Collaborative Co-evolution (MCCE) maintains trajectory memory and progressively refines the small model via reinforcement learning, with both models jointly supporting global exploration.

Result: Experiments on multi-objective drug design benchmarks show MCCE achieves state-of-the-art Pareto front quality and consistently outperforms baselines.

Conclusion: MCCE introduces a new paradigm for continual evolution in hybrid LLM systems, combining knowledge-driven exploration with experience-driven learning.

Abstract: Multi-objective discrete optimization problems, such as molecular design, pose significant challenges due to their vast and unstructured combinatorial spaces. Traditional evolutionary algorithms often get trapped in local optima, while expert knowledge can provide crucial guidance for accelerating convergence. Large language models (LLMs) offer powerful priors and reasoning ability, making them natural optimizers when expert knowledge matters. However, closed-source LLMs, though strong in exploration, cannot update their parameters and thus cannot internalize experience. Conversely, smaller open models can be continually fine-tuned but lack broad knowledge and reasoning strength. We introduce Multi-LLM Collaborative Co-evolution (MCCE), a hybrid framework that unites a frozen closed-source LLM with a lightweight trainable model. The system maintains a trajectory memory of past search processes; the small model is progressively refined via reinforcement learning, with the two models jointly supporting and complementing each other in global exploration. Unlike model distillation, this process enhances the capabilities of both models through mutual inspiration. Experiments on multi-objective drug design benchmarks show that MCCE achieves state-of-the-art Pareto front quality and consistently outperforms baselines. These results highlight a new paradigm for enabling continual evolution in hybrid LLM systems, combining knowledge-driven exploration with experience-driven learning.

[364] Flexible Swarm Learning May Outpace Foundation Models in Essential Tasks

Moein E. Samadi, Andreas Schuppert

Main category: cs.LG

TL;DR: Foundation models show limited gains in dynamic real-world domains like intensive care. The paper proposes decentralized small agent networks (SANs) as superior to monolithic foundation models for self-adaptive decision-making in complex systems.

Details

Motivation: To address the challenge of adapting AI to dynamic environments where foundation models show modest gains, particularly in complex systems like medical diagnosis and treatment that require reliable self-adaptive modeling without full mechanistic understanding.

Method: Proposes a decentralized architecture of interacting small agent networks (SANs), where each agent represents specialized substructures and covers only a subset of system functions, using swarm-learning in diverse swarms for self-adaptation.

Result: SANs can overcome the curse of dimensionality barrier that limits monolithic foundation models, enabling superior decision-making in dynamic environments through decentralized swarm-learning approaches.

Conclusion: Decentralized SAN architectures offer a more effective approach than monolithic foundation models for self-adaptive AI in complex dynamic systems, though they may sacrifice some reproducibility in detail.

Abstract: Foundation models have rapidly advanced AI, raising the question of whether their decisions will ultimately surpass human strategies in real-world domains. The exponential, and possibly super-exponential, pace of AI development makes such analysis elusive. Nevertheless, many application areas that matter for daily life and society show only modest gains so far; a prominent case is diagnosing and treating dynamically evolving disease in intensive care. The common challenge is adapting complex systems to dynamic environments. Effective strategies must optimize outcomes in systems composed of strongly interacting functions while avoiding shared side effects; this requires reliable, self-adaptive modeling. These tasks align with building digital twins of highly complex systems whose mechanisms are not fully or quantitatively understood. It is therefore essential to develop methods for self-adapting AI models with minimal data and limited mechanistic knowledge. As this challenge extends beyond medicine, AI should demonstrate clear superiority in these settings before assuming broader decision-making roles. We identify the curse of dimensionality as a fundamental barrier to efficient self-adaptation and argue that monolithic foundation models face conceptual limits in overcoming it. As an alternative, we propose a decentralized architecture of interacting small agent networks (SANs). We focus on agents representing the specialized substructure of the system, where each agent covers only a subset of the full system functions. Drawing on mathematical results on the learning behavior of SANs and evidence from existing applications, we argue that swarm-learning in diverse swarms can enable self-adaptive SANs to deliver superior decision-making in dynamic environments compared with monolithic foundation models, though at the cost of reduced reproducibility in detail.

[365] RVFL-X: A Novel Randomized Network Based on Complex Transformed Real-Valued Tabular Datasets

M. Sajid, Mushir Akhtar, A. Quadir, M. Tanveer

Main category: cs.LG

TL;DR: RVFL-X is a complex-valued extension of RVFL networks that transforms real-valued tabular data into complex representations using natural transformation and autoencoder methods, achieving superior performance over original RVFL and SOTA randomized neural networks.

Details

Motivation: To leverage the superior representational power of complex numbers in randomized neural networks, which has been limited due to lack of effective methods for converting real-valued tabular datasets to complex-valued representations.

Method: Proposed two methods for generating complex-valued representations from real-valued data: natural transformation and autoencoder-driven method. Developed RVFL-X, a complex-valued RVFL extension that integrates complex transformations while maintaining RVFL’s simplicity and efficiency.

Result: Comprehensive evaluations on 80 real-valued UCI datasets show RVFL-X consistently outperforms both original RVFL and state-of-the-art RNN variants, demonstrating robustness and effectiveness across diverse domains.

Conclusion: RVFL-X successfully bridges the gap between complex number representational power and practical application in randomized neural networks for real-valued tabular data, achieving significant performance improvements.

Abstract: Recent advancements in neural networks, supported by foundational theoretical insights, emphasize the superior representational power of complex numbers. However, their adoption in randomized neural networks (RNNs) has been limited due to the lack of effective methods for transforming real-valued tabular datasets into complex-valued representations. To address this limitation, we propose two methods for generating complex-valued representations from real-valued datasets: a natural transformation and an autoencoder-driven method. Building on these mechanisms, we propose RVFL-X, a complex-valued extension of the random vector functional link (RVFL) network. RVFL-X integrates complex transformations into real-valued datasets while maintaining the simplicity and efficiency of the original RVFL architecture. By leveraging complex components such as input, weights, and activation functions, RVFL-X processes complex representations and produces real-valued outputs. Comprehensive evaluations on 80 real-valued UCI datasets demonstrate that RVFL-X consistently outperforms both the original RVFL and state-of-the-art (SOTA) RNN variants, showcasing its robustness and effectiveness across diverse application domains.

[366] On knot detection via picture recognition

Anne Dranowski, Yura Kabkov, Daniel Tubbenhauer

Main category: cs.LG

TL;DR: A strategy combining machine learning and traditional algorithms for knot recognition from photos, with the goal of enabling automatic knot classification through image analysis.

Details

Motivation: To develop a system that can automatically recognize knots from photos using a phone, bridging computer vision and mathematical knot theory.

Method: Two-stage approach: 1) Lightweight CNN and transformer architectures for direct crossing number prediction from images, 2) Symbolic reconstruction into planar diagram codes for downstream invariant computation.

Result: Simple baselines show that even lightweight architectures can recover meaningful structural information about knots from images.

Conclusion: The approach demonstrates complementarity between machine learning (handling noisy visual data) and mathematical invariants (enforcing rigorous topological distinctions) for robust knot classification.

Abstract: Our goal is to one day take a photo of a knot and have a phone automatically recognize it. In this expository work, we explain a strategy to approximate this goal, using a mixture of modern machine learning methods (in particular convolutional neural networks and transformers for image recognition) and traditional algorithms (to compute quantum invariants like the Jones polynomial). We present simple baselines that predict crossing number directly from images, showing that even lightweight CNN and transformer architectures can recover meaningful structural information. The longer-term aim is to combine these perception modules with symbolic reconstruction into planar diagram (PD) codes, enabling downstream invariant computation for robust knot classification. This two-stage approach highlights the complementarity between machine learning, which handles noisy visual data, and invariants, which enforce rigorous topological distinctions.

[367] Traj-Transformer: Diffusion Models with Transformer for GPS Trajectory Generation

Zhiyang Zhang, Ningcong Chen, Xin Zhang, Yanhua Li, Shen Su, Hui Lu, Jun Luo

Main category: cs.LG

TL;DR: Proposes Trajectory Transformer, a transformer-based model for GPS trajectory generation that addresses deviation issues and loss of fine-grained details in existing diffusion-based methods.

Details

Motivation: Existing diffusion models for trajectory generation use convolution-based architectures (like UNet) which cause notable deviations and loss of street-level details due to limited model capacity.

Method: Uses transformer backbone for both conditional information embedding and noise prediction, with two GPS coordinate embedding strategies: location embedding and longitude-latitude embedding, analyzed at different scales.

Result: Experiments on two real-world datasets show Trajectory Transformer significantly enhances generation quality and effectively alleviates deviation issues from prior approaches.

Conclusion: Transformer-based architecture outperforms convolution-based methods for trajectory generation, providing better quality and reducing deviations while preserving fine-grained details.

Abstract: The widespread use of GPS devices has driven advances in spatiotemporal data mining, enabling machine learning models to simulate human decision making and generate realistic trajectories, addressing both data collection costs and privacy concerns. Recent studies have shown the promise of diffusion models for high-quality trajectory generation. However, most existing methods rely on convolution based architectures (e.g. UNet) to predict noise during the diffusion process, which often results in notable deviations and the loss of fine-grained street-level details due to limited model capacity. In this paper, we propose Trajectory Transformer, a novel model that employs a transformer backbone for both conditional information embedding and noise prediction. We explore two GPS coordinate embedding strategies, location embedding and longitude-latitude embedding, and analyze model performance at different scales. Experiments on two real-world datasets demonstrate that Trajectory Transformer significantly enhances generation quality and effectively alleviates the deviation issues observed in prior approaches.

[368] A Multi-Agent Framework for Stateful Inference-Time Search

Arshika Lalan, Rajat Ghosh, Aditya Kolsur, Debojyoti Dutta

Main category: cs.LG

TL;DR: Stateful multi-agent evolutionary search framework improves unit test generation by combining persistent inference-time state, adversarial mutation, and evolutionary preservation to generate robust edge cases.

Details

Motivation: Stateless inference struggles with multi-step reasoning tasks, and task-specific fine-tuning often fails on deeper reasoning with long-horizon dependencies.

Method: Training-free framework with persistent inference-time state, adversarial mutation, and evolutionary preservation using specialized agents for proposing, mutating, and scoring candidates.

Result: Achieves substantial gains in coverage over stateless single-step baselines on HumanEval and TestGenEvalMini benchmarks using Llama, Gemma, and GPT LLMs.

Conclusion: Combining persistent inference-time state with evolutionary search materially improves unit-test generation capabilities.

Abstract: Recent work explores agentic inference-time techniques to perform structured, multi-step reasoning. However, stateless inference often struggles on multi-step tasks due to the absence of persistent state. Moreover, task-specific fine-tuning or instruction-tuning often achieve surface-level code generation but remain brittle on tasks requiring deeper reasoning and long-horizon dependencies. To address these limitations, we propose stateful multi-agent evolutionary search, a training-free framework that departs from prior stateless approaches by combining (i) persistent inference-time state, (ii) adversarial mutation, and (iii) evolutionary preservation. We demonstrate its effectiveness in automated unit test generation through the generation of edge cases. We generate robust edge cases using an evolutionary search process, where specialized agents sequentially propose, mutate, and score candidates. A controller maintains persistent state across generations, while evolutionary preservation ensures diversity and exploration across all possible cases. This yields a generalist agent capable of discovering robust, high-coverage edge cases across unseen codebases. Experiments show our stateful multi-agent inference framework achieves substantial gains in coverage over stateless single-step baselines, evaluated on prevalent unit-testing benchmarks such as HumanEval and TestGenEvalMini and using three diverse LLM families - Llama, Gemma, and GPT. These results indicate that combining persistent inference-time state with evolutionary search materially improves unit-test generation.

[369] BlockGPT: Spatio-Temporal Modelling of Rainfall via Frame-Level Autoregression

Cristian Meo, Varun Sarathchandran, Avijit Majhi, Shao Hung, Carlo Saccardi, Ruben Imhoff, Roberto Deidda, Remko Uijlenhoet, Justin Dauwels

Main category: cs.LG

TL;DR: BlockGPT is a generative autoregressive transformer for precipitation nowcasting that uses batched tokenization to predict full 2D fields per time step, achieving superior accuracy and 31x faster inference than baselines.

Details

Motivation: Current precipitation nowcasting methods have limitations: token-based autoregressive models suffer from flawed inductive biases and slow inference, while diffusion models are computationally intensive.

Method: BlockGPT uses a model-agnostic paradigm with batched tokenization (Block method) that factorizes space-time using self-attention within frames and causal attention across frames to predict full 2D precipitation fields.

Result: BlockGPT achieves superior accuracy, better event localization on categorical metrics, and inference speeds up to 31x faster than state-of-the-art baselines (NowcastingGPT and DiffCast+Phydnet) on KNMI and SEVIR datasets.

Conclusion: BlockGPT provides an effective solution for precipitation nowcasting that balances accuracy with computational efficiency, making it suitable for real-time applications.

Abstract: Predicting precipitation maps is a highly complex spatiotemporal modeling task, critical for mitigating the impacts of extreme weather events. Short-term precipitation forecasting, or nowcasting, requires models that are not only accurate but also computationally efficient for real-time applications. Current methods, such as token-based autoregressive models, often suffer from flawed inductive biases and slow inference, while diffusion models can be computationally intensive. To address these limitations, we introduce BlockGPT, a generative autoregressive transformer using batched tokenization (Block) method that predicts full two-dimensional fields (frames) at each time step. Conceived as a model-agnostic paradigm for video prediction, BlockGPT factorizes space-time by using self-attention within each frame and causal attention across frames; in this work, we instantiate it for precipitation nowcasting. We evaluate BlockGPT on two precipitation datasets, viz. KNMI (Netherlands) and SEVIR (U.S.), comparing it to state-of-the-art baselines including token-based (NowcastingGPT) and diffusion-based (DiffCast+Phydnet) models. The results show that BlockGPT achieves superior accuracy, event localization as measured by categorical metrics, and inference speeds up to 31x faster than comparable baselines.

[370] SDAR: A Synergistic Diffusion-AutoRegression Paradigm for Scalable Sequence Generation

Shuang Cheng, Yihan Bian, Dawei Liu, Yuhua Jiang, Yihao Liu, Linfeng Zhang, Wenhai Wang, Qipeng Guo, Kai Chen, Biqing Qi, Bowen Zhou

Main category: cs.LG

TL;DR: SDAR is a hybrid approach that converts autoregressive models into blockwise diffusion models, enabling parallel inference while maintaining training efficiency.

Details

Motivation: To combine the training efficiency of autoregressive models with the parallel inference capability of diffusion models, avoiding costly end-to-end diffusion training.

Method: Lightweight paradigm conversion that transforms well-trained AR models into blockwise diffusion models through brief, data-efficient adaptation. Uses autoregressive generation across blocks for global coherence and parallel diffusion decoding within blocks.

Result: SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Larger models show stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. 30B MoE model surpasses AR counterpart on scientific reasoning benchmarks like GPQA and ChemBench.

Conclusion: SDAR successfully combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning, demonstrating enhanced reasoning and domain adaptability while maintaining efficiency.

Abstract: We propose SDAR, a Synergistic Diffusion-Autoregression paradigm that unifies the training efficiency of autoregressive models with the parallel inference capability of diffusion. Instead of costly end-to-end diffusion training, SDAR performs a lightweight paradigm conversion that transforms a well-trained autoregressive (AR) model into a blockwise diffusion model through brief, data-efficient adaptation. During inference, SDAR generates sequences autoregressively across blocks for global coherence while decoding all tokens within each block in parallel via a discrete diffusion process. Extensive experiments show that AR models remain substantially more compute-efficient than masked diffusion models, providing a strong foundation for adaptation. Building on this insight, SDAR achieves efficient AR-to-diffusion conversion with minimal cost, preserving AR-level performance while enabling parallel generation. Scaling studies across dense and Mixture-of-Experts architectures confirm that SDAR scales without compromise: larger models exhibit stronger robustness to block size and decoding thresholds, yielding greater speedups without accuracy loss. Beyond efficiency, SDAR demonstrates enhanced reasoning and domain adaptability. Our 30B MoE model surpasses its AR counterpart on challenging scientific reasoning benchmarks such as GPQA and ChemBench, and gains further improvements under test-time scaling methods like majority voting and pass@k. Together, these results establish SDAR as a practical paradigm that combines the strengths of autoregression and diffusion for scalable, high-throughput reasoning.

[371] The Framework That Survives Bad Models: Human-AI Collaboration For Clinical Trials

Yao Chen, David Ohlssen, Aimee Readie, Gregory Ligozio, Ruvie Martin, Thibaud Coroller

Main category: cs.LG

TL;DR: AI-SR (AI as supporting reader) is the most suitable AI framework for clinical trials, maintaining reliable disease estimation and preserving treatment effect conclusions even with poor-quality models.

Details

Motivation: AI can support clinical trials but poses risks when evaluating patient endpoints that impact trial conclusions. Need to ensure treatment effects remain valid even under model degradation.

Method: Compared two AI frameworks against human-only assessment for medical image-based disease evaluation. Stress-tested by injecting bad models (random guesses to naive predictions) and evaluated using two randomized controlled trials with spinal X-ray endpoints.

Result: AI-SR framework met all criteria across various model types, even with bad models. It consistently provided reliable disease estimation and preserved clinical trial treatment effect estimates and conclusions.

Conclusion: AI as a supporting reader is the most suitable approach for clinical trials as it maintains reliability and preserves conclusions across different populations and model qualities.

Abstract: Artificial intelligence (AI) holds great promise for supporting clinical trials, from patient recruitment and endpoint assessment to treatment response prediction. However, deploying AI without safeguards poses significant risks, particularly when evaluating patient endpoints that directly impact trial conclusions. We compared two AI frameworks against human-only assessment for medical image-based disease evaluation, measuring cost, accuracy, robustness, and generalization ability. To stress-test these frameworks, we injected bad models, ranging from random guesses to naive predictions, to ensure that observed treatment effects remain valid even under severe model degradation. We evaluated the frameworks using two randomized controlled trials with endpoints derived from spinal X-ray images. Our findings indicate that using AI as a supporting reader (AI-SR) is the most suitable approach for clinical trials, as it meets all criteria across various model types, even with bad models. This method consistently provides reliable disease estimation, preserves clinical trial treatment effect estimates and conclusions, and retains these advantages when applied to different populations.

[372] PIKAN: Physics-Inspired Kolmogorov-Arnold Networks for Explainable UAV Channel Modelling

Kürşat Tekbıyık, Güneş Karabulut Kurt, Antoine Lesage-Landry

Main category: cs.LG

TL;DR: PIKAN embeds physical principles into neural networks for UAV channel modeling, achieving DL-level accuracy with only 232 parameters while providing symbolic, explainable expressions aligned with propagation laws.

Details

Motivation: Bridge the gap between interpretable but rigid deterministic models and accurate but uninterpretable deep learning models for UAV air-to-ground channel modeling in nonstationary environments.

Method: Propose Physics-Inspired Kolmogorov-Arnold Network (PIKAN) that embeds physical principles (free-space path loss, two-ray reflections) as flexible inductive biases in the learning process, unlike rigid PINNs.

Result: PIKAN achieves comparable accuracy to DL models with only 232 parameters (37x lighter than MLP baselines), maintains correlation with measurements, and provides symbolic expressions aligned with propagation laws.

Conclusion: PIKAN is an efficient, interpretable, and scalable solution for UAV channel modeling in beyond-5G and 6G networks, combining the benefits of both physics-based and data-driven approaches.

Abstract: Unmanned aerial vehicle (UAV) communications demand accurate yet interpretable air-to-ground (A2G) channel models that can adapt to nonstationary propagation environments. While deterministic models offer interpretability and deep learning (DL) models provide accuracy, both approaches suffer from either rigidity or a lack of explainability. To bridge this gap, we propose the Physics-Inspired Kolmogorov-Arnold Network (PIKAN) that embeds physical principles (e.g., free-space path loss, two-ray reflections) into the learning process. Unlike physics-informed neural networks (PINNs), PIKAN is more flexible for applying physical information because it introduces them as flexible inductive biases. Thus, it enables a more flexible training process. Experiments on UAV A2G measurement data show that PIKAN achieves comparable accuracy to DL models while providing symbolic and explainable expressions aligned with propagation laws. Remarkably, PIKAN achieves this performance with only 232 parameters, making it up to 37 times lighter than multilayer perceptron (MLP) baselines with thousands of parameters, without sacrificing correlation with measurements and also providing symbolic expressions. These results highlight PIKAN as an efficient, interpretable, and scalable solution for UAV channel modelling in beyond-5G and 6G networks.

[373] Lagrangian neural ODEs: Measuring the existence of a Lagrangian with Helmholtz metrics

Luca Wolf, Tobias Buck, Bjoern Malte Schaefer

Main category: cs.LG

TL;DR: The paper introduces Helmholtz metrics to quantify how closely an ODE resembles Euler-Lagrange equations, and combines them with neural ODEs to create Lagrangian neural ODEs that can learn physical systems directly from positional data.

Details

Motivation: Neural ODEs are powerful but not all solutions are physical (Euler-Lagrange equations). There's a need to quantify this resemblance and directly learn physical systems with proper Lagrangian structure.

Method: Developed Helmholtz metrics to measure ODE resemblance to Euler-Lagrange equations. Combined with second-order neural ODEs to form Lagrangian neural ODEs that learn Euler-Lagrange equations directly from positional data.

Result: The approach can distinguish Lagrangian from non-Lagrangian systems, improves neural ODE solutions, and achieves zero additional inference cost while learning only from positional data.

Conclusion: Helmholtz metrics provide effective quantification of physical resemblance in ODEs, and Lagrangian neural ODEs enable direct learning of Euler-Lagrange equations with improved performance and no inference overhead.

Abstract: Neural ODEs are a widely used, powerful machine learning technique in particular for physics. However, not every solution is physical in that it is an Euler-Lagrange equation. We present Helmholtz metrics to quantify this resemblance for a given ODE and demonstrate their capabilities on several fundamental systems with noise. We combine them with a second order neural ODE to form a Lagrangian neural ODE, which allows to learn Euler-Lagrange equations in a direct fashion and with zero additional inference cost. We demonstrate that, using only positional data, they can distinguish Lagrangian and non-Lagrangian systems and improve the neural ODE solutions.

[374] Relational Transformer: Toward Zero-Shot Foundation Models for Relational Data

Rishabh Ranjan, Valter Hudovernik, Mark Znidar, Charilaos Kanatsoulis, Roshan Upendra, Mahmoud Mohammadi, Joe Meyer, Tom Palczewski, Carlos Guestrin, Jure Leskovec

Main category: cs.LG

TL;DR: The Relational Transformer (RT) is a novel architecture that enables zero-shot transfer learning across diverse relational databases without task-specific fine-tuning, achieving strong performance through relational attention mechanisms.

Details

Motivation: Relational domains lack architectures that can transfer across datasets and tasks due to the diversity of relational data with varying schemas, graph structures, and functional dependencies.

Method: RT tokenizes cells with table/column metadata, uses masked token prediction for pretraining, and employs a novel Relational Attention mechanism over columns, rows, and primary-foreign key links.

Result: Pretrained on RelBench datasets, RT achieves 94% of fully supervised AUROC on binary classification tasks with zero-shot performance using a 22M parameter model, outperforming a 27B LLM (84%). Fine-tuning yields state-of-the-art results with high sample efficiency.

Conclusion: RT provides a practical path toward foundation models for relational data by effectively harnessing task-table context, relational attention patterns, and schema semantics for zero-shot transfer.

Abstract: Pretrained transformers readily adapt to new sequence modeling tasks via zero-shot prompting, but relational domains still lack architectures that transfer across datasets and tasks. The core challenge is the diversity of relational data, with varying heterogeneous schemas, graph structures and functional dependencies. In this paper, we present the Relational Transformer (RT) architecture, which can be pretrained on diverse relational databases and directly applied to unseen datasets and tasks without task- or dataset-specific fine-tuning, or retrieval of in-context examples. RT (i) tokenizes cells with table/column metadata, (ii) is pretrained via masked token prediction, and (iii) utilizes a novel \textit{Relational Attention} mechanism over columns, rows, and primary-foreign key links. Pretrained on RelBench datasets spanning tasks such as churn and sales forecasting, RT attains strong zero-shot performance, averaging 94% of fully supervised AUROC on binary classification tasks with a single forward pass of a 22M parameter model, as opposed to 84% for a 27B LLM. Fine-tuning yields state-of-the-art results with high sample efficiency. Our experiments show that RT’s zero-shot transfer harnesses task-table context, relational attention patterns and schema semantics. Overall, RT provides a practical path toward foundation models for relational data.

[375] A Differentiable Alignment Framework for Sequence-to-Sequence Modeling via Optimal Transport

Yacouba Kaloga, Shashi Kumar, Petr Motlicek, Ina Kodrasi

Main category: cs.LG

TL;DR: The paper proposes a novel differentiable alignment framework using one-dimensional optimal transport to address alignment inaccuracies in E2E ASR systems, introducing Sequence Optimal Transport Distance (SOTD) and Optimal Temporal Transport Classification (OTTC) loss.

Details

Motivation: To solve the peaky behavior and alignment inaccuracies in state-of-the-art E2E ASR systems like CTC and transducer-based models, which are critical for applications such as medical speech analysis and language learning tools.

Method: Proposes a differentiable alignment framework based on one-dimensional optimal transport, introducing SOTD pseudo-metric and OTTC loss for ASR, contrasting with CTC.

Result: Experimental results on TIMIT, AMI, and LibriSpeech datasets show considerable improvement in alignment performance compared to CTC and Consistency-Regularized CTC, though with a trade-off in ASR performance.

Conclusion: The work opens new avenues for seq2seq alignment research and provides a solid foundation for further exploration and development in the community.

Abstract: Accurate sequence-to-sequence (seq2seq) alignment is critical for applications like medical speech analysis and language learning tools relying on automatic speech recognition (ASR). State-of-the-art end-to-end (E2E) ASR systems, such as the Connectionist Temporal Classification (CTC) and transducer-based models, suffer from peaky behavior and alignment inaccuracies. In this paper, we propose a novel differentiable alignment framework based on one-dimensional optimal transport, enabling the model to learn a single alignment and perform ASR in an E2E manner. We introduce a pseudo-metric, called Sequence Optimal Transport Distance (SOTD), over the sequence space and discuss its theoretical properties. Based on the SOTD, we propose Optimal Temporal Transport Classification (OTTC) loss for ASR and contrast its behavior with CTC. Experimental results on the TIMIT, AMI, and LibriSpeech datasets show that our method considerably improves alignment performance compared to CTC and the more recently proposed Consistency-Regularized CTC, though with a trade-off in ASR performance. We believe this work opens new avenues for seq2seq alignment research, providing a solid foundation for further exploration and development within the community.

[376] Monte Carlo Permutation Search

Tristan Cazenave

Main category: cs.LG

TL;DR: MCPS improves GRAVE algorithm by including all playouts containing moves from root to node in exploration term, works well when deep RL isn’t feasible, and outperforms GRAVE in two-player games.

Details

Motivation: To develop a better MCTS algorithm for scenarios where deep reinforcement learning isn't practical or computing power is limited, such as in General Game Playing.

Method: Monte Carlo Permutation Search (MCPS) enhances MCTS by incorporating statistics from all playouts that contain all moves on the path from root to node in the exploration term, and uses abstract codes for moves instead of exact codes.

Result: MCPS outperforms GRAVE in all two-player games tested (board games, wargame, investment game, video game) and has equivalent performance in multi-player games. It eliminates the need for GRAVE’s bias hyperparameter and is insensitive to the ref hyperparameter.

Conclusion: MCPS is a significant improvement over GRAVE, particularly for two-player games, with better mathematical foundations and reduced hyperparameter sensitivity.

Abstract: We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE algorithm. MCPS is relevant when deep reinforcement learning is not an option, or when the computing power available before play is not substantial, such as in General Game Playing, for example. The principle of MCPS is to include in the exploration term of a node the statistics on all the playouts that contain all the moves on the path from the root to the node. We extensively test MCPS on a variety of games: board games, wargame, investment game, video game and multi-player games. MCPS has better results than GRAVE in all the two-player games. It has equivalent results for multi-player games because these games are inherently balanced even when players have different strengths. We also show that using abstract codes for moves instead of exact codes can be beneficial to both MCPS and GRAVE, as they improve the permutation statistics and the AMAF statistics. We also provide a mathematical derivation of the formulas used for weighting the three sources of statistics. These formulas are an improvement on the GRAVE formula since they no longer use the bias hyperparameter of GRAVE. Moreover, MCPS is not sensitive to the ref hyperparameter.

[377] AbsoluteNet: A Deep Learning Neural Network to Classify Cerebral Hemodynamic Responses of Auditory Processing

Behtom Adeli, John Mclinden, Pankaj Pandey, Ming Shao, Yalda Shahriari

Main category: cs.LG

TL;DR: AbsoluteNet, a novel DL architecture using spatio-temporal convolution and custom activation functions, achieves 87.0% accuracy in classifying auditory fNIRS responses, outperforming existing models by 3.8%.

Details

Motivation: To improve classification of auditory event-related responses in fNIRS data for BCI applications by leveraging deep learning's potential in decoding hemodynamic responses.

Method: Proposed AbsoluteNet architecture based on spatio-temporal convolution principles with customized activation functions, compared against fNIRSNET, MDNN, DeepConvNet, and ShallowConvNet.

Result: AbsoluteNet achieved 87.0% accuracy, 84.8% sensitivity, and 89.2% specificity in binary classification, outperforming the second-best model (fNIRSNET) by 3.8% in accuracy.

Conclusion: The proposed deep learning model effectively decodes auditory hemodynamic responses, demonstrating the importance of spatio-temporal feature aggregation and customized activation functions for fNIRS dynamics.

Abstract: In recent years, deep learning (DL) approaches have demonstrated promising results in decoding hemodynamic responses captured by functional near-infrared spectroscopy (fNIRS), particularly in the context of brain-computer interface (BCI) applications. This work introduces AbsoluteNet, a novel deep learning architecture designed to classify auditory event-related responses recorded using fNIRS. The proposed network is built upon principles of spatio-temporal convolution and customized activation functions. Our model was compared against several models, namely fNIRSNET, MDNN, DeepConvNet, and ShallowConvNet. The results showed that AbsoluteNet outperforms existing models, reaching 87.0% accuracy, 84.8% sensitivity, and 89.2% specificity in binary classification, surpassing fNIRSNET, the second-best model, by 3.8% in accuracy. These findings underscore the effectiveness of our proposed deep learning model in decoding hemodynamic responses related to auditory processing and highlight the importance of spatio-temporal feature aggregation and customized activation functions to better fit fNIRS dynamics.

[378] Making and Evaluating Calibrated Forecasts

Yuxuan Lu, Yifan Wu, Jason Hartline, Lunjia Hu

Main category: cs.LG

TL;DR: The paper introduces a perfectly truthful calibration measure for multi-class prediction tasks, extending previous binary-only measures. It analyzes which extension methods preserve truthfulness and demonstrates superior robustness compared to existing measures like binned ECE.

Details

Motivation: Existing calibration measures for multi-class prediction are non-truthful, meaning they incentivize predictors to lie about probabilities to appear more calibrated. There was a need for truthful measures that generalize beyond binary prediction tasks.

Method: The authors develop a perfectly truthful calibration measure for multi-class prediction by studying and identifying extension methods from binary to multi-class that preserve truthfulness. They mathematically analyze truthfulness preservation and empirically verify robustness.

Result: The proposed calibration measure is perfectly truthful for multi-class prediction and exhibits superior robustness - it preserves ordering between dominant and dominated predictors regardless of bin size choices, addressing the non-robustness issues of binned ECE.

Conclusion: The work successfully generalizes truthful calibration measures to multi-class prediction, identifies truthfulness-preserving extension methods, and provides a robust calibration measure that overcomes limitations of existing approaches like binned ECE.

Abstract: Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. All previous calibration measures were non-truthful until Hartline et al. [2025] introduced the first perfectly truthful calibration measures for binary prediction tasks in the batch setting. We introduce a perfectly truthful calibration measure for multi-class prediction tasks, generalizing the work of Hartline et al. [2025] beyond binary prediction. We study common methods of extending calibration measures from binary to multi-class prediction and identify ones that do or do not preserve truthfulness. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.

[379] Geometry-Aware Backdoor Attacks: Leveraging Curvature in Hyperbolic Embeddings

Ali Baheri

Main category: cs.LG

TL;DR: Non-Euclidean models in curved spaces like hyperbolic geometry have boundary-driven asymmetry that backdoor triggers can exploit, making small input changes appear subtle but cause large representation shifts. Defenses that pull points inward suppress triggers but sacrifice useful model sensitivity.

Details

Motivation: To understand and formalize the geometric vulnerability in non-Euclidean foundation models where boundary asymmetry enables backdoor attacks that are hard to detect with standard methods.

Method: Theoretical analysis of boundary-driven asymmetry in curved spaces, formalization of the effect, and proposal of a geometry-adaptive trigger evaluated across tasks and architectures.

Result: Attack success increases toward the boundary while conventional detectors weaken, matching theoretical predictions. Defenses that pull points inward can suppress triggers but reduce model sensitivity.

Conclusion: Non-Euclidean models have geometry-specific vulnerabilities that require analysis-backed guidance for designing effective defenses while understanding their limitations.

Abstract: Non-Euclidean foundation models increasingly place representations in curved spaces such as hyperbolic geometry. We show that this geometry creates a boundary-driven asymmetry that backdoor triggers can exploit. Near the boundary, small input changes appear subtle to standard input-space detectors but produce disproportionately large shifts in the model’s representation space. Our analysis formalizes this effect and also reveals a limitation for defenses: methods that act by pulling points inward along the radius can suppress such triggers, but only by sacrificing useful model sensitivity in that same direction. Building on these insights, we propose a simple geometry-adaptive trigger and evaluate it across tasks and architectures. Empirically, attack success increases toward the boundary, whereas conventional detectors weaken, mirroring the theoretical trends. Together, these results surface a geometry-specific vulnerability in non-Euclidean models and offer analysis-backed guidance for designing and understanding the limits of defenses.

[380] The Effect of Label Noise on the Information Content of Neural Representations

Ali Hussaini Umar, Franky Kevin Nando Tezoh, Jean Barbier, Santiago Acevedo, Alessandro Laio

Main category: cs.LG

TL;DR: The paper analyzes how label noise affects neural network hidden representations using Information Imbalance, revealing double descent behavior similar to test error and showing overparameterized networks are robust to label noise.

Details

Motivation: While label noise's impact on model performance is well-studied, its effects on hidden representations remain poorly understood. This gap needs systematic investigation to better understand how networks learn from noisy data.

Method: Systematic comparison of hidden representations using Information Imbalance, a computationally efficient proxy of conditional mutual information, across different parameterization regimes with noisy and clean labels.

Result: Hidden representations show double descent behavior with parameter count. In underparameterized regime, noisy label representations are more informative than clean ones; in overparameterized regime, they are equally informative. Representations from random labels perform worse than random features.

Conclusion: Overparameterized networks are robust to label noise, and the information imbalance between layers decreases with cross-entropy loss, offering new insights into generalization. Training on random labels goes beyond lazy learning as weights adapt to encode label information.

Abstract: In supervised classification tasks, models are trained to predict a label for each data point. In real-world datasets, these labels are often noisy due to annotation errors. While the impact of label noise on the performance of deep learning models has been widely studied, its effects on the networks’ hidden representations remain poorly understood. We address this gap by systematically comparing hidden representations using the Information Imbalance, a computationally efficient proxy of conditional mutual information. Through this analysis, we observe that the information content of the hidden representations follows a double descent as a function of the number of network parameters, akin to the behavior of the test error. We further demonstrate that in the underparameterized regime, representations learned with noisy labels are more informative than those learned with clean labels, while in the overparameterized regime, these representations are equally informative. Our results indicate that the representations of overparameterized networks are robust to label noise. We also found that the information imbalance between the penultimate and pre-softmax layers decreases with cross-entropy loss in the overparameterized regime. This offers a new perspective on understanding generalization in classification tasks. Extending our analysis to representations learned from random labels, we show that these perform worse than random features. This indicates that training on random labels drives networks much beyond lazy learning, as weights adapt to encode labels information.

[381] Differential Privacy for Adaptive Weight Aggregation in Federated Tumor Segmentation

Muhammad Irfan Khan, Esa Alhoniemi, Elina Kontio, Suleiman A. Khan, Mojtaba Jafaritadi

Main category: cs.LG

TL;DR: A differentially private federated learning framework (DP-SimAgg) is proposed for brain tumor segmentation in MRI, enhancing privacy protection while maintaining segmentation accuracy.

Details

Motivation: Conventional FL methods pose security risks with diverse client data, potentially compromising privacy and data integrity in medical image analysis.

Method: Extended similarity weight aggregation (SimAgg) to DP-SimAgg algorithm, incorporating differential privacy in the global weight aggregation phase for federated brain tumor segmentation.

Result: DP-SimAgg enables accurate and robust brain tumor segmentation while minimizing communication costs and providing strong privacy preservation against adversarial attacks.

Conclusion: Adding differential privacy to federated brain tumor segmentation provides a promising privacy-preserving solution without compromising model efficacy, effectively protecting client data.

Abstract: Federated Learning (FL) is a distributed machine learning approach that safeguards privacy by creating an impartial global model while respecting the privacy of individual client data. However, the conventional FL method can introduce security risks when dealing with diverse client data, potentially compromising privacy and data integrity. To address these challenges, we present a differential privacy (DP) federated deep learning framework in medical image segmentation. In this paper, we extend our similarity weight aggregation (SimAgg) method to DP-SimAgg algorithm, a differentially private similarity-weighted aggregation algorithm for brain tumor segmentation in multi-modal magnetic resonance imaging (MRI). Our DP-SimAgg method not only enhances model segmentation capabilities but also provides an additional layer of privacy preservation. Extensive benchmarking and evaluation of our framework, with computational performance as a key consideration, demonstrate that DP-SimAgg enables accurate and robust brain tumor segmentation while minimizing communication costs during model training. This advancement is crucial for preserving the privacy of medical image data and safeguarding sensitive information. In conclusion, adding a differential privacy layer in the global weight aggregation phase of the federated brain tumor segmentation provides a promising solution to privacy concerns without compromising segmentation model efficacy. By leveraging DP, we ensure the protection of client data against adversarial attacks and malicious participants.

[382] Test-Time Efficient Pretrained Model Portfolios for Time Series Forecasting

Mert Kayaalp, Caner Turkmen, Oleksandr Shchur, Pedro Mercado, Abdul Fatir Ansari, Michael Bohlke-Schneider, Bernie Wang

Main category: cs.LG

TL;DR: Portfolios of smaller specialist models outperform single large models for time series forecasting, achieving competitive performance with fewer parameters through ensembling and model selection.

Details

Motivation: To challenge the assumption that bigger models are always better for time series foundation models, and explore whether collections of smaller models can provide competitive performance more efficiently.

Method: Build portfolios of smaller pretrained forecasting models using ensembling or model selection strategies, with a focus on creating specialist models through post-training of base models.

Result: Collections of specialist models consistently outperform portfolios of independently trained generalists, and post-training base models is compute-effective for creating diverse specialists.

Conclusion: Ensembling and model selection over portfolios of smaller specialist models is more compute-efficient than test-time fine-tuning and can achieve competitive performance with fewer parameters than monolithic large models.

Abstract: Is bigger always better for time series foundation models? With the question in mind, we explore an alternative to training a single, large monolithic model: building a portfolio of smaller, pretrained forecasting models. By applying ensembling or model selection over these portfolios, we achieve competitive performance on large-scale benchmarks using much fewer parameters. We explore strategies for designing such portfolios and find that collections of specialist models consistently outperform portfolios of independently trained generalists. Remarkably, we demonstrate that post-training a base model is a compute-effective approach for creating sufficiently diverse specialists, and provide evidences that ensembling and model selection are more compute-efficient than test-time fine-tuning.

[383] Nearly Instance-Optimal Parameter Recovery from Many Trajectories via Hellinger Localization

Eliot Shekhtman, Yichen Zhou, Ingvar Ziemann, Nikolai Matni, Stephen Tu

Main category: cs.LG

TL;DR: The paper presents a framework for instance-optimal learning from multi-trajectory data using Hellinger localization, achieving near-optimal rates across various models without requiring mixing assumptions.

Details

Motivation: Current understanding of sequential learning in multi-trajectory settings is incomplete, with existing methods either requiring mixing assumptions or providing suboptimal guarantees. This work aims to provide instance-optimal bounds that scale with the full data budget under broad conditions.

Method: Uses Hellinger localization framework with two steps: (1) control squared Hellinger distance at path-measure level via reduction to i.i.d. learning, (2) localization as quadratic form in parameter space weighted by trajectory Fisher information.

Result: Achieves instance-optimal bounds that scale with full data budget across four case studies: mixture of Markov chains, dependent linear regression with non-Gaussian noise, generalized linear models with non-monotonic activations, and linear-attention sequence models.

Conclusion: The framework significantly broadens scope of instance-optimal rates in multi-trajectory settings, nearly matching asymptotic normality rates and substantially improving over standard reductions.

Abstract: Learning from temporally-correlated data is a core facet of modern machine learning. Yet our understanding of sequential learning remains incomplete, particularly in the multi-trajectory setting where data consists of many independent realizations of a time-indexed stochastic process. This important regime both reflects modern training pipelines such as for large foundation models, and offers the potential for learning without the typical mixing assumptions made in the single-trajectory case. However, instance-optimal bounds are known only for least-squares regression with dependent covariates; for more general models or loss functions, the only broadly applicable guarantees result from a reduction to either i.i.d. learning, with effective sample size scaling only in the number of trajectories, or an existing single-trajectory result when each individual trajectory mixes, with effective sample size scaling as the full data budget deflated by the mixing-time. In this work, we significantly broaden the scope of instance-optimal rates in multi-trajectory settings via the Hellinger localization framework, a general approach for maximum likelihood estimation. Our method proceeds by first controlling the squared Hellinger distance at the path-measure level via a reduction to i.i.d. learning, followed by localization as a quadratic form in parameter space weighted by the trajectory Fisher information. This yields instance-optimal bounds that scale with the full data budget under a broad set of conditions. We instantiate our framework across four diverse case studies: a simple mixture of Markov chains, dependent linear regression under non-Gaussian noise, generalized linear models with non-monotonic activations, and linear-attention sequence models. In all cases, our bounds nearly match the instance-optimal rates from asymptotic normality, substantially improving over standard reductions.

[384] Bayesian Optimization under Uncertainty for Training a Scale Parameter in Stochastic Models

Akash Yadav, Ruda Zhang

Main category: cs.LG

TL;DR: A novel Bayesian optimization framework for efficient hyperparameter tuning under uncertainty, focusing on scale/precision parameters in stochastic models, achieving 40x computational cost reduction.

Details

Motivation: Hyperparameter tuning under uncertainty is computationally expensive due to noisy function evaluations, requiring more efficient optimization methods.

Method: Uses Bayesian optimization with statistical surrogate for underlying random variable, enabling analytical expectation evaluation and closed-form optimizer for acquisition function.

Result: Requires 40 times fewer data points than conventional Monte Carlo-based optimization, achieving up to 40-fold computational cost reduction.

Conclusion: The proposed method is effective for hyperparameter tuning under uncertainty, as demonstrated through computational engineering examples with significant efficiency improvements.

Abstract: Hyperparameter tuning is a challenging problem especially when the system itself involves uncertainty. Due to noisy function evaluations, optimization under uncertainty can be computationally expensive. In this paper, we present a novel Bayesian optimization framework tailored for hyperparameter tuning under uncertainty, with a focus on optimizing a scale- or precision-type parameter in stochastic models. The proposed method employs a statistical surrogate for the underlying random variable, enabling analytical evaluation of the expectation operator. Moreover, we derive a closed-form expression for the optimizer of the random acquisition function, which significantly reduces computational cost per iteration. Compared with a conventional one-dimensional Monte Carlo-based optimization scheme, the proposed approach requires 40 times fewer data points, resulting in up to a 40-fold reduction in computational cost. We demonstrate the effectiveness of the proposed method through two numerical examples in computational engineering.

[385] GPS-MTM: Capturing Pattern of Normalcy in GPS-Trajectories with self-supervised learning

Umang Garg, Bowen Zhang, Anantajit Subrahmanya, Chandrakanth Gudavalli, BS Manjunath

Main category: cs.LG

TL;DR: GPS-MTM is a foundation model for mobility data that decomposes trajectories into states (POI categories) and actions (transitions), using a bi-directional Transformer with masked modeling to learn semantic patterns without labels, achieving superior performance on trajectory tasks.

Details

Motivation: Foundation models have shown success in text, vision, and video domains, but similar breakthroughs are needed for trajectory modeling to capture human movement patterns at scale.

Method: Decomposes mobility into states (point-of-interest categories) and actions (agent transitions), uses bi-directional Transformer with self-supervised masked modeling to reconstruct missing segments across modalities.

Result: Outperforms existing methods on benchmark datasets (Numosim-LA, Urban Anomalies, Geolife) for trajectory infilling and next-stop prediction, with strongest advantages in dynamic tasks requiring contextual reasoning.

Conclusion: GPS-MTM establishes mobility data as a first-class modality for large-scale representation learning and serves as a robust foundation model for trajectory analytics.

Abstract: Foundation models have driven remarkable progress in text, vision, and video understanding, and are now poised to unlock similar breakthroughs in trajectory modeling. We introduce the GPSMasked Trajectory Transformer (GPS-MTM), a foundation model for large-scale mobility data that captures patterns of normalcy in human movement. Unlike prior approaches that flatten trajectories into coordinate streams, GPS-MTM decomposes mobility into two complementary modalities: states (point-of-interest categories) and actions (agent transitions). Leveraging a bi-directional Transformer with a self-supervised masked modeling objective, the model reconstructs missing segments across modalities, enabling it to learn rich semantic correlations without manual labels. Across benchmark datasets, including Numosim-LA, Urban Anomalies, and Geolife, GPS-MTM consistently outperforms on downstream tasks such as trajectory infilling and next-stop prediction. Its advantages are most pronounced in dynamic tasks (inverse and forward dynamics), where contextual reasoning is critical. These results establish GPS-MTM as a robust foundation model for trajectory analytics, positioning mobility data as a first-class modality for large-scale representation learning. Code is released for further reference.

[386] Context-Aware Inference via Performance Forecasting in Decentralized Learning Networks

Joel Pfeffer, J. M. Diederik Kruijssen, Clément Gossart, Mélanie Chevance, Diego Campo Millan, Florian Stecker, Steven N. Longmore

Main category: cs.LG

TL;DR: The paper proposes a performance forecasting model for decentralized learning networks that predicts future model performance to enable proactive weight assignment, overcoming limitations of reactive linear pooling methods.

Details

Motivation: Existing dynamic prediction combination methods using linear pooling are reactive and slow to adapt to changing circumstances due to their reliance on historical performance averaging.

Method: Develops a machine learning model that forecasts performance of predictions by models at each epoch, enabling context-aware weight assignment based on predicted future performance rather than past performance.

Result: Performance forecasting models predicting regret or regret z-score show greater improvement than loss prediction models, with forecasting performance being sensitive to feature set choices and training epochs.

Conclusion: Performance forecasting for prediction combination can improve accuracy in decentralized learning networks and may be useful in any situation requiring predictive rather than reactive model weighting.

Abstract: In decentralized learning networks, predictions from many participants are combined to generate a network inference. While many studies have demonstrated performance benefits of combining multiple model predictions, existing strategies using linear pooling methods (ranging from simple averaging to dynamic weight updates) face a key limitation. Dynamic prediction combinations that rely on historical performance to update weights are necessarily reactive. Due to the need to average over a reasonable number of epochs (with moving averages or exponential weighting), they tend to be slow to adjust to changing circumstances (phase or regime changes). In this work, we develop a model that uses machine learning to forecast the performance of predictions by models at each epoch in a time series. This enables `context-awareness’ by assigning higher weight to models that are likely to be more accurate at a given time. We show that adding a performance forecasting worker in a decentralized learning network, following a design similar to the Allora network, can improve the accuracy of network inferences. Specifically, we find forecasting models that predict regret (performance relative to the network inference) or regret z-score (performance relative to other workers) show greater improvement than models predicting losses, which often do not outperform the naive network inference (historically weighted average of all inferences). Through a series of optimization tests, we show that the performance of the forecasting model can be sensitive to choices in the feature set and number of training epochs. These properties may depend on the exact problem and should be tailored to each domain. Although initially designed for a decentralized learning network, using performance forecasting for prediction combination may be useful in any situation where predictive rather than reactive model weighting is needed.

[387] How NOT to benchmark your SITE metric: Beyond Static Leaderboards and Towards Realistic Evaluation

Prabhant Singh, Sibylle Hess, Joaquin Vanschoren

Main category: cs.LG

TL;DR: Current benchmarks for evaluating transferability estimation metrics are flawed due to unrealistic model spaces and static performance hierarchies, artificially inflating metric performance where simple heuristics can outperform sophisticated methods.

Details

Motivation: To identify shortcomings in widely used benchmark setups for transferability estimation metrics, which are used to select pre-trained models for target tasks without fine-tuning or source data access.

Method: Empirical analysis demonstrating that current benchmarks have unrealistic model spaces and static performance hierarchies that artificially inflate metric performance.

Result: Simple, dataset-agnostic heuristics can outperform sophisticated transferability estimation methods in current flawed benchmark setups.

Conclusion: Current evaluation protocols are disconnected from real-world model selection complexities, requiring more robust and realistic benchmarks for meaningful future research.

Abstract: Transferability estimation metrics are used to find a high-performing pre-trained model for a given target task without fine-tuning models and without access to the source dataset. Despite the growing interest in developing such metrics, the benchmarks used to measure their progress have gone largely unexamined. In this work, we empirically show the shortcomings of widely used benchmark setups to evaluate transferability estimation metrics. We argue that the benchmarks on which these metrics are evaluated are fundamentally flawed. We empirically demonstrate that their unrealistic model spaces and static performance hierarchies artificially inflate the perceived performance of existing metrics, to the point where simple, dataset-agnostic heuristics can outperform sophisticated methods. Our analysis reveals a critical disconnect between current evaluation protocols and the complexities of real-world model selection. To address this, we provide concrete recommendations for constructing more robust and realistic benchmarks to guide future research in a more meaningful direction.

[388] Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin

Enrique Queipo-de-Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, Ravid Shwartz-Ziv

Main category: cs.LG

TL;DR: Attention sinks and compression valleys are connected phenomena caused by massive activations in LLMs, leading to a unified theory of information flow through three phases: mixing, compression, and refinement.

Details

Motivation: To understand the puzzling connection between attention sinks and compression valleys in large language models, which have been studied separately but appear related.

Method: Theoretical analysis proving massive activations cause representational compression, experimental validation across models (410M-120B parameters), and targeted ablation studies.

Result: Confirmed that extreme activation norms in middle layers simultaneously produce both compression valleys and attention sinks, validating theoretical predictions.

Conclusion: Proposed Mix-Compress-Refine theory explaining LLM computation in three phases: early mixing, middle compression, and late refinement, clarifying task-dependent representation differences.

Abstract: Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.

[389] Valid Stopping for LLM Generation via Empirical Dynamic Formal Lift

Sanjeda Akter, Ibne Farabi Shihab, Anuj Sharma

Main category: cs.LG

TL;DR: Sequential-EDFL applies anytime-valid sequential testing to language model generation stopping, using information lift tracking with formal error control, reducing generation by 22-28% while maintaining guarantees.

Details

Motivation: To develop a method for stopping language model generation early while maintaining formal statistical guarantees, reducing computational costs without compromising reliability.

Method: Tracks information lift using self-normalized empirical-Bernstein e-processes with online mean estimation, mixture e-processes for multiple parameters, and adaptive resets for distributional drift. Uses automated skeletons (distilled submodels, randomized logits) and combines with lightweight correctness gates.

Result: Reduces generation by 22-28% vs. sequential baselines with 12% computational overhead while maintaining delta-level control. Serves as first-stage filter reducing verification burden by 83%, though 10.9% of stopped sequences remain incorrect even with gate.

Conclusion: EDFL provides effective early stopping with formal guarantees but should not be used standalone in safety-critical domains; it’s best as a first-stage filter to reduce verification burden.

Abstract: We introduce Sequential-EDFL (Empirical Dynamic Formal Lift), applying anytime-valid sequential testing to language model generation stopping. Our approach tracks information lift – the log-likelihood ratio between full models and deliberately weakened “skeleton” baselines – using self-normalized empirical-Bernstein e-processes that provide formal delta-level error control regardless of stopping time. We handle unknown centering through online mean estimation, combine multiple parameters via mixture e-processes, and support adaptive resets under distributional drift. On six benchmarks, Sequential-EDFL reduces generation by 22-28% vs. sequential baselines while maintaining delta-level control with 12% computational overhead. We introduce automated skeletons (distilled submodels, randomized logits) and show robustness across skeleton families. Composing EDFL with a lightweight correctness gate (sentence boundaries + verifier) improves end-task correctness while preserving anytime-valid guarantees by only delaying stopping. Our certificates control information sufficiency, not factual correctness – 10.9% of stopped sequences remain incorrect even with the gate (13.2-22.7% without it). EDFL serves as a first-stage filter reducing verification burden by 83%, not as a standalone solution for safety-critical domains.

[390] GUIDE: Guided Initialization and Distillation of Embeddings

Khoa Trinh, Gaurav Menghani, Erik Vee

Main category: cs.LG

TL;DR: GUIDE is a distillation technique that forces student models to match teacher models in parameter space, achieving 25-26% reduction in quality gap with no training or inference overhead.

Details

Motivation: Standard distillation methods only make students match teacher outputs, but given the high cost of training large teacher models, more useful information should be extracted from teachers.

Method: GUIDE (Guided Initialization and Distillation of Embeddings) forces student models to match teacher models in the parameter space rather than just output space.

Result: 25-26% reduction in teacher-student quality gap for large student models (400M-1B parameters) trained on ~20B tokens. GUIDE alone performs substantially better than knowledge distillation alone, and can be combined with KD for near additive improvements.

Conclusion: GUIDE provides significant model quality improvements with virtually free cost since it introduces no training or inference overhead.

Abstract: Algorithmic efficiency techniques such as distillation (\cite{hinton2015distillation}) are useful in improving model quality without increasing serving costs, provided a larger teacher model is available for a smaller student model to learn from during training. Standard distillation methods are limited to only forcing the student to match the teacher’s outputs. Given the costs associated with training a large model, we believe we should be extracting more useful information from a teacher model than by just making the student match the teacher’s outputs. In this paper, we introduce \guide (Guided Initialization and Distillation of Embeddings). \guide can be considered a distillation technique that forces the student to match the teacher in the parameter space. Using \guide we show 25-26% reduction in the teacher-student quality gap when using large student models (400M - 1B parameters) trained on $\approx$ 20B tokens. We also present a thorough analysis demonstrating that \guide can be combined with knowledge distillation with near additive improvements. Furthermore, we show that applying \guide alone leads to substantially better model quality than applying knowledge distillation by itself. Most importantly, \guide introduces no training or inference overhead and hence any model quality gains from our method are virtually free.

[391] ATLO-ML: Adaptive Time-Length Optimizer for Machine Learning – Insights from Air Quality Forecasting

I-Hsi Kao, Kanji Uchino

Main category: cs.LG

TL;DR: ATLO-ML is an adaptive system that automatically optimizes input time length and sampling rate for time-series predictions based on user-defined output time length, improving model accuracy compared to fixed parameters.

Details

Motivation: Accurate time-series predictions depend heavily on selecting appropriate input time length and sampling rate, which are often chosen manually or fixed, potentially limiting model performance.

Method: ATLO-ML automatically determines optimal input time length and sampling rate through adaptive optimization, providing flexible time-series data pre-processing that dynamically adjusts these parameters.

Result: Validation using air quality datasets (GAMS-dataset and proprietary data center data) shows that optimized time length and sampling rate significantly improve machine learning model accuracy compared to fixed time lengths.

Conclusion: ATLO-ML demonstrates potential for generalization across various time-sensitive applications and offers a robust solution for optimizing temporal input parameters in machine learning workflows.

Abstract: Accurate time-series predictions in machine learning are heavily influenced by the selection of appropriate input time length and sampling rate. This paper introduces ATLO-ML, an adaptive time-length optimization system that automatically determines the optimal input time length and sampling rate based on user-defined output time length. The system provides a flexible approach to time-series data pre-processing, dynamically adjusting these parameters to enhance predictive performance. ATLO-ML is validated using air quality datasets, including both GAMS-dataset and proprietary data collected from a data center, both in time series format. Results demonstrate that utilizing the optimized time length and sampling rate significantly improves the accuracy of machine learning models compared to fixed time lengths. ATLO-ML shows potential for generalization across various time-sensitive applications, offering a robust solution for optimizing temporal input parameters in machine learning workflows.

[392] A Median Perspective on Unlabeled Data for Out-of-Distribution Detection

Momin Abbas, Ali Falahati, Hossein Goli, Mohammad Mohammadi Amiri

Main category: cs.LG

TL;DR: Medix is a novel framework for out-of-distribution (OOD) detection that uses median operations to identify outliers from unlabeled data and trains a robust OOD classifier.

Details

Motivation: Existing OOD detection methods struggle with unlabeled in-the-wild data containing mixed in-distribution and OOD samples, making it difficult to train optimal classifiers without distinct OOD samples.

Method: Uses median operation to identify potential outliers from unlabeled data due to its robustness against noise and outliers, then trains an OOD classifier using these identified outliers along with labeled in-distribution data.

Result: Medix achieves low error rates theoretically and empirically outperforms existing methods across the board in open-world settings.

Conclusion: The median-based approach effectively handles mixed unlabeled data and enables robust OOD detection, with both theoretical and empirical validation.

Abstract: Out-of-distribution (OOD) detection plays a crucial role in ensuring the robustness and reliability of machine learning systems deployed in real-world applications. Recent approaches have explored the use of unlabeled data, showing potential for enhancing OOD detection capabilities. However, effectively utilizing unlabeled in-the-wild data remains challenging due to the mixed nature of both in-distribution (InD) and OOD samples. The lack of a distinct set of OOD samples complicates the task of training an optimal OOD classifier. In this work, we introduce Medix, a novel framework designed to identify potential outliers from unlabeled data using the median operation. We use the median because it provides a stable estimate of the central tendency, as an OOD detection mechanism, due to its robustness against noise and outliers. Using these identified outliers, along with labeled InD data, we train a robust OOD classifier. From a theoretical perspective, we derive error bounds that demonstrate Medix achieves a low error rate. Empirical results further substantiate our claims, as Medix outperforms existing methods across the board in open-world settings, confirming the validity of our theoretical insights.

[393] Text-to-Image Models Leave Identifiable Signatures: Implications for Leaderboard Security

Ali Naseh, Anshuman Suri, Yuefeng Peng, Harsh Chaudhari, Alina Oprea, Amir Houmansadr

Main category: cs.LG

TL;DR: Text-to-image leaderboards are highly vulnerable to model deanonymization attacks using simple CLIP-based classification, enabling easy rank manipulation.

Details

Motivation: To demonstrate that text-to-image leaderboards are more vulnerable to deanonymization attacks than LLM leaderboards, making rank manipulation easier than previously recognized.

Method: Used 150,000+ generated images from 280 prompts and 19 diverse models, performing real-time classification in CLIP embedding space without prompt control or historical data.

Result: Simple CLIP-based classification achieved high accuracy in identifying generating models, with some prompts enabling near-perfect deanonymization using a new prompt-level separability metric.

Conclusion: Text-to-image leaderboards require stronger defenses against deanonymization attacks to prevent rank manipulation.

Abstract: Generative AI leaderboards are central to evaluating model capabilities, but remain vulnerable to manipulation. Among key adversarial objectives is rank manipulation, where an attacker must first deanonymize the models behind displayed outputs – a threat previously demonstrated and explored for large language models (LLMs). We show that this problem can be even more severe for text-to-image leaderboards, where deanonymization is markedly easier. Using over 150,000 generated images from 280 prompts and 19 diverse models spanning multiple organizations, architectures, and sizes, we demonstrate that simple real-time classification in CLIP embedding space identifies the generating model with high accuracy, even without prompt control or historical data. We further introduce a prompt-level separability metric and identify prompts that enable near-perfect deanonymization. Our results indicate that rank manipulation in text-to-image leaderboards is easier than previously recognized, underscoring the need for stronger defenses.

[394] Wide Neural Networks as a Baseline for the Computational No-Coincidence Conjecture

John Dunbar, Scott Aaronson

Main category: cs.LG

TL;DR: Randomly initialized neural networks with zero-mean activation functions have nearly independent outputs when they are sufficiently wide, supporting computational no-coincidence conjecture.

Details

Motivation: To investigate when neural networks have independent outputs and test the Alignment Research Center's computational no-coincidence conjecture about AI interpretability limits.

Method: Analyze randomly initialized neural networks with large width and specific hyperparameters, focusing on activation functions that have zero mean under Gaussian measure.

Result: Found that neural networks have nearly independent outputs exactly when their activation function is nonlinear with zero mean under Gaussian measure (e.g., shifted ReLU/GeLU, tanh).

Conclusion: Neural networks with zero-mean activation functions are promising candidates for testing the computational no-coincidence conjecture due to their independent output behavior.

Abstract: We establish that randomly initialized neural networks, with large width and a natural choice of hyperparameters, have nearly independent outputs exactly when their activation function is nonlinear with zero mean under the Gaussian measure: $\mathbb{E}_{z \sim \mathcal{N}(0,1)}[\sigma(z)]=0$. For example, this includes ReLU and GeLU with an additive shift, as well as tanh, but not ReLU or GeLU by themselves. Because of their nearly independent outputs, we propose neural networks with zero-mean activation functions as a promising candidate for the Alignment Research Center’s computational no-coincidence conjecture – a conjecture that aims to measure the limits of AI interpretability.

[395] Scalable Policy-Based RL Algorithms for POMDPs

Ameya Anjarlekar, Rasoul Etesami, R Srikant

Main category: cs.LG

TL;DR: The paper proposes transforming POMDPs into finite-state Superstate MDPs using finite history, with theoretical guarantees showing exponentially decreasing approximation error with history length, and enables using standard TD-learning and policy optimization methods.

Details

Motivation: The continuous nature of belief states in POMDPs creates computational challenges for learning optimal policies, motivating the need for approximation methods that can leverage standard MDP techniques.

Method: Transform POMDPs into finite-state Superstate MDPs using finite history windows, then apply policy-based learning with linear function approximation using TD-learning followed by Policy Optimization.

Result: Derived improved theoretical guarantees relating Superstate MDP value functions to original POMDP, and showed approximation error decreases exponentially with history length. Provides first finite-time bounds quantifying error when applying TD learning to non-Markovian dynamics.

Conclusion: POMDPs can be approximately solved by treating them as MDPs with finite history states, enabling use of standard reinforcement learning methods with provable error bounds that improve with longer history windows.

Abstract: The continuous nature of belief states in POMDPs presents significant computational challenges in learning the optimal policy. In this paper, we consider an approach that solves a Partially Observable Reinforcement Learning (PORL) problem by approximating the corresponding POMDP model into a finite-state Markov Decision Process (MDP) (called Superstate MDP). We first derive theoretical guarantees that improve upon prior work that relate the optimal value function of the transformed Superstate MDP to the optimal value function of the original POMDP. Next, we propose a policy-based learning approach with linear function approximation to learn the optimal policy for the Superstate MDP. Consequently, our approach shows that a POMDP can be approximately solved using TD-learning followed by Policy Optimization by treating it as an MDP, where the MDP state corresponds to a finite history. We show that the approximation error decreases exponentially with the length of this history. To the best of our knowledge, our finite-time bounds are the first to explicitly quantify the error introduced when applying standard TD learning to a setting where the true dynamics are not Markovian.

[396] Incoherence in goal-conditioned autoregressive models

Jacek Karwowski, Raymond Douglas

Main category: cs.LG

TL;DR: The paper mathematically analyzes incoherence in reinforcement learning policies from naive goal-conditioning of autoregressive models, showing that online RL fine-tuning reduces incoherence and improves returns, with connections to control-as-inference and soft Q learning.

Details

Motivation: To understand the structural issue of incoherence in reinforcement learning policies derived from naive goal-conditioning of autoregressive models, and to characterize how online RL fine-tuning addresses this problem.

Method: Mathematical investigation of incoherence, analysis of re-training offline-learned policies with online RL, reframing control-as-inference and soft Q learning concepts, establishing three-way correspondence between different interpretations of iterative re-training.

Result: Proved that online RL fine-tuning decreases incoherence and improves return, established correspondence between iterative re-training as folding posterior into reward and decreasing temperature parameter in deterministic case, linked incoherence to effective horizon through soft-conditioning generative models.

Conclusion: Online RL fine-tuning effectively addresses incoherence in goal-conditioned policies, with computational implications through training-inference trade-off and connections to control theory concepts.

Abstract: We investigate mathematically the notion of incoherence: a structural issue with reinforcement learning policies derived by naive goal-conditioning of autoregressive models. We focus on the process of re-training models on their own actions, that is, fine-tuning offline-learned policies with online RL. We prove that it decreases incoherence and leads to an improvement in return, and we aim to characterize the resulting trajectory of policies. By re-framing standard notions of control-as-inference and soft Q learning, we establish a three-way correspondence with two other ways of understanding the iterative re-training process: as folding the posterior into the reward and, in the deterministic case, as decreasing the temperature parameter; the correspondence has computational content via the training-inference trade-off. Through soft-conditioning generative models, we discuss the link between incoherence and the effective horizon.

[397] The Markovian Thinker

Milad Aghajohari, Kamran Chitsaz, Amirhossein Kazemnejad, Sarath Chandar, Alessandro Sordoni, Aaron Courville, Siva Reddy

Main category: cs.LG

TL;DR: Delethink introduces Markovian Thinking, a new RL environment that enables long-chain reasoning with linear compute by structuring thoughts into fixed-size chunks with state carryover, overcoming quadratic attention costs of traditional LongCoT approaches.

Details

Motivation: Standard RL environments for reasoning LLMs have unbounded states that require quadratic compute as thought chains lengthen, making long reasoning sequences computationally expensive and impractical.

Method: Proposes Markovian Thinking paradigm with Delethink environment that organizes reasoning into fixed-size chunks. At chunk boundaries, context resets with short carryover text, and RL trains policies to write sufficient state summaries for seamless continuation.

Result: A 1.5B model trained with Delethink achieves reasoning up to 24K tokens using 8K-token chunks, matching LongCoT-RL performance while using significantly less compute. At 96K thinking length, Delethink costs 7 H100-months vs 27 for LongCoT-RL.

Conclusion: Redesigning the thinking environment enables efficient long reasoning without quadratic overhead, opening path to scalable reasoning LLMs. Existing reasoning models often produce Markovian traces zero-shot, making RL effective.

Abstract: Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL “thinking environment”, where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

[398] DPA-Net: A Dual-Path Attention Neural Network for Inferring Glycemic Control Metrics from Self-Monitored Blood Glucose Data

Canyu Lei, Benjamin Lobo, Jianxin Xie

Main category: cs.LG

TL;DR: DPA-Net estimates Ambulatory Glucose Profile metrics from sparse SMBG data using dual-path attention network with spatial-channel reconstruction and multi-scale ResNet, achieving robust accuracy comparable to CGM.

Details

Motivation: CGM provides reliable glucose metrics but is expensive and inaccessible in low-income regions, while SMBG is cheap but produces sparse data that cannot directly estimate clinical metrics.

Method: Dual-Path Attention Neural Network with spatial-channel attention path for CGM trajectory reconstruction, multi-scale ResNet path for direct AGP prediction, alignment mechanism to reduce bias, and active point selector for realistic SMBG sampling.

Result: DPA-Net achieves robust accuracy with low errors and reduced systematic bias on large real-world dataset, providing first supervised ML framework for AGP estimation from SMBG data.

Conclusion: The framework offers practical decision-support tool for settings where CGM is not accessible, enabling clinical glucose monitoring using widely available SMBG devices.

Abstract: Continuous glucose monitoring (CGM) provides dense and dynamic glucose profiles that enable reliable estimation of Ambulatory Glucose Profile (AGP) metrics, such as Time in Range (TIR), Time Below Range (TBR), and Time Above Range (TAR). However, the high cost and limited accessibility of CGM restrict its widespread adoption, particularly in low- and middle-income regions. In contrast, self-monitoring of blood glucose (SMBG) is inexpensive and widely available but yields sparse and irregular data that are challenging to translate into clinically meaningful glycemic metrics. In this work, we propose a Dual-Path Attention Neural Network (DPA-Net) to estimate AGP metrics directly from SMBG data. DPA-Net integrates two complementary paths: (1) a spatial-channel attention path that reconstructs a CGM-like trajectory from sparse SMBG observations, and (2) a multi-scale ResNet path that directly predicts AGP metrics. An alignment mechanism between the two paths is introduced to reduce bias and mitigate overfitting. In addition, we develop an active point selector to identify realistic and informative SMBG sampling points that reflect patient behavioral patterns. Experimental results on a large, real-world dataset demonstrate that DPA-Net achieves robust accuracy with low errors while reducing systematic bias. To the best of our knowledge, this is the first supervised machine learning framework for estimating AGP metrics from SMBG data, offering a practical and clinically relevant decision-support tool in settings where CGM is not accessible.

[399] POME: Post Optimization Model Edit via Muon-style Projection

Yong Liu, Di Fu, Yang Luo, Zirui Zhu, Minhao Cheng, Cho-Jui Hsieh, Yang You

Main category: cs.LG

TL;DR: POME is a post-optimization algorithm that improves fine-tuned LLMs by applying muon-style projection to weight differences using truncated SVD, requiring no extra data or training.

Details

Motivation: To enhance fine-tuned language model performance without additional data collection or optimization overhead, addressing noise in weight updates.

Method: Apply muon-style projection to ΔW (weight difference between fine-tuned and pretrained models) using truncated SVD to equalize dominant update directions and prune small singular values.

Result: Consistent performance gains: +2.5% on GSM8K and +1.0% on code generation; applicable to models from 7B to 72B parameters.

Conclusion: POME provides a practical, zero-cost enhancement for any fine-tuning pipeline with broad applicability across model sizes and types.

Abstract: We introduce Post-Optimization Model Edit (POME), a new algorithm that enhances the performance of fine-tuned large language models using only their pretrained and fine-tuned checkpoints, without requiring extra data or further optimization. The core idea is to apply a muon-style projection to $\Delta W$, the difference between the fine-tuned and pretrained weights. This projection uses truncated singular value decomposition (SVD) to equalize the influence of dominant update directions and prune small singular values, which often represent noise. As a simple post-processing step, POME is completely decoupled from the training pipeline. It requires zero modifications and imposes no overhead, making it universally compatible with any optimizer or distributed framework. POME delivers consistent gains, boosting average performance by +2.5% on GSM8K and +1.0% on code generation. Its broad applicability – from 7B foundation models to 72B RLHF-instructed models – establishes it as a practical, zero-cost enhancement for any fine-tuning pipeline. Code is available at https://github.com/NUS-HPC-AI-Lab/POME.

[400] AI-Driven Forecasting and Monitoring of Urban Water System

Qiming Guo, Bishal Khatri, Hua Zhang, Wenlu Wang

Main category: cs.LG

TL;DR: An AI and remote-sensor framework called HydroNet is proposed for detecting leaks in underground water pipelines using sparse sensor deployments and pipeline attributes in a directed graph model.

Details

Motivation: Underground water pipelines suffer from leaks and infiltrations causing water loss and environmental damage, while conventional inspections are inefficient and dense sensor deployments are too expensive.

Method: Deploy sparse remote sensors to capture real-time flow and depth data, and use HydroNet - a model that incorporates pipeline attributes (material, diameter, slope) in a directed graph structure with edge-aware message passing and hydraulic simulations.

Result: Evaluations on a real-world campus wastewater network show the system collects effective spatio-temporal hydraulic data, enabling HydroNet to outperform advanced baselines in network-wide predictions from limited sensor deployments.

Conclusion: The integration of edge-aware message passing with hydraulic simulations enables accurate predictions from sparse sensors, and this approach can be effectively extended to various underground water pipeline networks.

Abstract: Underground water and wastewater pipelines are vital for city operations but plagued by anomalies like leaks and infiltrations, causing substantial water loss, environmental damage, and high repair costs. Conventional manual inspections lack efficiency, while dense sensor deployments are prohibitively expensive. In recent years, artificial intelligence has advanced rapidly and is increasingly applied to urban infrastructure. In this research, we propose an integrated AI and remote-sensor framework to address the challenge of leak detection in underground water pipelines, through deploying a sparse set of remote sensors to capture real-time flow and depth data, paired with HydroNet - a dedicated model utilizing pipeline attributes (e.g., material, diameter, slope) in a directed graph for higher-precision modeling. Evaluations on a real-world campus wastewater network dataset demonstrate that our system collects effective spatio-temporal hydraulic data, enabling HydroNet to outperform advanced baselines. This integration of edge-aware message passing with hydraulic simulations enables accurate network-wide predictions from limited sensor deployments. We envision that this approach can be effectively extended to a wide range of underground water pipeline networks.

[401] Chem-NMF: Multi-layer $α$-divergence Non-Negative Matrix Factorization for Cardiorespiratory Disease Clustering, with Improved Convergence Inspired by Chemical Catalysts and Rigorous Asymptotic Analysis

Yasaman Torabi, Shahram Shirani, James P. Reilly

Main category: cs.LG

TL;DR: Chem-NMF: A novel multi-layer NMF method using Boltzmann probability from chemical reactions to ensure convergence, improving clustering accuracy on biomedical signals and face images.

Details

Motivation: Extending NMF with α-divergence to multi-layer architectures faces convergence challenges, requiring theoretical analysis from a physical chemistry perspective.

Method: Introduces Chem-NMF with a bounding factor inspired by Boltzmann probability of energy barriers in chemical reactions to stabilize convergence.

Result: Improves clustering accuracy by 5.6% ± 2.7% on biomedical signals and 11.1% ± 7.2% on face images compared to baseline methods.

Conclusion: First study to apply physical chemistry perspective for rigorous NMF convergence analysis, demonstrating practical improvements in real-world applications.

Abstract: Non-Negative Matrix Factorization (NMF) is an unsupervised learning method offering low-rank representations across various domains such as audio processing, biomedical signal analysis, and image recognition. The incorporation of $\alpha$-divergence in NMF formulations enhances flexibility in optimization, yet extending these methods to multi-layer architectures presents challenges in ensuring convergence. To address this, we introduce a novel approach inspired by the Boltzmann probability of the energy barriers in chemical reactions to theoretically perform convergence analysis. We introduce a novel method, called Chem-NMF, with a bounding factor which stabilizes convergence. To our knowledge, this is the first study to apply a physical chemistry perspective to rigorously analyze the convergence behaviour of the NMF algorithm. We start from mathematically proven asymptotic convergence results and then show how they apply to real data. Experimental results demonstrate that the proposed algorithm improves clustering accuracy by 5.6% $\pm$ 2.7% on biomedical signals and 11.1% $\pm$ 7.2% on face images (mean $\pm$ std).

[402] Three Forms of Stochastic Injection for Improved Distribution-to-Distribution Generative Modeling

Shiye Su, Yuhui Zhang, Linqi Zhou, Rajesh Ranganath, Serena Yeung-Levy

Main category: cs.LG

TL;DR: A method to improve flow matching for general distribution-to-distribution transformations by injecting stochasticity through source sample perturbations, achieving better generation quality and reduced transport cost.

Details

Motivation: Flow matching has been primarily used for noise-to-data transformations, but its application to general distribution-to-distribution settings is underexplored, especially when source distributions have limited samples causing sparse supervision issues.

Method: Proposes injecting stochasticity into training by perturbing source samples and flow interpolants to address sparse supervision in distribution-to-distribution flow matching.

Result: Significantly improves generation quality on five diverse imaging tasks (biology, radiology, astronomy), outperforming baselines by average 9 FID points and reducing transport cost between input and generated samples.

Conclusion: The approach makes flow matching more practical for simulating diverse distribution transformations in scientific applications by better highlighting true transformation effects.

Abstract: Modeling transformations between arbitrary data distributions is a fundamental scientific challenge, arising in applications like drug discovery and evolutionary simulation. While flow matching offers a natural framework for this task, its use has thus far primarily focused on the noise-to-data setting, while its application in the general distribution-to-distribution setting is underexplored. We find that in the latter case, where the source is also a data distribution to be learned from limited samples, standard flow matching fails due to sparse supervision. To address this, we propose a simple and computationally efficient method that injects stochasticity into the training process by perturbing source samples and flow interpolants. On five diverse imaging tasks spanning biology, radiology, and astronomy, our method significantly improves generation quality, outperforming existing baselines by an average of 9 FID points. Our approach also reduces the transport cost between input and generated samples to better highlight the true effect of the transformation, making flow matching a more practical tool for simulating the diverse distribution transformations that arise in science.

[403] StruSR: Structure-Aware Symbolic Regression with Physics-Informed Taylor Guidance

Yunpeng Gong, Sihan Lan, Can Yang, Kunpeng Xu, Min Jiang

Main category: cs.LG

TL;DR: StruSR is a structure-aware symbolic regression framework that uses Physics-Informed Neural Networks (PINNs) to extract physical priors from time series data, guiding symbolic expression evolution through local Taylor expansions and genetic programming with physics-aware fitness functions.

Details

Motivation: Traditional symbolic regression methods lack mechanisms for extracting structured physical priors from time series observations, making it difficult to capture symbolic expressions that reflect the system's global behavior and physical laws.

Method: The framework uses trained PINNs to extract locally structured physical priors via local Taylor expansions, introduces masking-based attribution to quantify subtree contributions, and employs genetic programming with physics-aware mutation/crossover operations guided by a hybrid fitness function minimizing physics residuals and Taylor coefficient mismatch.

Result: Experiments on benchmark PDE systems show that StruSR improves convergence speed, structural fidelity, and expression interpretability compared to conventional baselines.

Conclusion: StruSR offers a principled paradigm for physics-grounded symbolic discovery by effectively integrating neural network priors with symbolic regression through structure-aware evolutionary algorithms.

Abstract: Symbolic regression aims to find interpretable analytical expressions by searching over mathematical formula spaces to capture underlying system behavior, particularly in scientific modeling governed by physical laws. However, traditional methods lack mechanisms for extracting structured physical priors from time series observations, making it difficult to capture symbolic expressions that reflect the system’s global behavior. In this work, we propose a structure-aware symbolic regression framework, called StruSR, that leverages trained Physics-Informed Neural Networks (PINNs) to extract locally structured physical priors from time series data. By performing local Taylor expansions on the outputs of the trained PINN, we obtain derivative-based structural information to guide symbolic expression evolution. To assess the importance of expression components, we introduce a masking-based attribution mechanism that quantifies each subtree’s contribution to structural alignment and physical residual reduction. These sensitivity scores steer mutation and crossover operations within genetic programming, preserving substructures with high physical or structural significance while selectively modifying less informative components. A hybrid fitness function jointly minimizes physics residuals and Taylor coefficient mismatch, ensuring consistency with both the governing equations and the local analytical behavior encoded by the PINN. Experiments on benchmark PDE systems demonstrate that StruSR improves convergence speed, structural fidelity, and expression interpretability compared to conventional baselines, offering a principled paradigm for physics-grounded symbolic discovery.

[404] Control-Augmented Autoregressive Diffusion for Data Assimilation

Prakhar Srivastava, Farrin Marouf Sofian, Francesco Immorlano, Kushagra Pandey, Stephan Mandt

Main category: cs.LG

TL;DR: The paper introduces an amortized framework that adds a lightweight controller to pretrained Auto-Regressive Diffusion Models (ARDMs) for data assimilation in chaotic PDEs, enabling efficient single-forward-rollout inference with on-the-fly corrections.

Details

Motivation: Guidance in Auto-Regressive Diffusion Models (ARDMs) remains underexplored, and existing methods for data assimilation in chaotic spatiotemporal PDEs are computationally prohibitive and prone to forecast drift under sparse observations.

Method: An amortized framework that augments pretrained ARDMs with a lightweight controller network trained offline by previewing future ARDM rollouts and learning stepwise controls that anticipate upcoming observations under a terminal cost objective.

Result: The method consistently outperforms four state-of-the-art baselines in stability, accuracy, and physical fidelity across two canonical PDEs and six observation regimes, while reducing DA inference to a single forward rollout with on-the-fly corrections.

Conclusion: The proposed framework provides an efficient and effective solution for data assimilation in chaotic PDEs, avoiding expensive adjoint computations and optimizations during inference while maintaining high performance.

Abstract: Despite recent advances in test-time scaling and finetuning of diffusion models, guidance in Auto-Regressive Diffusion Models (ARDMs) remains underexplored. We introduce an amortized framework that augments pretrained ARDMs with a lightweight controller network, trained offline by previewing future ARDM rollouts and learning stepwise controls that anticipate upcoming observations under a terminal cost objective. We evaluate this framework in the context of data assimilation (DA) for chaotic spatiotemporal partial differential equations (PDEs), a setting where existing methods are often computationally prohibitive and prone to forecast drift under sparse observations. Our approach reduces DA inference to a single forward rollout with on-the-fly corrections, avoiding expensive adjoint computations and/or optimizations during inference. We demonstrate that our method consistently outperforms four state-of-the-art baselines in stability, accuracy, and physical fidelity across two canonical PDEs and six observation regimes. We will release code and checkpoints publicly.

[405] The False Promise of Zero-Shot Super-Resolution in Machine-Learned Operators

Mansi Sakarvadia, Kareem Hegazy, Amin Totounferoush, Kyle Chard, Yaoqing Yang, Ian Foster, Michael W. Mahoney

Main category: cs.LG

TL;DR: Machine-learned operators (MLOs) fail at zero-shot super-resolution and multi-resolution inference due to aliasing and brittleness, but a simple multi-resolution training protocol can overcome these issues.

Details

Motivation: To evaluate whether MLOs can perform zero-shot super-resolution and multi-resolution inference, as they are designed to model continuous phenomena represented discretely and perform inference at arbitrary resolution.

Method: Comprehensive evaluation of zero-shot sub-resolution and super-resolution inference in MLOs, decoupling multi-resolution inference into extrapolation to varying frequency information and interpolation across varying resolutions.

Result: MLOs fail to perform both extrapolation and interpolation tasks in a zero-shot manner, showing brittleness and susceptibility to aliasing when inferring at resolutions different from training data.

Conclusion: A simple, computationally-efficient, data-driven multi-resolution training protocol is proposed to overcome aliasing and provide robust multi-resolution generalization.

Abstract: A core challenge in scientific machine learning, and scientific computing more generally, is modeling continuous phenomena which (in practice) are represented discretely. Machine-learned operators (MLOs) have been introduced as a means to achieve this modeling goal, as this class of architecture can perform inference at arbitrary resolution. In this work, we evaluate whether this architectural innovation is sufficient to perform “zero-shot super-resolution,” namely to enable a model to serve inference on higher-resolution data than that on which it was originally trained. We comprehensively evaluate both zero-shot sub-resolution and super-resolution (i.e., multi-resolution) inference in MLOs. We decouple multi-resolution inference into two key behaviors: 1) extrapolation to varying frequency information; and 2) interpolating across varying resolutions. We empirically demonstrate that MLOs fail to do both of these tasks in a zero-shot manner. Consequently, we find MLOs are not able to perform accurate inference at resolutions different from those on which they were trained, and instead they are brittle and susceptible to aliasing. To address these failure modes, we propose a simple, computationally-efficient, and data-driven multi-resolution training protocol that overcomes aliasing and that provides robust multi-resolution generalization.

[406] Local Reinforcement Learning with Action-Conditioned Root Mean Squared Q-Functions

Frank Wu, Mengye Ren

Main category: cs.LG

TL;DR: The paper introduces ARQ, a novel value estimation method that adapts the Forward-Forward algorithm’s goodness function for reinforcement learning, achieving state-of-the-art performance without backpropagation.

Details

Motivation: To bridge the gap between the Forward-Forward algorithm (currently limited to supervised settings) and reinforcement learning domains where learning signals occur naturally, by leveraging FF's biological grounding for RL applications.

Method: Proposes Action-conditioned Root mean squared Q-Functions (ARQ), which applies a goodness function and action conditioning for local reinforcement learning using temporal difference learning, eliminating the need for backpropagation.

Result: ARQ achieves superior performance compared to state-of-the-art local backprop-free RL methods on MinAtar and DeepMind Control Suite benchmarks, and outperforms algorithms trained with backpropagation on most tasks.

Conclusion: The Forward-Forward algorithm’s principles can be successfully adapted to reinforcement learning, providing an effective backprop-free alternative that maintains biological plausibility while achieving competitive performance.

Abstract: The Forward-Forward (FF) Algorithm is a recently proposed learning procedure for neural networks that employs two forward passes instead of the traditional forward and backward passes used in backpropagation. However, FF remains largely confined to supervised settings, leaving a gap at domains where learning signals can be yielded more naturally such as RL. In this work, inspired by FF’s goodness function using layer activity statistics, we introduce Action-conditioned Root mean squared Q-Functions (ARQ), a novel value estimation method that applies a goodness function and action conditioning for local RL using temporal difference learning. Despite its simplicity and biological grounding, our approach achieves superior performance compared to state-of-the-art local backprop-free RL methods in the MinAtar and the DeepMind Control Suite benchmarks, while also outperforming algorithms trained with backpropagation on most tasks. Code can be found at https://github.com/agentic-learning-ai-lab/arq.

[407] Rethinking Nonlinearity: Trainable Gaussian Mixture Modules for Modern Neural Architectures

Weiguo Lu, Gangnan Yuan, Hong-kun Zhang, Shangyang Li

Main category: cs.LG

TL;DR: GMNM introduces Gaussian mixture-inspired nonlinear modules that enhance neural networks by leveraging Gaussian mixture models and distance properties, improving performance across various architectures.

Details

Motivation: Conventional neural networks are limited by their activation functions' nonlinearity; GMNM aims to provide more flexible and powerful nonlinear transformations.

Method: GMNM uses relaxed probabilistic constraints and flexible Gaussian projections, integrating seamlessly into neural architectures for end-to-end training with gradient methods.

Result: Incorporating GMNM into MLPs, CNNs, attention mechanisms, and LSTMs consistently improves performance over standard baselines.

Conclusion: GMNM is a powerful and flexible module that enhances efficiency and accuracy across diverse machine learning applications.

Abstract: Neural networks in general, from MLPs and CNNs to attention-based Transformers, are constructed from layers of linear combinations followed by nonlinear operations such as ReLU, Sigmoid, or Softmax. Despite their strength, these conventional designs are often limited in introducing non-linearity by the choice of activation functions. In this work, we introduce Gaussian Mixture-Inspired Nonlinear Modules (GMNM), a new class of differentiable modules that draw on the universal density approximation Gaussian mixture models (GMMs) and distance properties (metric space) of Gaussian kernal. By relaxing probabilistic constraints and adopting a flexible parameterization of Gaussian projections, GMNM can be seamlessly integrated into diverse neural architectures and trained end-to-end with gradient-based methods. Our experiments demonstrate that incorporating GMNM into architectures such as MLPs, CNNs, attention mechanisms, and LSTMs consistently improves performance over standard baselines. These results highlight GMNM’s potential as a powerful and flexible module for enhancing efficiency and accuracy across a wide range of machine learning applications.

[408] The Effect of Attention Head Count on Transformer Approximation

Penghao Yu, Haotian Jiang, Zeyu Bao, Ruoxi Yu, Qianxiao Li

Main category: cs.LG

TL;DR: This paper analyzes how the number of attention heads affects transformers’ expressive power, establishing theoretical bounds on parameter complexity for approximation and showing that sufficient heads enable efficient approximation while too few heads require exponential parameter scaling.

Details

Motivation: Despite transformers being dominant for sequence modeling, there's limited understanding of how structural parameters like attention heads influence expressive power. The paper aims to provide rigorous theoretical analysis of transformers' approximation properties.

Method: Introduces a generalized D-retrieval task as theoretical framework, establishes upper and lower bounds on parameter complexity for ε-approximation, analyzes single-head case with embedding dimension O(T), and validates with experiments on synthetic and real-world data.

Result: Shows transformers with sufficient heads admit efficient approximation, but with too few heads, parameters must scale as O(1/ε^{cT}) (exponential in sequence length). Single-head transformers can achieve memorization with O(T) embedding dimension via feed-forward block.

Conclusion: The number of attention heads is crucial for transformers’ expressive power - sufficient heads enable efficient approximation while insufficient heads require exponential parameter scaling. This provides first rigorous lower bounds in nonlinear, practical settings.

Abstract: Transformer has become the dominant architecture for sequence modeling, yet a detailed understanding of how its structural parameters influence expressive power remains limited. In this work, we study the approximation properties of transformers, with particular emphasis on the role of the number of attention heads. Our analysis begins with the introduction of a generalized $D$-retrieval task, which we prove to be dense in the space of continuous functions, thereby providing the basis for our theoretical framework. We then establish both upper and lower bounds on the parameter complexity required for $\epsilon$-approximation. Specifically, we show that transformers with sufficiently many heads admit efficient approximation, whereas with too few heads, the number of parameters must scale at least as $O(1/\epsilon^{cT})$, for some constant $c$ and sequence length $T$. To the best of our knowledge, this constitutes the first rigorous lower bound of this type in a nonlinear and practically relevant setting. We further examine the single-head case and demonstrate that an embedding dimension of order $O(T)$ allows complete memorization of the input, where approximation is entirely achieved by the feed-forward block. Finally, we validate our theoretical findings with experiments on both synthetic data and real-world tasks, illustrating the practical relevance of our results.

[409] XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation

Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, Fan Lai

Main category: cs.LG

TL;DR: XRPO is a reinforcement learning framework that improves LLM reasoning by adaptively allocating rollouts to prompts with higher uncertainty reduction potential and using in-context seeding for better exploration, while leveraging sequence likelihoods to amplify correct responses for enhanced exploitation.

Details

Motivation: Existing RL approaches like GRPO suffer from limited exploration on challenging prompts and underexploited feedback signals due to context-independent rollout allocation and heavy reliance on sparse rewards.

Method: XRPO introduces an adaptive rollout allocator for uncertainty-based prompt prioritization, in-context seeding for difficult reasoning trajectories, and a group-relative, novelty-aware advantage sharpening mechanism using sequence likelihoods.

Result: XRPO outperforms GRPO and GSPO by up to 4% pass@1 and 6% cons@32 across math and coding benchmarks, while accelerating training convergence by up to 2.7X.

Conclusion: XRPO provides a principled exploration-exploitation framework that significantly improves LLM reasoning performance and training efficiency through adaptive rollout allocation and enhanced advantage estimation.

Abstract: Reinforcement learning algorithms such as GRPO have driven recent advances in large language model (LLM) reasoning. While scaling the number of rollouts stabilizes training, existing approaches suffer from limited exploration on challenging prompts and leave informative feedback signals underexploited, due to context-independent rollout allocation across prompts (e.g., generating 16 rollouts per prompt) and relying heavily on sparse rewards. This paper presents XRPO(eXplore - eXploit GRPO), a unified framework that recasts policy optimization through the principled lens of rollout exploration-exploitation. To enhance exploration, XRPO introduces a mathematically grounded rollout allocator that adaptively prioritizes prompts with higher potential for uncertainty reduction. It further addresses stagnation on zero-reward prompts through an in-context seeding strategy that injects curated exemplars, steering the model into more difficult reasoning trajectories. To strengthen exploitation, XRPO develops a group-relative, novelty-aware advantage sharpening mechanism that leverages sequence likelihoods to amplify low-probability yet correct responses, thereby extending the policy’s reach beyond sparse rewards. Experiments across diverse math and coding benchmarks on both reasoning and non-reasoning models demonstrate that XRPO outperforms existing advances (e.g., GRPO and GSPO) up to 4% pass@1 and 6% cons@32, while accelerating training convergence by up to 2.7X.

[410] TimeFormer: Transformer with Attention Modulation Empowered by Temporal Characteristics for Time Series Forecasting

Zhipeng Liu, Peibo Duan, Xuan Tang, Baixin Li, Yongsheng Huang, Mingyang Geng, Changsheng Zhang, Bin Zhang, Binwu Wang

Main category: cs.LG

TL;DR: TimeFormer is a novel Transformer architecture for time series forecasting that incorporates temporal priors through a modulated self-attention mechanism and multi-scale analysis, achieving state-of-the-art performance.

Details

Motivation: Transformers are not well-suited for time series data due to insufficient consideration of temporal characteristics like unidirectional influence and decaying influence over time.

Method: Proposes TimeFormer with modulated self-attention (MoSA) that captures temporal priors under Hawkes process constraints and causal masking, plus multi-scale subsequence analysis for semantic dependencies.

Result: Significantly outperforms state-of-the-art methods with up to 7.45% MSE reduction, achieving new benchmarks on 94.04% of evaluation metrics. MoSA mechanism also enhances other Transformer models.

Conclusion: TimeFormer effectively bridges the gap between Transformers and time series forecasting by incorporating temporal characteristics, setting new performance standards while being broadly applicable to other Transformer architectures.

Abstract: Although Transformers excel in natural language processing, their extension to time series forecasting remains challenging due to insufficient consideration of the differences between textual and temporal modalities. In this paper, we develop a novel Transformer architecture designed for time series data, aiming to maximize its representational capacity. We identify two key but often overlooked characteristics of time series: (1) unidirectional influence from the past to the future, and (2) the phenomenon of decaying influence over time. These characteristics are introduced to enhance the attention mechanism of Transformers. We propose TimeFormer, whose core innovation is a self-attention mechanism with two modulation terms (MoSA), designed to capture these temporal priors of time series under the constraints of the Hawkes process and causal masking. Additionally, TimeFormer introduces a framework based on multi-scale and subsequence analysis to capture semantic dependencies at different temporal scales, enriching the temporal dependencies. Extensive experiments conducted on multiple real-world datasets show that TimeFormer significantly outperforms state-of-the-art methods, achieving up to a 7.45% reduction in MSE compared to the best baseline and setting new benchmarks on 94.04% of evaluation metrics. Moreover, we demonstrate that the MoSA mechanism can be broadly applied to enhance the performance of other Transformer-based models.

[411] Distributed Algorithms for Multi-Agent Multi-Armed Bandits with Collision

Daoyuan Zhou, Xuchuang Wang, Lin Yang, Yang Gao

Main category: cs.LG

TL;DR: A distributed algorithm for multiplayer multi-armed bandits with adaptive communication achieves near-optimal regret with only O(log log T) communication cost, outperforming SOTA methods.

Details

Motivation: To solve the multiplayer multi-armed bandit problem in distributed settings without central coordination, where collisions occur when players select the same arm and only local observations are available.

Method: Proposed a distributed algorithm with adaptive, efficient communication protocol that uses only O(log log T) communication cost while allowing players to observe only their own actions and collision feedback.

Result: Achieves near-optimal group and individual regret, with significant performance improvements over existing baselines and notable reduction in individual regret compared to SOTA methods.

Conclusion: The approach successfully addresses distributed MMAB with minimal communication overhead and extends to periodic asynchronous settings with logarithmic regret guarantees.

Abstract: We study the stochastic Multiplayer Multi-Armed Bandit (MMAB) problem, where multiple players select arms to maximize their cumulative rewards. Collisions occur when two or more players select the same arm, resulting in no reward, and are observed by the players involved. We consider a distributed setting without central coordination, where each player can only observe their own actions and collision feedback. We propose a distributed algorithm with an adaptive, efficient communication protocol. The algorithm achieves near-optimal group and individual regret, with a communication cost of only $\mathcal{O}(\log\log T)$. Our experiments demonstrate significant performance improvements over existing baselines. Compared to state-of-the-art (SOTA) methods, our approach achieves a notable reduction in individual regret. Finally, we extend our approach to a periodic asynchronous setting, proving the lower bound for this problem and presenting an algorithm that achieves logarithmic regret.

[412] AutoBalance: An Automatic Balancing Framework for Training Physics-Informed Neural Networks

Kang An, Chenhao Si, Ming Yan, Shiqian Ma

Main category: cs.LG

TL;DR: AutoBalance introduces a post-combine training paradigm for PINNs that assigns independent adaptive optimizers to each loss component, overcoming limitations of existing pre-combine methods that struggle with conflicting loss terms.

Details

Motivation: Training PINNs is difficult due to conflicting loss terms (PDE residuals and boundary conditions) with different curvatures. Existing pre-combine gradient manipulation methods are fundamentally limited as they disrupt optimizer preconditioning.

Method: AutoBalance uses a post-combine strategy where each loss component gets its own independent adaptive optimizer, and the resulting preconditioned updates are aggregated afterwards.

Result: Extensive experiments show AutoBalance consistently outperforms existing frameworks with significant reductions in solution error (MSE and L∞ norms). It also amplifies effectiveness of other PINN methodologies.

Conclusion: AutoBalance provides an effective training paradigm for PINNs that addresses fundamental limitations of existing methods and works complementarily with other approaches.

Abstract: Physics-Informed Neural Networks (PINNs) provide a powerful and general framework for solving Partial Differential Equations (PDEs) by embedding physical laws into loss functions. However, training PINNs is notoriously difficult due to the need to balance multiple loss terms, such as PDE residuals and boundary conditions, which often have conflicting objectives and vastly different curvatures. Existing methods address this issue by manipulating gradients before optimization (a “pre-combine” strategy). We argue that this approach is fundamentally limited, as forcing a single optimizer to process gradients from spectrally heterogeneous loss landscapes disrupts its internal preconditioning. In this work, we introduce AutoBalance, a novel “post-combine” training paradigm. AutoBalance assigns an independent adaptive optimizer to each loss component and aggregates the resulting preconditioned updates afterwards. Extensive experiments on challenging PDE benchmarks show that AutoBalance consistently outperforms existing frameworks, achieving significant reductions in solution error, as measured by both the MSE and $L^{\infty}$ norms. Moreover, AutoBalance is orthogonal to and complementary with other popular PINN methodologies, amplifying their effectiveness on demanding benchmarks.

[413] Is the Hard-Label Cryptanalytic Model Extraction Really Polynomial?

Akira Ito, Takayuki Miura, Yosuke Todo

Main category: cs.LG

TL;DR: This paper proposes CrossLayer Extraction, a novel model extraction attack that overcomes the exponential query complexity limitations of previous hard-label attacks by exploiting neuron interactions across layers.

Details

Motivation: Previous model extraction attacks in hard-label settings (Eurocrypt 2025) were shown to require exponential queries as network depth increases, making them impractical for deep networks. The authors aim to address this critical limitation.

Method: The proposed CrossLayer Extraction method avoids directly extracting secret parameters of specific neurons, which incurs exponential cost. Instead, it exploits neuron interactions across layers to extract information from deeper layers, significantly reducing query complexity.

Result: The new attack method significantly reduces query complexity compared to existing approaches and mitigates the limitations of previous model extraction methods, making extraction feasible even for deep networks.

Conclusion: CrossLayer Extraction provides a practical solution to model extraction in hard-label settings by overcoming the exponential query complexity barrier that plagued previous approaches, enabling efficient extraction of deep neural network models.

Abstract: Deep Neural Networks (DNNs) have attracted significant attention, and their internal models are now considered valuable intellectual assets. Extracting these internal models through access to a DNN is conceptually similar to extracting a secret key via oracle access to a block cipher. Consequently, cryptanalytic techniques, particularly differential-like attacks, have been actively explored recently. ReLU-based DNNs are the most commonly and widely deployed architectures. While early works (e.g., Crypto 2020, Eurocrypt 2024) assume access to exact output logits, which are usually invisible, more recent works (e.g., Asiacrypt 2024, Eurocrypt 2025) focus on the hard-label setting, where only the final classification result (e.g., “dog” or “car”) is available to the attacker. Notably, Carlini et al. (Eurocrypt 2025) demonstrated that model extraction is feasible in polynomial time even under this restricted setting. In this paper, we first show that the assumptions underlying their attack become increasingly unrealistic as the attack-target depth grows. In practice, satisfying these assumptions requires an exponential number of queries with respect to the attack depth, implying that the attack does not always run in polynomial time. To address this critical limitation, we propose a novel attack method called CrossLayer Extraction. Instead of directly extracting the secret parameters (e.g., weights and biases) of a specific neuron, which incurs exponential cost, we exploit neuron interactions across layers to extract this information from deeper layers. This technique significantly reduces query complexity and mitigates the limitations of existing model extraction approaches.

[414] A Diffusion Model for Regular Time Series Generation from Irregular Data with Completion and Masking

Gal Fadlon, Idan Arbiv, Nimrod Berman, Omri Azencot

Main category: cs.LG

TL;DR: A novel two-step framework for generating realistic irregular time series data that combines completion and masking to overcome limitations of prior methods.

Details

Motivation: Irregular sampling and missing values in time series data present challenges for generation methods, and existing approaches yield suboptimal results with high computational costs.

Method: Two-step framework: 1) Time Series Transformer completes irregular sequences to create natural neighborhoods, 2) Vision-based diffusion model with masking minimizes dependence on completed values.

Result: State-of-the-art performance with 70% relative improvement in discriminative score and 85% improvement in computational cost.

Conclusion: The proposed approach effectively leverages both completion and masking strategies to enable robust and efficient generation of realistic time series data.

Abstract: Generating realistic time series data is critical for applications in healthcare, finance, and science. However, irregular sampling and missing values present significant challenges. While prior methods address these irregularities, they often yield suboptimal results and incur high computational costs. Recent advances in regular time series generation, such as the diffusion-based ImagenTime model, demonstrate strong, fast, and scalable generative capabilities by transforming time series into image representations, making them a promising solution. However, extending ImagenTime to irregular sequences using simple masking introduces “unnatural” neighborhoods, where missing values replaced by zeros disrupt the learning process. To overcome this, we propose a novel two-step framework: first, a Time Series Transformer completes irregular sequences, creating natural neighborhoods; second, a vision-based diffusion model with masking minimizes dependence on the completed values. This approach leverages the strengths of both completion and masking, enabling robust and efficient generation of realistic time series. Our method achieves state-of-the-art performance, achieving a relative improvement in discriminative score by $70%$ and in computational cost by $85%$. Code is at https://github.com/azencot-group/ImagenI2R.

[415] Dual Goal Representations

Seohong Park, Deepinder Mann, Sergey Levine

Main category: cs.LG

TL;DR: Dual goal representations for GCRL encode states by their temporal distances to all other states, providing dynamics-invariant representations that improve goal-reaching performance across diverse tasks.

Details

Motivation: To create goal representations that are invariant to state representation and contain sufficient information for optimal goal-reaching policies while filtering out noise.

Method: Develop dual goal representations that characterize states by temporal distances to all other states, and combine this with existing GCRL algorithms.

Result: Consistent improvement in offline goal-reaching performance across 20 state- and pixel-based tasks in the OGBench task suite.

Conclusion: Dual goal representations provide theoretically sound and practically effective representations for GCRL that enhance performance across diverse environments.

Abstract: In this work, we introduce dual goal representations for goal-conditioned reinforcement learning (GCRL). A dual goal representation characterizes a state by “the set of temporal distances from all other states”; in other words, it encodes a state through its relations to every other state, measured by temporal distance. This representation provides several appealing theoretical properties. First, it depends only on the intrinsic dynamics of the environment and is invariant to the original state representation. Second, it contains provably sufficient information to recover an optimal goal-reaching policy, while being able to filter out exogenous noise. Based on this concept, we develop a practical goal representation learning method that can be combined with any existing GCRL algorithm. Through diverse experiments on the OGBench task suite, we empirically show that dual goal representations consistently improve offline goal-reaching performance across 20 state- and pixel-based tasks.

[416] Incorporating Expert Knowledge into Bayesian Causal Discovery of Mixtures of Directed Acyclic Graphs

Zachris Björkman, Jorge Loría, Sophie Wharrie, Samuel Kaski

Main category: cs.LG

TL;DR: The paper proposes a causal elicitation strategy for heterogeneous domains using Bayesian experimental design and a variational mixture structure learning method to infer mixtures of causal Bayesian networks.

Details

Motivation: Existing prior elicitation approaches assume a single causal graph, which is insufficient for heterogeneous domains where multiple causal models may exist. Domain expert knowledge is needed but current methods don't handle heterogeneity.

Method: Proposed causal elicitation strategy based on Bayesian experimental design (BED) principles and variational mixture structure learning (VaMSL) method, extending differentiable Bayesian structure learning (DiBS) to iteratively infer mixtures of causal Bayesian networks.

Result: The method successfully produces sets of alternative causal models (mixture components/clusters) and achieves improved structure learning performance on heterogeneous synthetic data when informed by simulated expert feedback.

Conclusion: The approach is capable of capturing complex distributions in real-world applications, as demonstrated with a breast cancer database, providing a framework for incorporating expert knowledge in heterogeneous causal discovery.

Abstract: Bayesian causal discovery benefits from prior information elicited from domain experts, and in heterogeneous domains any prior knowledge would be badly needed. However, so far prior elicitation approaches have assumed a single causal graph and hence are not suited to heterogeneous domains. We propose a causal elicitation strategy for heterogeneous settings, based on Bayesian experimental design (BED) principles, and a variational mixture structure learning (VaMSL) method – extending the earlier differentiable Bayesian structure learning (DiBS) method – to iteratively infer mixtures of causal Bayesian networks (CBNs). We construct an informative graph prior incorporating elicited expert feedback in the inference of mixtures of CBNs. Our proposed method successfully produces a set of alternative causal models (mixture components or clusters), and achieves an improved structure learning performance on heterogeneous synthetic data when informed by a simulated expert. Finally, we demonstrate that our approach is capable of capturing complex distributions in a breast cancer database.

[417] Function regression using the forward forward training and inferring paradigm

Shivam Padmani, Akshay Joshi

Main category: cs.LG

TL;DR: This paper introduces a new methodology for function regression using the Forward-Forward algorithm, extending it beyond classification tasks to function approximation for both univariate and multivariate functions.

Details

Motivation: Function regression is fundamental in machine learning, but the Forward-Forward algorithm has only been applied to classification tasks. The authors aim to extend this novel training approach to function approximation problems.

Method: Developed a new methodology for function regression using the Forward-Forward algorithm, which trains neural networks without backpropagation. The approach was tested on univariate and multivariate functions, with preliminary extensions to Kolmogorov Arnold Networks and Deep Physical Neural Networks.

Result: The paper successfully demonstrates that the Forward-Forward algorithm can be adapted for function regression tasks, providing a viable alternative to backpropagation-based training methods.

Conclusion: The Forward-Forward algorithm can be effectively extended from classification to function regression, opening new possibilities for neuromorphic computing and physical neural network implementations where backpropagation is impractical.

Abstract: Function regression/approximation is a fundamental application of machine learning. Neural networks (NNs) can be easily trained for function regression using a sufficient number of neurons and epochs. The forward-forward learning algorithm is a novel approach for training neural networks without backpropagation, and is well suited for implementation in neuromorphic computing and physical analogs for neural networks. To the best of the authors’ knowledge, the Forward Forward paradigm of training and inferencing NNs is currently only restricted to classification tasks. This paper introduces a new methodology for approximating functions (function regression) using the Forward-Forward algorithm. Furthermore, the paper evaluates the developed methodology on univariate and multivariate functions, and provides preliminary studies of extending the proposed Forward-Forward regression to Kolmogorov Arnold Networks, and Deep Physical Neural Networks.

[418] Modeling COVID-19 Dynamics in German States Using Physics-Informed Neural Networks

Phillip Rothenbeck, Sai Karthikeya Vemuri, Niklas Penzel, Joachim Denzler

Main category: cs.LG

TL;DR: PINNs used for spatio-temporal SIR model analysis of COVID-19 across German states, estimating transmission parameters and R_t over 3 years.

Details

Motivation: Need for quantitative modeling of disease dynamics and limitations of traditional compartmental models in handling noisy observational data.

Method: Physics-Informed Neural Networks (PINNs) applied to solve inverse SIR model using RKI infection data for fine-grained spatio-temporal analysis.

Result: Strong regional variations in transmission behavior correlated with vaccination uptake and pandemic phases; successful parameter estimation across all German states.

Conclusion: PINNs demonstrate utility for localized, long-term epidemiological modeling with ability to track pandemic progression through time-varying parameters.

Abstract: The COVID-19 pandemic has highlighted the need for quantitative modeling and analysis to understand real-world disease dynamics. In particular, post hoc analyses using compartmental models offer valuable insights into the effectiveness of public health interventions, such as vaccination strategies and containment policies. However, such compartmental models like SIR (Susceptible-Infectious-Recovered) often face limitations in directly incorporating noisy observational data. In this work, we employ Physics-Informed Neural Networks (PINNs) to solve the inverse problem of the SIR model using infection data from the Robert Koch Institute (RKI). Our main contribution is a fine-grained, spatio-temporal analysis of COVID-19 dynamics across all German federal states over a three-year period. We estimate state-specific transmission and recovery parameters and time-varying reproduction number (R_t) to track the pandemic progression. The results highlight strong variations in transmission behavior across regions, revealing correlations with vaccination uptake and temporal patterns associated with major pandemic phases. Our findings demonstrate the utility of PINNs in localized, long-term epidemiological modeling.

[419] Get RICH or Die Scaling: Profitably Trading Inference Compute for Robustness

Tavish McDonald, Bo Lei, Stanislav Fort, Bhavya Kailkhura, Brian Bartoldson

Main category: cs.LG

TL;DR: The paper introduces the Robustness from Inference Compute Hypothesis (RICH), showing that inference-compute defenses benefit when training data better reflects attacked data components, enabling compositional generalization to out-of-distribution adversarial inputs.

Details

Motivation: Address the limitation that test-time compute benefits fade when attackers have access to gradients or multimodal inputs, and clarify that inference-compute offers robustness benefits even in such challenging scenarios.

Method: Propose the RICH hypothesis and empirically validate it across vision language models and attack types, examining how compositional generalization enables adherence to defensive specifications on adversarially OOD inputs.

Result: Robustness gains from test-time compute occur when specification following on OOD data is unlocked by compositional generalization, with RL finetuning and protracted reasoning not being critical. Prompting emphasis on defensive specifications lowers success rates of gradient-based multimodal attacks on robustified VLMs.

Conclusion: Inference-compute’s robustness benefit correlates with base model robustness, creating a rich-get-richer dynamic. The paper advises layering train-time and test-time defenses for synergistic benefits.

Abstract: Models are susceptible to adversarially out-of-distribution (OOD) data despite large training-compute investments into their robustification. Zaremba et al. (2025) make progress on this problem at test time, showing LLM reasoning improves satisfaction of model specifications designed to thwart attacks, resulting in a correlation between reasoning effort and robustness to jailbreaks. However, this benefit of test compute fades when attackers are given access to gradients or multimodal inputs. We address this gap, clarifying that inference-compute offers benefits even in such cases. Our approach argues that compositional generalization, through which OOD data is understandable via its in-distribution (ID) components, enables adherence to defensive specifications on adversarially OOD inputs. Namely, we posit the Robustness from Inference Compute Hypothesis (RICH): inference-compute defenses profit as the model’s training data better reflects the attacked data’s components. We empirically support this hypothesis across vision language model and attack types, finding robustness gains from test-time compute if specification following on OOD data is unlocked by compositional generalization, while RL finetuning and protracted reasoning are not critical. For example, increasing emphasis on defensive specifications via prompting lowers the success rate of gradient-based multimodal attacks on VLMs robustified by adversarial pretraining, but this same intervention provides no such benefit to not-robustified models. This correlation of inference-compute’s robustness benefit with base model robustness is the rich-get-richer dynamic of the RICH: attacked data components are more ID for robustified models, aiding compositional generalization to OOD data. Accordingly, we advise layering train-time and test-time defenses to obtain their synergistic benefit.

[420] The Unreasonable Effectiveness of Randomized Representations in Online Continual Graph Learning

Giovanni Donghi, Daniele Zambon, Luca Pasa, Cesare Alippi, Nicolò Navarin

Main category: cs.LG

TL;DR: A simple approach for Online Continual Graph Learning that uses a fixed random encoder and trains only a lightweight classifier online, achieving state-of-the-art performance without memory buffers.

Details

Motivation: Catastrophic forgetting is a major challenge in Online Continual Graph Learning where nodes arrive sequentially and distribution drifts occur, making offline training infeasible.

Method: Use a fixed, randomly initialized encoder to generate node embeddings by aggregating neighborhood information, while training only a lightweight classifier online. This eliminates representation parameter drifts that cause forgetting.

Result: Consistent gains over state-of-the-art methods across several benchmarks, with up to 30% improvement, often approaching joint offline-training upper bound performance.

Conclusion: Catastrophic forgetting in OCGL can be effectively minimized through architectural simplicity and stability rather than complex replay or regularization techniques.

Abstract: Catastrophic forgetting is one of the main obstacles for Online Continual Graph Learning (OCGL), where nodes arrive one by one, distribution drifts may occur at any time and offline training on task-specific subgraphs is not feasible. In this work, we explore a surprisingly simple yet highly effective approach for OCGL: we use a fixed, randomly initialized encoder to generate robust and expressive node embeddings by aggregating neighborhood information, training online only a lightweight classifier. By freezing the encoder, we eliminate drifts of the representation parameters, a key source of forgetting, obtaining embeddings that are both expressive and stable. When evaluated across several OCGL benchmarks, despite its simplicity and lack of memory buffer, this approach yields consistent gains over state-of-the-art methods, with surprising improvements of up to 30% and performance often approaching that of the joint offline-training upper bound. These results suggest that in OCGL, catastrophic forgetting can be minimized without complex replay or regularization by embracing architectural simplicity and stability.

[421] Efficient numeracy in language models through single-token number embeddings

Linus Kreitner, Paul Hager, Jonathan Mengedoht, Georgios Kaissis, Daniel Rueckert, Martin J. Menten

Main category: cs.LG

TL;DR: BitTokens is a novel tokenization method that encodes numbers as single tokens using IEEE 754 binary floating-point representation, enabling LLMs to handle numerical data more efficiently and solve arithmetic operations nearly perfectly.

Details

Motivation: Current LLMs struggle with numerical data processing due to excessive reasoning tokens and poor tokenization strategies that split numbers into multiple tokens, limiting their numerical intuition and problem-solving capabilities.

Method: Proposed BitTokens - a tokenization strategy that embeds any number into a single token using its IEEE 754 binary floating-point representation, with defined desiderata for effective number encodings.

Result: Extensive experiments show BitTokens enable even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly, improving efficiency and expanding problem-solving capabilities.

Conclusion: BitTokens provide an efficient single-token number encoding that could significantly expand the length and complexity of problems language models can solve by addressing current limitations in numerical data processing.

Abstract: To drive progress in science and engineering, large language models (LLMs) must be able to process large amounts of numerical data and solve long calculations efficiently. This is currently only possible through the use of external tools or extensive reasoning chains, either limiting the numerical intuition of LLMs or limiting the length of problems they can solve. We show that frontier LLMs require excessive amounts of reasoning tokens to solve even basic calculations, which is exacerbated by their tokenization strategies that split single numbers into multiple tokens. This motivates the need for efficient and effective single-token number encodings. We introduce a set of desiderata for such encodings and show that existing approaches fail to fulfill them. To address these shortcomings, we propose BitTokens, a novel tokenization strategy that embeds any number into a single token using its IEEE 754 binary floating-point representation. Through extensive experiments we show that our BitTokens allow even small language models to learn algorithms that solve basic arithmetic operations nearly perfectly. This newly gained efficiency could expand the length and complexity of problems language models can solve.

[422] Recurrence-Complete Frame-based Action Models

Michael Keiblinger

Main category: cs.LG

TL;DR: This paper challenges the view that attention mechanisms alone are sufficient, arguing that recurrence is necessary for long-running agentic tasks. It introduces a recurrence-complete architecture that shows improved performance with longer training sequences.

Details

Motivation: The authors challenge the claim from "Attention Is All You Need" that RNN cells are unnecessary with attention, pointing to proofs that fully parallelizable architectures cannot handle certain classes of problems important for long-running agentic tasks.

Method: The paper introduces a recurrence-complete architecture and trains it on GitHub-derived action sequences. The approach maintains fixed parameter count while training on increasingly longer sequences.

Result: The loss follows a power law in the trained sequence length, and longer-sequence training amortizes its linearly increasing wall-time cost, yielding lower loss as a function of wall time.

Conclusion: Recurrence is essential for handling long-running agentic tasks, and the proposed recurrence-complete architecture demonstrates improved scaling properties with longer training sequences.

Abstract: In recent years, attention-like mechanisms have been used to great success in the space of large language models, unlocking scaling potential to a previously unthinkable extent. “Attention Is All You Need” famously claims RNN cells are not needed in conjunction with attention. We challenge this view. In this paper, we point to existing proofs that architectures with fully parallelizable forward or backward passes cannot represent classes of problems specifically interesting for long-running agentic tasks. We further conjecture a critical time t beyond which non-recurrence-complete models fail to aggregate inputs correctly, with concrete implications for agentic systems (e.g., software engineering agents). To address this, we introduce a recurrence-complete architecture and train it on GitHub-derived action sequences. Loss follows a power law in the trained sequence length while the parameter count remains fixed. Moreover, longer-sequence training always amortizes its linearly increasing wall-time cost, yielding lower loss as a function of wall time.

[423] Early wind turbine alarm prediction based on machine learning: AlarmForecasting

Syed Shazaib Shah, Daoliang Tan

Main category: cs.LG

TL;DR: The paper proposes an Alarm Forecasting and Classification (AFC) framework that predicts wind turbine alarms before they trigger, using LSTM-based regression for time-series forecasting followed by classification for alarm tagging.

Details

Motivation: Traditional approaches use alarm data only for diagnostics after faults occur, but this study aims to prevent alarms from triggering altogether to avert impending failures and enhance operational efficiency.

Method: A two-module AFC framework: (1) LSTM-based regression module for time-series alarm forecasting, (2) classification module for alarm tagging on forecasted alarms, enabling forecasting of entire alarm taxonomy rather than specific alarms.

Result: Tested on 14 Senvion MM82 turbines over 5 years, achieving 82%, 52%, and 41% accuracy for 10, 20, and 30-minute alarm forecasts respectively.

Conclusion: The framework successfully anticipates and averts alarms, significantly reducing alarm frequency and enabling proactive intervention to enhance operational efficiency.

Abstract: Alarm data is pivotal in curbing fault behavior in Wind Turbines (WTs) and forms the backbone for advancedpredictive monitoring systems. Traditionally, research cohorts have been confined to utilizing alarm data solelyas a diagnostic tool, merely indicative of unhealthy status. However, this study aims to offer a transformativeleap towards preempting alarms, preventing alarms from triggering altogether, and consequently avertingimpending failures. Our proposed Alarm Forecasting and Classification (AFC) framework is designed on twosuccessive modules: first, the regression module based on long short-term memory (LSTM) for time-series alarmforecasting, and thereafter, the classification module to implement alarm tagging on the forecasted alarm. Thisway, the entire alarm taxonomy can be forecasted reliably rather than a few specific alarms. 14 Senvion MM82turbines with an operational period of 5 years are used as a case study; the results demonstrated 82%, 52%,and 41% accurate forecasts for 10, 20, and 30 min alarm forecasts, respectively. The results substantiateanticipating and averting alarms, which is significant in curbing alarm frequency and enhancing operationalefficiency through proactive intervention.

[424] Vectorized FlashAttention with Low-cost Exponential Computation in RISC-V Vector Processors

Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos

Main category: cs.LG

TL;DR: This paper presents the first vectorized implementation of FlashAttention algorithm for RISC-V vector processors, using exponential approximations and tiling strategies to improve performance.

Details

Motivation: To accelerate attention kernels in machine learning models by vectorizing FlashAttention for RISC-V processors, reducing scalar code and computational complexity.

Method: Vectorized FlashAttention implementation using low-cost exponential approximations in floating-point arithmetic without custom ISA extensions, combined with tiling strategies for memory locality.

Result: Experimental results show scalable approach with significant performance gains in processing attention layers for practical applications.

Conclusion: The vectorized FlashAttention implementation successfully accelerates attention kernels on RISC-V vector processors through efficient exponential approximations and memory optimization strategies.

Abstract: Attention is a core operation in numerous machine learning and artificial intelligence models. This work focuses on the acceleration of attention kernel using FlashAttention algorithm, in vector processors, particularly those based on the RISC-V instruction set architecture (ISA). This work represents the first effort to vectorize FlashAttention, minimizing scalar code and simplifying the computational complexity of evaluating exponentials needed by softmax used in attention. By utilizing a low-cost approximation for exponentials in floating-point arithmetic, we reduce the cost of computing the exponential function without the need to extend baseline vector ISA with new custom instructions. Also, appropriate tiling strategies are explored with the goal to improve memory locality. Experimental results highlight the scalability of our approach, demonstrating significant performance gains with the vectorized implementations when processing attention layers in practical applications.

[425] CNN-TFT explained by SHAP with multi-head attention weights for time series forecasting

Stefano F. Stefenon, João P. Matos-Carvalho, Valderi R. Q. Leithardt, Kin-Choong Yow

Main category: cs.LG

TL;DR: Proposes CNN-TFT-SHAP-MHAW, a hybrid architecture combining CNN for local pattern extraction and Temporal Fusion Transformer for long-range dependencies, achieving 2.2% MAPE on hydroelectric flow forecasting with explainability via SHAP-MHAW.

Details

Motivation: To leverage complementary strengths of CNNs (local patterns, translational invariances) and transformers (long-range dependencies) for improved multivariate time series forecasting.

Method: Hybrid CNN-TFT architecture: CNN module applies 1D convolutions for local pattern extraction and noise reduction, then feeds feature maps to TFT with multi-head attention for capturing short- and long-term dependencies and adaptive covariate weighting.

Result: Outperforms established deep learning models with 2.2% mean absolute percentage error on hydroelectric natural flow dataset. Model explainability achieved through proposed SHAP-MHAW method.

Conclusion: CNN-TFT-SHAP-MHAW is promising for high-fidelity multivariate time series forecasting applications, with code available for future analysis.

Abstract: Convolutional neural networks (CNNs) and transformer architectures offer strengths for modeling temporal data: CNNs excel at capturing local patterns and translational invariances, while transformers effectively model long-range dependencies via self-attention. This paper proposes a hybrid architecture integrating convolutional feature extraction with a temporal fusion transformer (TFT) backbone to enhance multivariate time series forecasting. The CNN module first applies a hierarchy of one-dimensional convolutional layers to distill salient local patterns from raw input sequences, reducing noise and dimensionality. The resulting feature maps are then fed into the TFT, which applies multi-head attention to capture both short- and long-term dependencies and to weigh relevant covariates adaptively. We evaluate the CNN-TFT on a hydroelectric natural flow time series dataset. Experimental results demonstrate that CNN-TFT outperforms well-established deep learning models, with a mean absolute percentage error of up to 2.2%. The explainability of the model is obtained by a proposed Shapley additive explanations with multi-head attention weights (SHAP-MHAW). Our novel architecture, named CNN-TFT-SHAP-MHAW, is promising for applications requiring high-fidelity, multivariate time series forecasts, being available for future analysis at https://github.com/SFStefenon/CNN-TFT-SHAP-MHAW .

[426] Enhancing Bankruptcy Prediction of Banks through Advanced Machine Learning Techniques: An Innovative Approach and Analysis

Zuherman Rustam, Sri Hartini, Sardar M. N. Islam, Fevi Novkaniza, Fiftitah R. Aszhari, Muhammad Rifqi

Main category: cs.LG

TL;DR: Machine learning models (LR, RF, SVM) outperform statistical methods for bank bankruptcy prediction, achieving 90% accuracy with RF on commercial bank data and accurate predictions for rural banks.

Details

Motivation: Traditional statistical methods like Altman's Z-Score have limitations with rigid assumptions and low accuracy, necessitating more effective approaches for financial system stability through better bankruptcy prediction.

Method: Used logistic regression, random forest, and support vector machines on commercial bank data from Turkey (1994-2004) and rural bank data from Indonesia (2013-2019) to develop bankruptcy prediction models.

Result: Random forest achieved 90% accuracy in predicting commercial bank bankruptcy, and all three machine learning methods accurately predicted rural bank bankruptcy likelihood.

Conclusion: The machine learning approach provides an effective tool for bankruptcy prediction, helping implement policies to reduce bankruptcy costs and maintain financial system stability.

Abstract: Context: Financial system stability is determined by the condition of the banking system. A bank failure can destroy the stability of the financial system, as banks are subject to systemic risk, affecting not only individual banks but also segments or the entire financial system. Calculating the probability of a bank going bankrupt is one way to ensure the banking system is safe and sound. Existing literature and limitations: Statistical models, such as Altman’s Z-Score, are one of the common techniques for developing a bankruptcy prediction model. However, statistical methods rely on rigid and sometimes irrelevant assumptions, which can result in low forecast accuracy. New approaches are necessary. Objective of the research: Bankruptcy models are developed using machine learning techniques, such as logistic regression (LR), random forest (RF), and support vector machines (SVM). According to several studies, machine learning is also more accurate and effective than statistical methods for categorising and forecasting banking risk management. Present Research: The commercial bank data are derived from the annual financial statements of 44 active banks and 21 bankrupt banks in Turkey from 1994 to 2004, and the rural bank data are derived from the quarterly financial reports of 43 active and 43 bankrupt rural banks in Indonesia between 2013 and 2019. Five rural banks in Indonesia have also been selected to demonstrate the feasibility of analysing bank bankruptcy trends. Findings and implications: The results of the research experiments show that RF can forecast data from commercial banks with a 90% accuracy rate. Furthermore, the three machine learning methods proposed accurately predict the likelihood of rural bank bankruptcy. Contribution and Conclusion: The proposed innovative machine learning approach help to implement policies that reduce the costs of bankruptcy.

[427] Towards Generalization of Graph Neural Networks for AC Optimal Power Flow

Olayiwola Arowolo, Jochen L. Cremer

Main category: cs.LG

TL;DR: Proposes HH-MPNN for scalable ACOPF solving, achieving 1-3% optimality gap and 1,000-10,000x speedup over conventional solvers.

Details

Motivation: ACOPF is computationally expensive for large power systems, and existing ML methods lack scalability and topology adaptability without costly retraining.

Method: Hybrid Heterogeneous Message Passing Neural Network that models different grid components as distinct node/edge types, combined with transformer for long-range dependencies.

Result: Achieves <1% optimality gap on default topologies (14-2,000 buses), <3% gap zero-shot on unseen topologies, and 1,000-10,000x computational speedup.

Conclusion: HH-MPNN advances practical, generalizable ML for real-time power system operations with strong scalability and topology adaptability.

Abstract: AC Optimal Power Flow (ACOPF) is computationally expensive for large-scale power systems, with conventional solvers requiring prohibitive solution times. Machine learning approaches offer computational speedups but struggle with scalability and topology adaptability without expensive retraining. To enable scalability across grid sizes and adaptability to topology changes, we propose a Hybrid Heterogeneous Message Passing Neural Network (HH-MPNN). HH-MPNN models buses, generators, loads, shunts, transmission lines and transformers as distinct node or edge types, combined with a scalable transformer model for handling long-range dependencies. On grids from 14 to 2,000 buses, HH-MPNN achieves less than 1% optimality gap on default topologies. Applied zero-shot to thousands of unseen topologies, HH-MPNN achieves less than 3% optimality gap despite training only on default topologies. Pre-training on smaller grids also improves results on a larger grid. Computational speedups reach 1,000x to 10,000x compared to interior point solvers. These results advance practical, generalizable machine learning for real-time power system operations.

[428] SaFeR-VLM: Toward Safety-aware Fine-grained Reasoning in Multimodal Models

Huahui Yi, Kun Wang, Qiankun Li, Miao Yu, Liang Lin, Gongli Xi, Hao Wu, Xuming Hu, Kang Li, Yang Liu

Main category: cs.LG

TL;DR: SaFeR-VLM is a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning, addressing the “Reasoning Tax” phenomenon where MLRMs amplify safety risks.

Details

Motivation: Existing defenses mainly act at output level without constraining reasoning process, leaving models exposed to implicit safety risks in multimodal reasoning.

Method: Four-component framework: QI-Safe-10K dataset, safety-aware rollout with reflection/correction, structured reward modeling with penalties, and GRPO optimization to reinforce safe trajectories.

Result: SaFeR-VLM-3B achieves 70.13 and 78.97 on safety/helpfulness across benchmarks, surpassing larger models. SaFeR-VLM-7B exceeds GPT-5-mini and Gemini-2.5-Flash by 6.47 and 16.76 points on safety without helpfulness degradation.

Conclusion: The framework shifts safety from passive safeguard to active driver of reasoning, enabling scalable and generalizable safety-aware reasoning with robustness against explicit and implicit risks.

Abstract: Multimodal Large Reasoning Models (MLRMs) demonstrate impressive cross-modal reasoning but often amplify safety risks under adversarial or unsafe prompts, a phenomenon we call the \textit{Reasoning Tax}. Existing defenses mainly act at the output level and do not constrain the reasoning process, leaving models exposed to implicit risks. In this paper, we propose SaFeR-VLM, a safety-aligned reinforcement learning framework that embeds safety directly into multimodal reasoning. The framework integrates four components: (I) QI-Safe-10K, a curated dataset emphasizing safety-critical and reasoning-sensitive cases; (II) safety-aware rollout, where unsafe generations undergo reflection and correction instead of being discarded; (III) structured reward modeling with multi-dimensional weighted criteria and explicit penalties for hallucinations and contradictions; and (IV) GRPO optimization, which reinforces both safe and corrected trajectories. This unified design shifts safety from a passive safeguard to an active driver of reasoning, enabling scalable and generalizable safety-aware reasoning. SaFeR-VLM further demonstrates robustness against both explicit and implicit risks, supporting dynamic and interpretable safety decisions beyond surface-level filtering. SaFeR-VLM-3B achieves average performance $70.13$ and $78.97$ on safety and helpfulness across six benchmarks, surpassing both same-scale and $>10\times$ larger models such as Skywork-R1V3-38B, Qwen2.5VL-72B, and GLM4.5V-106B. Remarkably, SaFeR-VLM-7B benefits from its increased scale to surpass GPT-5-mini and Gemini-2.5-Flash by \num{6.47} and \num{16.76} points respectively on safety metrics, achieving this improvement without any degradation in helpfulness performance. Our codes are available at https://github.com/HarveyYi/SaFeR-VLM.

[429] MoRE-GNN: Multi-omics Data Integration with a Heterogeneous Graph Autoencoder

Zhiyu Wang, Sonia Koszut, Pietro Liò, Francesco Ceccarelli

Main category: cs.LG

TL;DR: MoRE-GNN is a heterogeneous graph autoencoder that uses graph convolution and attention to dynamically build relational graphs from multi-omics single-cell data, achieving superior performance in capturing biological relationships and cross-modal predictions.

Details

Motivation: Multi-omics single-cell data integration is challenging due to high-dimensionality and complex inter-modality relationships, requiring advanced methods to effectively capture these complex interactions.

Method: MoRE-GNN combines graph convolution and attention mechanisms in a heterogeneous graph autoencoder framework to dynamically construct relational graphs directly from multi-omics data.

Result: Evaluations on six datasets show MoRE-GNN captures biologically meaningful relationships, outperforms existing methods especially with strong inter-modality correlations, and enables accurate cross-modal predictions.

Conclusion: MoRE-GNN provides an adaptive, scalable and interpretable framework for multi-omics integration, though performance may vary with dataset complexity.

Abstract: The integration of multi-omics single-cell data remains challenging due to high-dimensionality and complex inter-modality relationships. To address this, we introduce MoRE-GNN (Multi-omics Relational Edge Graph Neural Network), a heterogeneous graph autoencoder that combines graph convolution and attention mechanisms to dynamically construct relational graphs directly from data. Evaluations on six publicly available datasets demonstrate that MoRE-GNN captures biologically meaningful relationships and outperforms existing methods, particularly in settings with strong inter-modality correlations. Furthermore, the learned representations allow for accurate downstream cross-modal predictions. While performance may vary with dataset complexity, MoRE-GNN offers an adaptive, scalable and interpretable framework for advancing multi-omics integration.

[430] Angular Constraint Embedding via SpherePair Loss for Constrained Clustering

Shaojie Zhang, Ke Chen

Main category: cs.LG

TL;DR: SpherePair is a novel deep constrained clustering method that uses angular constraint embedding to effectively separate representation learning from clustering, enabling scalable and effective constrained clustering without requiring the exact number of clusters.

Details

Motivation: Existing deep constrained clustering methods are limited by anchors in end-to-end modeling or struggle with learning discriminative Euclidean embedding, restricting their scalability and real-world applicability.

Method: Proposes SpherePair loss with geometric formulation that encodes pairwise constraints and leads to clustering-friendly embeddings in angular space, effectively separating representation learning from clustering.

Result: SpherePair achieves superior performance compared to state-of-the-art DCC methods on diverse benchmarks, demonstrating better scalability and real-world effectiveness while preserving pairwise relations without conflict.

Conclusion: SpherePair provides a theoretically-grounded approach for deep constrained clustering that generalizes well to unseen data, enables rapid cluster number inference, and overcomes limitations of existing methods.

Abstract: Constrained clustering integrates domain knowledge through pairwise constraints. However, existing deep constrained clustering (DCC) methods are either limited by anchors inherent in end-to-end modeling or struggle with learning discriminative Euclidean embedding, restricting their scalability and real-world applicability. To avoid their respective pitfalls, we propose a novel angular constraint embedding approach for DCC, termed SpherePair. Using the SpherePair loss with a geometric formulation, our method faithfully encodes pairwise constraints and leads to embeddings that are clustering-friendly in angular space, effectively separating representation learning from clustering. SpherePair preserves pairwise relations without conflict, removes the need to specify the exact number of clusters, generalizes to unseen data, enables rapid inference of the number of clusters, and is supported by rigorous theoretical guarantees. Comparative evaluations with state-of-the-art DCC methods on diverse benchmarks, along with empirical validation of theoretical insights, confirm its superior performance, scalability, and overall real-world effectiveness. Code is available at \href{https://github.com/spherepaircc/SpherePairCC/tree/main}{our repository}.

[431] Vacuum Spiker: A Spiking Neural Network-Based Model for Efficient Anomaly Detection in Time Series

Iago Xabier Vázquez, Javier Sedano, Muhammad Afzal, Ángel Miguel García-Vico

Main category: cs.LG

TL;DR: The paper introduces Vacuum Spiker, a novel Spiking Neural Network-based method for energy-efficient anomaly detection in time series, using global neural activity changes instead of reconstruction errors.

Details

Motivation: Address the high energy consumption of deep learning models in anomaly detection, which limits deployment in resource-constrained environments like IoT devices, edge computing, and wearables.

Method: Proposes Vacuum Spiker algorithm with: 1) New detection criterion based on global neural activity changes, 2) Spike Time-Dependent Plasticity training to induce activity changes during anomalies, 3) Efficient encoding scheme that discretizes input space into non-overlapping intervals with single spike per time step.

Result: Achieves competitive performance on public datasets while significantly reducing energy consumption compared to deep learning and machine learning baselines. Validated in real-world case study detecting power curtailment events in solar inverters.

Conclusion: The method shows potential for sustainable and efficient anomaly detection in resource-constrained environments.

Abstract: Anomaly detection is a key task across domains such as industry, healthcare, and cybersecurity. Many real-world anomaly detection problems involve analyzing multiple features over time, making time series analysis a natural approach for such problems. While deep learning models have achieved strong performance in this field, their trend to exhibit high energy consumption limits their deployment in resource-constrained environments such as IoT devices, edge computing platforms, and wearables. To address this challenge, this paper introduces the \textit{Vacuum Spiker algorithm}, a novel Spiking Neural Network-based method for anomaly detection in time series. It incorporates a new detection criterion that relies on global changes in neural activity rather than reconstruction or prediction error. It is trained using Spike Time-Dependent Plasticity in a novel way, intended to induce changes in neural activity when anomalies occur. A new efficient encoding scheme is also proposed, which discretizes the input space into non-overlapping intervals, assigning each to a single neuron. This strategy encodes information with a single spike per time step, improving energy efficiency compared to conventional encoding methods. Experimental results on publicly available datasets show that the proposed algorithm achieves competitive performance while significantly reducing energy consumption, compared to a wide set of deep learning and machine learning baselines. Furthermore, its practical utility is validated in a real-world case study, where the model successfully identifies power curtailment events in a solar inverter. These results highlight its potential for sustainable and efficient anomaly detection.

[432] Utilizing Large Language Models for Machine Learning Explainability

Alexandros Vassiliades, Nikolaos Polatidis, Stamatios Samaras, Sotiris Diplaris, Ignacio Cabrera Martin, Yannis Manolopoulos, Stefanos Vrochidis, Ioannis Kompatsiaris

Main category: cs.LG

TL;DR: LLMs can autonomously generate effective and interpretable machine learning pipelines for classification tasks, achieving performance and explainability metrics comparable to manually engineered solutions.

Details

Motivation: To explore the explainability capabilities of large language models when used to autonomously generate machine learning solutions, examining whether they can produce both effective and interpretable models.

Method: Used three state-of-the-art LLMs (OpenAI GPT, Anthropic Claude, DeepSeek) to design training pipelines for four classifiers (Random Forest, XGBoost, MLP, LSTM) on two classification tasks (binary driver alertness prediction and multilabel yeast dataset). Evaluated models using predictive performance metrics (recall, precision, F1-score) and explainability using SHAP (measuring fidelity via MSE and sparsity via influential feature count).

Result: LLMs successfully generated effective and interpretable models with high fidelity and consistent sparsity, closely matching manually engineered baselines in both predictive performance and explainability metrics.

Conclusion: LLMs demonstrate significant potential as automated tools for interpretable machine learning pipeline generation, capable of producing solutions that are both effective and explainable.

Abstract: This study explores the explainability capabilities of large language models (LLMs), when employed to autonomously generate machine learning (ML) solutions. We examine two classification tasks: (i) a binary classification problem focused on predicting driver alertness states, and (ii) a multilabel classification problem based on the yeast dataset. Three state-of-the-art LLMs (i.e. OpenAI GPT, Anthropic Claude, and DeepSeek) are prompted to design training pipelines for four common classifiers: Random Forest, XGBoost, Multilayer Perceptron, and Long Short-Term Memory networks. The generated models are evaluated in terms of predictive performance (recall, precision, and F1-score) and explainability using SHAP (SHapley Additive exPlanations). Specifically, we measure Average SHAP Fidelity (Mean Squared Error between SHAP approximations and model outputs) and Average SHAP Sparsity (number of features deemed influential). The results reveal that LLMs are capable of producing effective and interpretable models, achieving high fidelity and consistent sparsity, highlighting their potential as automated tools for interpretable ML pipeline generation. The results show that LLMs can produce effective, interpretable pipelines with high fidelity and consistent sparsity, closely matching manually engineered baselines.

[433] DecompGAIL: Learning Realistic Traffic Behaviors with Decomposed Multi-Agent Generative Adversarial Imitation Learning

Ke Guo, Haochen Liu, Xiaojun Wu, Chen Lv

Main category: cs.LG

TL;DR: DecompGAIL addresses instability in multi-agent imitation learning by decomposing realism into ego-map and ego-neighbor components, filtering irrelevant interactions and using social PPO to improve overall traffic simulation realism.

Details

Motivation: Existing imitation learning approaches fail to model realistic traffic behaviors - behavior cloning suffers from covariate shift while GAIL is unstable in multi-agent settings due to irrelevant interaction misguidance.

Method: Proposes Decomposed Multi-agent GAIL (DecompGAIL) that explicitly decomposes realism into ego-map and ego-neighbor components, filters misleading neighbor interactions, and uses social PPO with distance-weighted neighborhood rewards.

Result: Achieves state-of-the-art performance on the WOMD Sim Agents 2025 benchmark when integrated into a lightweight SMART-based backbone.

Conclusion: DecompGAIL effectively addresses multi-agent imitation learning instability by decomposing realism and filtering irrelevant interactions, leading to superior traffic simulation performance.

Abstract: Realistic traffic simulation is critical for the development of autonomous driving systems and urban mobility planning, yet existing imitation learning approaches often fail to model realistic traffic behaviors. Behavior cloning suffers from covariate shift, while Generative Adversarial Imitation Learning (GAIL) is notoriously unstable in multi-agent settings. We identify a key source of this instability: irrelevant interaction misguidance, where a discriminator penalizes an ego vehicle’s realistic behavior due to unrealistic interactions among its neighbors. To address this, we propose Decomposed Multi-agent GAIL (DecompGAIL), which explicitly decomposes realism into ego-map and ego-neighbor components, filtering out misleading neighbor: neighbor and neighbor: map interactions. We further introduce a social PPO objective that augments ego rewards with distance-weighted neighborhood rewards, encouraging overall realism across agents. Integrated into a lightweight SMART-based backbone, DecompGAIL achieves state-of-the-art performance on the WOMD Sim Agents 2025 benchmark.

[434] Revisiting Node Affinity Prediction in Temporal Graphs

Krishna Sri Ipsit Mantri, Or Feldman, Moshe Eliasof, Chaim Baskin

Main category: cs.LG

TL;DR: NAViS is a node affinity prediction model that uses virtual state and a novel loss function to outperform both state-of-the-art temporal graph neural networks and simple heuristics like Persistent Forecast and Moving Average.

Details

Motivation: Current temporal graph neural networks underperform simple heuristics for node affinity prediction tasks, despite being widely used in applications like social networks, financial networks, and recommender systems.

Method: Developed NAViS by analyzing training challenges in temporal GNNs, exploiting equivalence between heuristics and state space models using virtual state, and introducing a novel loss function specifically for node affinity prediction.

Result: NAViS outperforms state-of-the-art models including heuristics on the TGB benchmark dataset.

Conclusion: The proposed NAViS model successfully addresses the limitations of current temporal GNNs for node affinity prediction through virtual state modeling and specialized loss functions.

Abstract: Node affinity prediction is a common task that is widely used in temporal graph learning with applications in social and financial networks, recommender systems, and more. Recent works have addressed this task by adapting state-of-the-art dynamic link property prediction models to node affinity prediction. However, simple heuristics, such as Persistent Forecast or Moving Average, outperform these models. In this work, we analyze the challenges in training current Temporal Graph Neural Networks for node affinity prediction and suggest appropriate solutions. Combining the solutions, we develop NAViS - Node Affinity prediction model using Virtual State, by exploiting the equivalence between heuristics and state space models. While promising, training NAViS is non-trivial. Therefore, we further introduce a novel loss function for node affinity prediction. We evaluate NAViS on TGB and show that it outperforms the state-of-the-art, including heuristics. Our source code is available at https://github.com/orfeld415/NAVIS

[435] Fisher Information, Training and Bias in Fourier Regression Models

Lorenzo Pastori, Veronika Eyring, Mierk Schwabe

Main category: cs.LG

TL;DR: The paper studies how Fisher information matrix (FIM) metrics predict quantum neural network (QNN) training and performance, showing that higher effective dimension benefits unbiased models while lower effective dimension helps biased models.

Details

Motivation: Growing interest in quantum machine learning and quantum neural networks (QNNs), with a focus on understanding how evaluation metrics based on Fisher information matrix can predict training and prediction performance.

Method: Exploit equivalence between QNNs and Fourier models, derive analytical expression of FIM for Fourier models, identify features controlling effective dimension, construct models with tunable effective dimension and bias, and introduce tensor network representation of Fourier models.

Result: For unbiased models (agnostic to target function), higher effective dimension leads to better trainability and performance. For biased models (aligned with target function), lower effective dimension is beneficial during training.

Conclusion: Findings demonstrate explicit interplay between geometrical properties, model-task alignment and training, providing insights relevant for broader machine learning community beyond quantum applications.

Abstract: Motivated by the growing interest in quantum machine learning, in particular quantum neural networks (QNNs), we study how recently introduced evaluation metrics based on the Fisher information matrix (FIM) are effective for predicting their training and prediction performance. We exploit the equivalence between a broad class of QNNs and Fourier models, and study the interplay between the \emph{effective dimension} and the \emph{bias} of a model towards a given task, investigating how these affect the model’s training and performance. We show that for a model that is completely agnostic, or unbiased, towards the function to be learned, a higher effective dimension likely results in a better trainability and performance. On the other hand, for models that are biased towards the function to be learned a lower effective dimension is likely beneficial during training. To obtain these results, we derive an analytical expression of the FIM for Fourier models and identify the features controlling a model’s effective dimension. This allows us to construct models with tunable effective dimension and bias, and to compare their training. We furthermore introduce a tensor network representation of the considered Fourier models, which could be a tool of independent interest for the analysis of QNN models. Overall, these findings provide an explicit example of the interplay between geometrical properties, model-task alignment and training, which are relevant for the broader machine learning community.

[436] Grouped Differential Attention

Junghwan Lim, Sungmin Lee, Dongseok Kim, Wai Ting Cheung, Beomgyu Kim, Taehwan Kim, Haesol Lee, Junhyeok Lee, Dongpin Oh, Eunhwan Park

Main category: cs.LG

TL;DR: Grouped Differential Attention (GDA) improves Transformer efficiency by using unbalanced head allocation between signal-preserving and noise-control groups, achieving better signal focus with minimal computational overhead.

Details

Motivation: Self-attention mechanisms often waste attention on redundant or noisy context, and existing solutions like Differential Attention impose rigid constraints on flexibility and scalability.

Method: GDA introduces unbalanced head allocation with more heads for signal extraction and fewer for noise-control, stabilized through controlled repetition. It also uses group-differentiated growth for selective expansion of signal-focused heads.

Result: Large-scale experiments show moderate imbalance ratios in GDA yield substantial improvements in generalization and stability compared to symmetric baselines.

Conclusion: Ratio-aware head allocation and selective expansion provide an effective path for designing scalable, computation-efficient Transformer architectures.

Abstract: The self-attention mechanism, while foundational to modern Transformer architectures, suffers from a critical inefficiency: it frequently allocates substantial attention to redundant or noisy context. Differential Attention addressed this by using subtractive attention maps for signal and noise, but its required balanced head allocation imposes rigid constraints on representational flexibility and scalability. To overcome this, we propose Grouped Differential Attention (GDA), a novel approach that introduces unbalanced head allocation between signal-preserving and noise-control groups. GDA significantly enhances signal focus by strategically assigning more heads to signal extraction and fewer to noise-control, stabilizing the latter through controlled repetition (akin to GQA). This design achieves stronger signal fidelity with minimal computational overhead. We further extend this principle to group-differentiated growth, a scalable strategy that selectively replicates only the signal-focused heads, thereby ensuring efficient capacity expansion. Through large-scale pretraining and continual training experiments, we demonstrate that moderate imbalance ratios in GDA yield substantial improvements in generalization and stability compared to symmetric baselines. Our results collectively establish that ratio-aware head allocation and selective expansion offer an effective and practical path toward designing scalable, computation-efficient Transformer architectures.

[437] From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics

Zheng-An Chen, Tao Luo

Main category: cs.LG

TL;DR: This paper analyzes transformer training dynamics using gradient flow theory, revealing a two-stage process: first, asymmetric weight perturbations enable escape from small initialization, then key-query matrices drive rank collapse.

Details

Motivation: To understand fundamental principles of transformer training dynamics beyond configuration-specific studies, inspired by empirical evidence of improved reasoning under small initialization scales.

Method: Employ gradient flow analytical framework to systematically investigate linearized Transformer training dynamics, dissecting attention module dynamics into two distinct stages.

Result: First stage: asymmetric weight perturbations sustain gradient dynamics enabling escape from small initialization. Second stage: key-query matrices become active, driving normalized matrices toward asymptotic rank collapse.

Conclusion: The two-stage framework generalizes classical directional convergence results and provides systematic understanding of transformer training dynamics.

Abstract: Although transformer-based models have shown exceptional empirical performance, the fundamental principles governing their training dynamics are inadequately characterized beyond configuration-specific studies. Inspired by empirical evidence showing improved reasoning capabilities under small initialization scales in language models, we employ the gradient flow analytical framework established in [Zhou et al. NeurIPS 2022] to systematically investigate linearized Transformer training dynamics. Our theoretical analysis dissects the dynamics of attention modules into two distinct stages. In the first stage, asymmetric weight perturbations from random initialization sustain non-degenerate gradient dynamics in parameter matrices, facilitating systematic escape from small initialization regimes. Subsequently, these matrices undergo condensation, progressively aligning toward the target orientation. In the second stage, the previously static key-query matrices actively participate in training, driving the normalized matrices toward asymptotic rank collapse. This two-stage framework generalizes classical directional convergence results.

[438] High-Rate Mixout: Revisiting Mixout for Robust Domain Generalization

Masih Aminbeidokhti, Heitor Rapela Medeiros, Eric Granger, Marco Pedersoli

Main category: cs.LG

TL;DR: High-rate Mixout is proposed as a lightweight alternative to ensembling for domain generalization, using high masking probabilities (0.9 for ViTs, 0.8 for ResNets) to swap fine-tuned weights with pre-trained counterparts, achieving comparable performance to ensembles with significantly reduced computational costs.

Details

Motivation: Ensembling fine-tuned models improves robustness under distribution shifts but is computationally expensive. Dropout offers a lightweight alternative but tends to over-regularize and disrupt critical representations in pre-trained models.

Method: Mixout is a stochastic regularization technique that probabilistically swaps fine-tuned weights with pre-trained counterparts during training. The study uses high masking probabilities (0.9 for ViTs, 0.8 for ResNets) to balance adaptation and retention of prior knowledge.

Result: High-rate Mixout achieves out-of-domain accuracy comparable to ensemble-based methods across five benchmarks (PACS, VLCS, OfficeHome, TerraIncognita, DomainNet) while reducing gradient computation by up to 45% and gradient memory usage by up to 90%.

Conclusion: High-rate Mixout provides an effective and computationally efficient alternative to ensembling for domain generalization, maintaining strong performance while significantly reducing training costs through high masking probabilities.

Abstract: Ensembling fine-tuned models initialized from powerful pre-trained weights is a common strategy to improve robustness under distribution shifts, but it comes with substantial computational costs due to the need to train and store multiple models. Dropout offers a lightweight alternative by simulating ensembles through random neuron deactivation; however, when applied to pre-trained models, it tends to over-regularize and disrupt critical representations necessary for generalization. In this work, we investigate Mixout, a stochastic regularization technique that provides an alternative to Dropout for domain generalization. Rather than deactivating neurons, Mixout mitigates overfitting by probabilistically swapping a subset of fine-tuned weights with their pre-trained counterparts during training, thereby maintaining a balance between adaptation and retention of prior knowledge. Our study reveals that achieving strong performance with Mixout on domain generalization benchmarks requires a notably high masking probability of 0.9 for ViTs and 0.8 for ResNets. While this may seem like a simple adjustment, it yields two key advantages for domain generalization: (1) higher masking rates more strongly penalize deviations from the pre-trained parameters, promoting better generalization to unseen domains; and (2) high-rate masking substantially reduces computational overhead, cutting gradient computation by up to 45% and gradient memory usage by up to 90%. Experiments across five domain generalization benchmarks, PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet, using ResNet and ViT architectures, show that our approach, High-rate Mixout, achieves out-of-domain accuracy comparable to ensemble-based methods while significantly reducing training costs.

[439] Federated Unlearning in the Wild: Rethinking Fairness and Data Discrepancy

ZiHeng Huang, Di Wu, Jun Bai, Jiale Zhang, Sicong Cao, Ji Zhang, Yingjie Hu

Main category: cs.LG

TL;DR: The paper addresses fairness issues in Federated Unlearning (FU) and proposes FedCCCU, a fairness-aware approach that outperforms existing methods under realistic data heterogeneity conditions.

Details

Motivation: Current Federated Unlearning methods overlook fairness and rely on unrealistic synthetic data assumptions, limiting their real-world applicability and potentially unfairly impacting clients with retained data.

Method: Proposes Federated Cross-Client-Constrains Unlearning (FedCCCU), a fairness-aware approach that explicitly addresses both fairness concerns and realistic data heterogeneity through cross-client constraints.

Result: Experimental results show existing FU methods perform poorly under realistic settings, while FedCCCU consistently outperforms them in both fairness and effectiveness.

Conclusion: FedCCCU provides a practical and scalable solution for real-world Federated Unlearning that addresses fairness concerns and works effectively under realistic data heterogeneity conditions.

Abstract: Machine unlearning is critical for enforcing data deletion rights like the “right to be forgotten.” As a decentralized paradigm, Federated Learning (FL) also requires unlearning, but realistic implementations face two major challenges. First, fairness in Federated Unlearning (FU) is often overlooked. Exact unlearning methods typically force all clients into costly retraining, even those uninvolved. Approximate approaches, using gradient ascent or distillation, make coarse interventions that can unfairly degrade performance for clients with only retained data. Second, most FU evaluations rely on synthetic data assumptions (IID/non-IID) that ignore real-world heterogeneity. These unrealistic benchmarks obscure the true impact of unlearning and limit the applicability of current methods. We first conduct a comprehensive benchmark of existing FU methods under realistic data heterogeneity and fairness conditions. We then propose a novel, fairness-aware FU approach, Federated Cross-Client-Constrains Unlearning (FedCCCU), to explicitly address both challenges. FedCCCU offers a practical and scalable solution for real-world FU. Experimental results show that existing methods perform poorly in realistic settings, while our approach consistently outperforms them.

[440] Revisiting Mixout: An Overlooked Path to Robust Finetuning

Masih Aminbeidokhti, Heitor Rapela Medeiros, Eric Granger, Marco Pedersoli

Main category: cs.LG

TL;DR: GMixout is an improved stochastic regularization method that replaces finetuned weights with adaptive moving-average snapshots to enhance robustness under distribution shift while maintaining in-domain accuracy.

Details

Motivation: Finetuning vision foundation models improves in-domain accuracy but reduces robustness under distribution shift. The paper aims to address this trade-off by enhancing Mixout regularization.

Method: GMixout replaces fixed anchors with exponential moving-average snapshots that adapt during training, regulates masking period via resampling-frequency hyperparameter, and uses sparse-kernel implementation for efficiency.

Result: GMixout consistently improves in-domain accuracy beyond zero-shot performance and surpasses Model Soups and parameter-efficient finetuning baselines under distribution shift across multiple benchmarks.

Conclusion: GMixout provides an effective approach to maintain robustness while improving in-domain accuracy through adaptive stochastic regularization with minimal computational overhead.

Abstract: Finetuning vision foundation models often improves in-domain accuracy but comes at the cost of robustness under distribution shift. We revisit Mixout, a stochastic regularizer that intermittently replaces finetuned weights with their pretrained reference, through the lens of a single-run, weight-sharing implicit ensemble. This perspective reveals three key levers that govern robustness: the \emph{masking anchor}, \emph{resampling frequency}, and \emph{mask sparsity}. Guided by this analysis, we introduce GMixout, which (i) replaces the fixed anchor with an exponential moving-average snapshot that adapts during training, and (ii) regulates masking period via an explicit resampling-frequency hyperparameter. Our sparse-kernel implementation updates only a small fraction of parameters with no inference-time overhead, enabling training on consumer-grade GPUs. Experiments on benchmarks covering covariate shift, corruption, and class imbalance, ImageNet / ImageNet-LT, DomainNet, iWildCam, and CIFAR100-C, GMixout consistently improves in-domain accuracy beyond zero-shot performance while surpassing both Model Soups and strong parameter-efficient finetuning baselines under distribution shift.

[441] Unified Molecule Pre-training with Flexible 2D and 3D Modalities: Single and Paired Modality Integration

Tengwei Song, Min Wu, Yuan Fang

Main category: cs.LG

TL;DR: FlexMol is a flexible molecular pre-training framework that learns unified representations from 2D and 3D molecular data, supporting single-modality input when one modality is unavailable.

Details

Motivation: Existing methods require paired 2D and 3D molecular data for training, which limits their applicability when certain modalities are unavailable or computationally expensive to generate.

Method: Uses separate models for 2D and 3D data with parameter sharing, employs a decoder to generate features for missing modalities, and implements a multistage continuous learning process.

Result: Achieves superior performance across molecular property prediction tasks and demonstrates effectiveness with incomplete data.

Conclusion: FlexMol provides a robust framework for molecular representation learning that works effectively even when only single modalities are available during inference.

Abstract: Molecular representation learning plays a crucial role in advancing applications such as drug discovery and material design. Existing work leverages 2D and 3D modalities of molecular information for pre-training, aiming to capture comprehensive structural and geometric insights. However, these methods require paired 2D and 3D molecular data to train the model effectively and prevent it from collapsing into a single modality, posing limitations in scenarios where a certain modality is unavailable or computationally expensive to generate. To overcome this limitation, we propose FlexMol, a flexible molecule pre-training framework that learns unified molecular representations while supporting single-modality input. Specifically, inspired by the unified structure in vision-language models, our approach employs separate models for 2D and 3D molecular data, leverages parameter sharing to improve computational efficiency, and utilizes a decoder to generate features for the missing modality. This enables a multistage continuous learning process where both modalities contribute collaboratively during training, while ensuring robustness when only one modality is available during inference. Extensive experiments demonstrate that FlexMol achieves superior performance across a wide range of molecular property prediction tasks, and we also empirically demonstrate its effectiveness with incomplete data. Our code and data are available at https://github.com/tewiSong/FlexMol.

[442] Spiral Model Technique For Data Science & Machine Learning Lifecycle

Rohith Mahadevan

Main category: cs.LG

TL;DR: Introduces a spiral technique for data science lifecycles in business, emphasizing versatility, agility, and iterative approaches for projects with clear end goals.

Details

Motivation: Traditional data science lifecycles are often linear or cyclical, but may not adequately address business problems with clear objectives. Companies need more adaptable approaches to improve productivity and competitiveness.

Method: Proposes a new spiral technique that incorporates iterative, agile approaches to data science lifecycles for business applications.

Result: The spiral technique offers a more versatile and agile framework compared to traditional linear or cyclical models.

Conclusion: The spiral data science lifecycle technique provides an improved approach for businesses to handle data-dependent projects with clear end goals through iterative and adaptable processes.

Abstract: Analytics play an important role in modern business. Companies adapt data science lifecycles to their culture to seek productivity and improve their competitiveness among others. Data science lifecycles are fairly an important contributing factor to start and end a project that are data dependent. Data science and Machine learning life cycles comprises of series of steps that are involved in a project. A typical life cycle states that it is a linear or cyclical model that revolves around. It is mostly depicted that it is possible in a traditional data science life cycle to start the process again after reaching the end of cycle. This paper suggests a new technique to incorporate data science life cycle to business problems that have a clear end goal. A new technique called spiral technique is introduced to emphasize versatility, agility and iterative approach to business processes.

[443] Introspection in Learned Semantic Scene Graph Localisation

Manshika Charvi Bissessur, Efimia Panagiotaki, Daniele De Martini

Main category: cs.LG

TL;DR: The paper investigates how semantics affect localization performance in self-supervised contrastive semantic localization, showing models learn noise-robust semantic relationships for explainable registration.

Details

Motivation: To understand how semantics influence localization performance and robustness, and whether models filter environmental noise while prioritizing distinctive landmarks over routine clutter.

Method: Train localization network on original and perturbed maps, conduct post-hoc introspection analysis using interpretability methods (integrated gradients, attention weights), and perform semantic class ablation.

Result: Integrated gradients and attention weights are most reliable probes; models implicitly down-weight frequent objects and learn noise-robust semantically salient relationships.

Conclusion: The model learns noise-robust, semantically salient relations for place definition, enabling explainable registration under challenging visual and structural variations.

Abstract: This work investigates how semantics influence localisation performance and robustness in a learned self-supervised, contrastive semantic localisation framework. After training a localisation network on both original and perturbed maps, we conduct a thorough post-hoc introspection analysis to probe whether the model filters environmental noise and prioritises distinctive landmarks over routine clutter. We validate various interpretability methods and present a comparative reliability analysis. Integrated gradients and Attention Weights consistently emerge as the most reliable probes of learned behaviour. A semantic class ablation further reveals an implicit weighting in which frequent objects are often down-weighted. Overall, the results indicate that the model learns noise-robust, semantically salient relations about place definition, thereby enabling explainable registration under challenging visual and structural variations.

[444] Sharpness-Aware Data Generation for Zero-shot Quantization

Dung Hoang-Anh, Cuong Pham Trung Le, Jianfei Cai, Thanh-Toan Do

Main category: cs.LG

TL;DR: This paper proposes a novel zero-shot quantization method that considers model sharpness during synthetic data generation to improve generalization, using gradient matching between reconstruction loss gradients.

Details

Motivation: Previous zero-shot quantization approaches don't consider the sharpness of quantized models, even though low sharpness is known to improve generalization ability in deep neural networks.

Method: The method minimizes sharpness by maximizing gradient matching between reconstruction loss gradients on synthetic and real validation data, approximated through gradient matching between generated samples and their neighbors when real data is unavailable.

Result: Experimental evaluations on CIFAR-100 and ImageNet datasets show the proposed method outperforms state-of-the-art techniques in low-bit quantization settings.

Conclusion: Considering quantized model sharpness in synthetic data generation enhances generalization performance in zero-shot quantization.

Abstract: Zero-shot quantization aims to learn a quantized model from a pre-trained full-precision model with no access to original real training data. The common idea in zero-shot quantization approaches is to generate synthetic data for quantizing the full-precision model. While it is well-known that deep neural networks with low sharpness have better generalization ability, none of the previous zero-shot quantization works considers the sharpness of the quantized model as a criterion for generating training data. This paper introduces a novel methodology that takes into account quantized model sharpness in synthetic data generation to enhance generalization. Specifically, we first demonstrate that sharpness minimization can be attained by maximizing gradient matching between the reconstruction loss gradients computed on synthetic and real validation data, under certain assumptions. We then circumvent the problem of the gradient matching without real validation set by approximating it with the gradient matching between each generated sample and its neighbors. Experimental evaluations on CIFAR-100 and ImageNet datasets demonstrate the superiority of the proposed method over the state-of-the-art techniques in low-bit quantization settings.

[445] COMPASS: A Multi-Turn Benchmark for Tool-Mediated Planning & Preference Optimization

Tian Qin, Felix Bai, Ting-Yao Hu, Raviteja Vemulapalli, Hema Swetha Koppula, Zhiyang Xu, Bowen Jin, Mert Cemri, Jiarui Lu, Zirui Wang, Meng Cao

Main category: cs.LG

TL;DR: COMPASS is a benchmark for evaluating LLM agents on realistic travel planning tasks, focusing on constrained preference optimization where agents must satisfy hard constraints while optimizing soft user preferences.

Details

Motivation: Real-world LLM agents need to master strategic tool use and user preference optimization through multi-turn interactions for complex planning tasks like travel planning.

Method: Built a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem mirroring commercial booking platforms.

Result: Identified two critical gaps: (i) acceptable-optimal gap where agents meet constraints but fail to optimize preferences, and (ii) plan-coordination gap where performance collapses on multi-service coordination tasks, especially for open-source models.

Conclusion: COMPASS provides a benchmark that directly measures agents’ ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.

Abstract: Real-world large language model (LLM) agents must master strategic tool use and user preference optimization through multi-turn interactions to assist users with complex planning tasks. We introduce COMPASS (Constrained Optimization through Multi-turn Planning and Strategic Solutions), a benchmark that evaluates agents on realistic travel-planning scenarios. We cast travel planning as a constrained preference optimization problem, where agents must satisfy hard constraints while simultaneously optimizing soft user preferences. To support this, we build a realistic travel database covering transportation, accommodation, and ticketing for 20 U.S. National Parks, along with a comprehensive tool ecosystem that mirrors commercial booking platforms. Evaluating state-of-the-art models, we uncover two critical gaps: (i) an acceptable-optimal gap, where agents reliably meet constraints but fail to optimize preferences, and (ii) a plan-coordination gap, where performance collapses on multi-service (flight and hotel) coordination tasks, especially for open-source models. By grounding reasoning and planning in a practical, user-facing domain, COMPASS provides a benchmark that directly measures an agent’s ability to optimize user preferences in realistic tasks, bridging theoretical advances with real-world impact.

[446] HTMformer: Hybrid Time and Multivariate Transformer for Time Series Forecasting

Tan Wang, Yun Wei Dong, Tao Zhang, Qi Wang

Main category: cs.LG

TL;DR: HTMformer introduces Hybrid Temporal and Multivariate Embeddings (HTME) to enhance Transformer-based time series forecasting by extracting richer multidimensional features, achieving better accuracy and efficiency than existing methods.

Details

Motivation: Existing Transformers overemphasize temporal dependencies in time series forecasting, incurring computational overhead without proportional performance gains. The performance heavily depends on embedding methods for effective sequence representations.

Method: Proposed HTME extractor integrates lightweight temporal feature extraction with multivariate feature extraction to create multidimensional embeddings. Combined with Transformer architecture to form HTMformer, a lightweight forecaster.

Result: Experiments on eight real-world datasets show HTMformer outperforms existing baselines in both accuracy and efficiency.

Conclusion: HTME provides richer sequence representations that enable Transformers to better understand time series, achieving optimal balance between model complexity and performance.

Abstract: Transformer-based methods have achieved impressive results in time series forecasting. However, existing Transformers still exhibit limitations in sequence modeling as they tend to overemphasize temporal dependencies. This incurs additional computational overhead without yielding corresponding performance gains. We find that the performance of Transformers is highly dependent on the embedding method used to learn effective representations. To address this issue, we extract multivariate features to augment the effective information captured in the embedding layer, yielding multidimensional embeddings that convey richer and more meaningful sequence representations. These representations enable Transformer-based forecasters to better understand the series. Specifically, we introduce Hybrid Temporal and Multivariate Embeddings (HTME). The HTME extractor integrates a lightweight temporal feature extraction module with a carefully designed multivariate feature extraction module to provide complementary features, thereby achieving a balance between model complexity and performance. By combining HTME with the Transformer architecture, we present HTMformer, leveraging the enhanced feature extraction capability of the HTME extractor to build a lightweight forecaster. Experiments conducted on eight real-world datasets demonstrate that our approach outperforms existing baselines in both accuracy and efficiency.

[447] Enhancing Speech Emotion Recognition via Fine-Tuning Pre-Trained Models and Hyper-Parameter Optimisation

Aryan Golbaghi, Shuo Zhou

Main category: cs.LG

TL;DR: A workflow combining pre-trained speech representations with automated hyperparameter optimization achieves efficient speech emotion recognition on commodity CPUs, outperforming traditional methods and enabling cross-lingual generalization.

Details

Motivation: To develop an efficient speech emotion recognition workflow that can run on commodity hardware while achieving competitive performance through automated hyperparameter optimization.

Method: Used SpeechBrain wav2vec2-base model fine-tuned on IEMOCAP as encoder, compared Gaussian Process Bayesian Optimization (GP-BO) and Tree-structured Parzen Estimators (TPE) under identical 4D search space with 15-trial budget, using balanced class accuracy on German EmoDB as objective.

Result: GP-BO achieved 0.96 BCA in 11 minutes, TPE achieved 0.97 BCA in 15 minutes, while grid search required 143 trials and 1,680 minutes to exceed 0.9 BCA. Cross-lingual generalization improved zero-shot accuracy by 0.25 on CREMA-D and 0.26 on RAVDESS.

Conclusion: Efficient hyperparameter optimization with pre-trained encoders delivers competitive speech emotion recognition performance on commodity CPUs, significantly outperforming traditional methods and enabling effective cross-lingual generalization.

Abstract: We propose a workflow for speech emotion recognition (SER) that combines pre-trained representations with automated hyperparameter optimisation (HPO). Using SpeechBrain wav2vec2-base model fine-tuned on IEMOCAP as the encoder, we compare two HPO strategies, Gaussian Process Bayesian Optimisation (GP-BO) and Tree-structured Parzen Estimators (TPE), under an identical four-dimensional search space and 15-trial budget, with balanced class accuracy (BCA) on the German EmoDB corpus as the objective. All experiments run on 8 CPU cores with 32 GB RAM. GP-BO achieves 0.96 BCA in 11 minutes, and TPE (Hyperopt implementation) attains 0.97 in 15 minutes. In contrast, grid search requires 143 trials and 1,680 minutes to exceed 0.9 BCA, and the best AutoSpeech 2020 baseline reports only 0.85 in 30 minutes on GPU. For cross-lingual generalisation, an EmoDB-trained HPO-tuned model improves zero-shot accuracy by 0.25 on CREMA-D and 0.26 on RAVDESS. Results show that efficient HPO with pre-trained encoders delivers competitive SER on commodity CPUs. Source code to this work is available at: https://github.com/youngaryan/speechbrain-emotion-hpo.

[448] Generative World Modelling for Humanoids: 1X World Model Challenge Technical Report

Riccardo Mereu, Aidan Scannell, Yuxin Hou, Yi Zhao, Aditya Jitta, Antonio Dominguez, Luigi Acerbi, Amos Storkey, Paul Chang

Main category: cs.LG

TL;DR: The paper presents world models for real-world humanoid interaction, achieving 1st place in both sampling and compression tracks of the 1X World Model Challenge.

Details

Motivation: To develop effective world models that can reason about the future by predicting visual observations or compact latent states in real-world humanoid interaction scenarios.

Method: For sampling track: adapted Wan-2.2 TI2V-5B video generation model with AdaLN-Zero conditioning on robot states and LoRA post-training. For compression track: trained Spatio-Temporal Transformer model from scratch.

Result: Achieved 23.0 dB PSNR in sampling task and Top-500 CE of 6.6386 in compression task, securing 1st place in both challenges.

Conclusion: The proposed approaches successfully address both sampling and compression tasks in the world model benchmark, demonstrating state-of-the-art performance in real-world humanoid interaction prediction.

Abstract: World models are a powerful paradigm in AI and robotics, enabling agents to reason about the future by predicting visual observations or compact latent states. The 1X World Model Challenge introduces an open-source benchmark of real-world humanoid interaction, with two complementary tracks: sampling, focused on forecasting future image frames, and compression, focused on predicting future discrete latent codes. For the sampling track, we adapt the video generation foundation model Wan-2.2 TI2V-5B to video-state-conditioned future frame prediction. We condition the video generation on robot states using AdaLN-Zero, and further post-train the model using LoRA. For the compression track, we train a Spatio-Temporal Transformer model from scratch. Our models achieve 23.0 dB PSNR in the sampling task and a Top-500 CE of 6.6386 in the compression task, securing 1st place in both challenges.

Zheng Xing, Junting Chen

Main category: cs.LG

TL;DR: Unsupervised angular power map construction using massive MIMO CSI data without location labels, enabling mobile localization through HMM modeling of trajectory and CSI evolution.

Details

Motivation: Conventional radio map approaches require location-labeled CSI data, which is challenging to obtain in practice for massive MIMO networks.

Method: Build hidden Markov model (HMM) to connect mobile trajectory with CSI evolution, enabling location estimation from large timescale CSI data without location labels.

Result: Under uniform rectilinear mobility with Poisson-distributed BSs, CRLB for localization error can vanish at any SNR; with BSs in limited region, error remains nonzero. Real network testing achieved 18m average localization error using mainly single serving cell RSRP data.

Conclusion: Unsupervised angular power map construction is feasible using massive MIMO CSI data without location labels, with practical localization performance demonstrated in real network conditions.

Abstract: Channel state information (CSI) acquisition is a challenging problem in massive multiple-input multiple-output (MIMO) networks. Radio maps provide a promising solution for radio resource management by reducing online CSI acquisition. However, conventional approaches for radio map construction require location-labeled CSI data, which is challenging in practice. This paper investigates unsupervised angular power map construction based on large timescale CSI data collected in a massive MIMO network without location labels. A hidden Markov model (HMM) is built to connect the hidden trajectory of a mobile with the CSI evolution of a massive MIMO channel. As a result, the mobile location can be estimated, enabling the construction of an angular power map. We show that under uniform rectilinear mobility with Poisson-distributed base stations (BSs), the Cramer-Rao Lower Bound (CRLB) for localization error can vanish at any signal-to-noise ratios (SNRs), whereas when BSs are confined to a limited region, the error remains nonzero even with infinite independent measurements. Based on reference signal received power (RSRP) data collected in a real multi-cell massive MIMO network, an average localization error of 18 meters can be achieved although measurements are mainly obtained from a single serving cell.

[450] Non-Stationary Online Structured Prediction with Surrogate Losses

Shinsaku Sakaue, Han Bao, Yuzhou Cao

Main category: cs.LG

TL;DR: The paper addresses online structured prediction in non-stationary environments by proving a bound on cumulative target loss that depends on comparator’s surrogate loss and path length, rather than time horizon T.

Details

Motivation: Existing surrogate regret bounds that are independent of time horizon T break down in non-stationary environments where fixed estimators may incur linear growth in surrogate loss with T.

Method: Synthesizes dynamic regret bound of online gradient descent (OGD) with surrogate gap exploitation technique, and introduces a new Polyak-style learning rate for OGD.

Result: Proves a bound of form F_T + C(1 + P_T) on cumulative target loss, where F_T is comparator’s cumulative surrogate loss and P_T is its path length, with tight dependence on these parameters.

Conclusion: The approach provides stronger guarantees in non-stationary environments and extends to broader problems via convolutional Fenchel-Young loss, with proven tightness of the dependence on F_T and P_T.

Abstract: Online structured prediction, including online classification as a special case, is the task of sequentially predicting labels from input features. Therein the surrogate regret – the cumulative excess of the target loss (e.g., 0-1 loss) over the surrogate loss (e.g., logistic loss) of the fixed best estimator – has gained attention, particularly because it often admits a finite bound independent of the time horizon $T$. However, such guarantees break down in non-stationary environments, where every fixed estimator may incur the surrogate loss growing linearly with $T$. We address this by proving a bound of the form $F_T + C(1 + P_T)$ on the cumulative target loss, where $F_T$ is the cumulative surrogate loss of any comparator sequence, $P_T$ is its path length, and $C > 0$ is some constant. This bound depends on $T$ only through $F_T$ and $P_T$, often yielding much stronger guarantees in non-stationary environments. Our core idea is to synthesize the dynamic regret bound of the online gradient descent (OGD) with the technique of exploiting the surrogate gap. Our analysis also sheds light on a new Polyak-style learning rate for OGD, which systematically offers target-loss guarantees and exhibits promising empirical performance. We further extend our approach to a broader class of problems via the convolutional Fenchel–Young loss. Finally, we prove a lower bound showing that the dependence on $F_T$ and $P_T$ is tight.

[451] Non-Asymptotic Analysis of Efficiency in Conformalized Regression

Yunzhen Yao, Lie He, Michael Gastpar

Main category: cs.LG

TL;DR: Establishes non-asymptotic bounds on prediction set length deviation for conformalized quantile and median regression, capturing joint dependence on training size, calibration size, and miscoverage level.

Details

Motivation: Prior work treats miscoverage level as fixed constant, but efficiency of conformal prediction depends on expected prediction set size, requiring better understanding of how it scales with different parameters.

Method: Analyze conformalized quantile and median regression trained via SGD under mild data distribution assumptions, deriving bounds on deviation from oracle interval length.

Result: Obtain bounds of order O(1/√n + 1/(α²n) + 1/√m + exp(-α²m)) that capture phase transitions in convergence rates across different α regimes.

Conclusion: Results provide guidance for data allocation to control excess prediction set length, with empirical validation supporting theoretical findings.

Abstract: Conformal prediction provides prediction sets with coverage guarantees. The informativeness of conformal prediction depends on its efficiency, typically quantified by the expected size of the prediction set. Prior work on the efficiency of conformalized regression commonly treats the miscoverage level $\alpha$ as a fixed constant. In this work, we establish non-asymptotic bounds on the deviation of the prediction set length from the oracle interval length for conformalized quantile and median regression trained via SGD, under mild assumptions on the data distribution. Our bounds of order $\mathcal{O}(1/\sqrt{n} + 1/(\alpha^2 n) + 1/\sqrt{m} + \exp(-\alpha^2 m))$ capture the joint dependence of efficiency on the proper training set size $n$, the calibration set size $m$, and the miscoverage level $\alpha$. The results identify phase transitions in convergence rates across different regimes of $\alpha$, offering guidance for allocating data to control excess prediction set length. Empirical results are consistent with our theoretical findings.

[452] ELMUR: External Layer Memory with Update/Rewrite for Long-Horizon RL

Egor Cherepanov, Alexey K. Kovalev, Aleksandr I. Panov

Main category: cs.LG

TL;DR: ELMUR is a transformer architecture with structured external memory that enables long-term dependency handling in partially observable environments, achieving significant performance improvements over baselines.

Details

Motivation: Real-world robotic agents need to handle partial observability and long horizons, but current approaches struggle with retaining and leveraging long-term dependencies due to context window limitations and memory scaling issues.

Method: ELMUR uses a transformer with structured external memory where each layer maintains memory embeddings, interacts via bidirectional cross-attention, and updates through an LRU memory module using replacement or convex blending.

Result: ELMUR extends effective horizons up to 100,000 times beyond attention windows, achieves 100% success on synthetic T-Maze with million-step corridors, outperforms baselines on most POPGym tasks, and nearly doubles performance on MIKASA-Robo sparse-reward manipulation tasks.

Conclusion: Structured, layer-local external memory provides a simple and scalable approach for decision making under partial observability.

Abstract: Real-world robotic agents must act under partial observability and long horizons, where key cues may appear long before they affect decision making. However, most modern approaches rely solely on instantaneous information, without incorporating insights from the past. Standard recurrent or transformer models struggle with retaining and leveraging long-term dependencies: context windows truncate history, while naive memory extensions fail under scale and sparsity. We propose ELMUR (External Layer Memory with Update/Rewrite), a transformer architecture with structured external memory. Each layer maintains memory embeddings, interacts with them via bidirectional cross-attention, and updates them through an Least Recently Used (LRU) memory module using replacement or convex blending. ELMUR extends effective horizons up to 100,000 times beyond the attention window and achieves a 100% success rate on a synthetic T-Maze task with corridors up to one million steps. In POPGym, it outperforms baselines on more than half of the tasks. On MIKASA-Robo sparse-reward manipulation tasks with visual observations, it nearly doubles the performance of strong baselines. These results demonstrate that structured, layer-local external memory offers a simple and scalable approach to decision making under partial observability.

[453] DPMM-CFL: Clustered Federated Learning via Dirichlet Process Mixture Model Nonparametric Clustering

Mariona Jaramillo-Civill, Peng Wu, Pau Closas

Main category: cs.LG

TL;DR: DPMM-CFL is a clustered federated learning method that automatically determines the number of clusters using Dirichlet Process priors, eliminating the need to pre-specify cluster count.

Details

Motivation: Most clustered federated learning methods require fixing the number of clusters beforehand, which is impractical when the underlying data structure is unknown.

Method: Uses Dirichlet Process prior over cluster parameters to enable nonparametric Bayesian inference, jointly inferring cluster count and client assignments while optimizing federated objectives.

Result: Validated on benchmark datasets under Dirichlet and class-split non-IID partitions, showing effective automatic cluster discovery.

Conclusion: DPMM-CFL successfully addresses the limitation of pre-specified cluster counts in CFL by providing a principled approach to automatically determine the optimal number of clusters.

Abstract: Clustered Federated Learning (CFL) improves performance under non-IID client heterogeneity by clustering clients and training one model per cluster, thereby balancing between a global model and fully personalized models. However, most CFL methods require the number of clusters K to be fixed a priori, which is impractical when the latent structure is unknown. We propose DPMM-CFL, a CFL algorithm that places a Dirichlet Process (DP) prior over the distribution of cluster parameters. This enables nonparametric Bayesian inference to jointly infer both the number of clusters and client assignments, while optimizing per-cluster federated objectives. This results in a method where, at each round, federated updates and cluster inferences are coupled, as presented in this paper. The algorithm is validated on benchmark datasets under Dirichlet and class-split non-IID partitions.

[454] Bridged Clustering for Representation Learning: Semi-Supervised Sparse Bridging

Patrick Peixuan Ye, Chen Shani, Ellen Vitercik

Main category: cs.LG

TL;DR: Bridged Clustering is a semi-supervised framework that learns predictors from unpaired X and Y datasets by clustering them independently and learning sparse bridges between clusters using minimal paired examples.

Details

Motivation: To leverage output-only data in semi-supervised learning while maintaining interpretability and sparsity, unlike traditional SSL methods and dense transport-based approaches.

Method: Clusters input X and output Y independently, learns sparse interpretable bridges between clusters using few paired examples, and predicts by assigning new inputs to nearest input clusters and returning linked output cluster centroids.

Result: Theoretical analysis shows effectiveness with bounded mis-clustering and mis-bridging rates. Empirical results demonstrate competitiveness with SOTA methods while being simple, model-agnostic, and highly label-efficient in low-supervision settings.

Conclusion: Bridged Clustering provides an effective, interpretable, and label-efficient semi-supervised learning approach that explicitly leverages output-only data through sparse cluster alignments.

Abstract: We introduce Bridged Clustering, a semi-supervised framework to learn predictors from any unpaired input $X$ and output $Y$ dataset. Our method first clusters $X$ and $Y$ independently, then learns a sparse, interpretable bridge between clusters using only a few paired examples. At inference, a new input $x$ is assigned to its nearest input cluster, and the centroid of the linked output cluster is returned as the prediction $\hat{y}$. Unlike traditional SSL, Bridged Clustering explicitly leverages output-only data, and unlike dense transport-based methods, it maintains a sparse and interpretable alignment. Through theoretical analysis, we show that with bounded mis-clustering and mis-bridging rates, our algorithm becomes an effective and efficient predictor. Empirically, our method is competitive with SOTA methods while remaining simple, model-agnostic, and highly label-efficient in low-supervision settings.

[455] Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples

Alexandra Souly, Javier Rando, Ed Chapman, Xander Davies, Burak Hasircioglu, Ezzeldin Shereen, Carlos Mougan, Vasilios Mavroudis, Erik Jones, Chris Hicks, Nicholas Carlini, Yarin Gal, Robert Kirk

Main category: cs.LG

TL;DR: Poisoning attacks on LLMs require only a near-constant number of malicious documents (around 250) regardless of model or dataset size, making backdoor injection easier than previously believed.

Details

Motivation: Existing work assumed poisoning attacks require controlling a percentage of training data, but for large models this translates to impractically large amounts of data. This work challenges that assumption by showing poisoning effectiveness doesn't scale with dataset size.

Method: Conducted largest pretraining poisoning experiments to date, training models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). Also ran smaller-scale experiments to test factors like poisoned-to-clean data ratios and non-random poison distributions.

Result: 250 poisoned documents similarly compromised models across all sizes, despite largest models training on 20x more clean data. Same dynamics observed for fine-tuning poisoning.

Conclusion: Backdoor injection through data poisoning may be easier for large models than previously thought, as required poison count doesn’t scale with model size, highlighting urgent need for better defenses.

Abstract: Poisoning attacks can compromise the safety of large language models (LLMs) by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.

[456] An in-depth look at approximation via deep and narrow neural networks

Joris Dommel, Sven A. Wegner

Main category: cs.LG

TL;DR: This paper investigates approximating a counterexample function using neural networks with widths w=n and w=n+1, studying how depth affects approximation quality and identifying dying neurons as a key factor.

Details

Motivation: To understand the approximation capabilities of neural networks around the critical width threshold w>n, specifically examining the counterexample function used to prove the necessity of this condition.

Method: Approximated the counterexample function f:R^n->R using neural networks with widths w=n and w=n+1, varying network depth and analyzing the resulting approximation behavior.

Result: The study reveals how approximation quality changes with depth and identifies that dying neurons (neurons that become inactive) are responsible for the observed behavior patterns.

Conclusion: The research provides insights into the approximation behavior of neural networks at the critical width threshold, highlighting the role of dying neurons in limiting approximation capabilities when w=n.

Abstract: In 2017, Hanin and Sellke showed that the class of arbitrarily deep, real-valued, feed-forward and ReLU-activated networks of width w forms a dense subset of the space of continuous functions on R^n, with respect to the topology of uniform convergence on compact sets, if and only if w>n holds. To show the necessity, a concrete counterexample function f:R^n->R was used. In this note we actually approximate this very f by neural networks in the two cases w=n and w=n+1 around the aforementioned threshold. We study how the approximation quality behaves if we vary the depth and what effect (spoiler alert: dying neurons) cause that behavior.

[457] Guided by the Experts: Provable Feature Learning Dynamic of Soft-Routed Mixture-of-Experts

Fangshuo Liao, Anastasios Kyrillidis

Main category: cs.LG

TL;DR: This paper provides theoretical convergence guarantees for joint training of soft-routed Mixture-of-Experts models with non-linear routers and experts, showing feature learning and parameter recovery in a student-teacher framework.

Details

Motivation: Despite widespread use of MoE architectures, theoretical understanding of their training dynamics is limited to simplified scenarios like separate expert-router optimization or top-1 routing with constructed datasets.

Method: The analysis uses a student-teacher framework with moderate over-parameterization, proving convergence for joint training of soft-routed MoE models with non-linear routers and experts, followed by pruning and fine-tuning.

Result: The student network undergoes a feature learning phase where the router is guided by experts to recover teacher parameters, and post-training pruning eliminates redundant neurons while maintaining convergence.

Conclusion: This is the first analysis providing novel insights into the optimization landscape of MoE architecture with provable convergence guarantees for joint training.

Abstract: Mixture-of-Experts (MoE) architectures have emerged as a cornerstone of modern AI systems. In particular, MoEs route inputs dynamically to specialized experts whose outputs are aggregated through weighted summation. Despite their widespread application, theoretical understanding of MoE training dynamics remains limited to either separate expert-router optimization or only top-1 routing scenarios with carefully constructed datasets. This paper advances MoE theory by providing convergence guarantees for joint training of soft-routed MoE models with non-linear routers and experts in a student-teacher framework. We prove that, with moderate over-parameterization, the student network undergoes a feature learning phase, where the router’s learning process is ``guided’’ by the experts, that recovers the teacher’s parameters. Moreover, we show that a post-training pruning can effectively eliminate redundant neurons, followed by a provably convergent fine-tuning process that reaches global optimality. To our knowledge, our analysis is the first to bring novel insights in understanding the optimization landscape of the MoE architecture.

[458] A Broader View of Thompson Sampling

Yanlin Qu, Hongseok Namkoong, Assaf Zeevi

Main category: cs.LG

TL;DR: Thompson Sampling’s exploration-exploitation mechanism is explained by recasting it as an online optimization algorithm using a ‘faithful’ stationarization approach that preserves the original problem structure.

Details

Motivation: To understand the exact mechanism through which Thompson Sampling balances exploration and exploitation, which remains mysterious despite its widespread use and strong performance.

Method: Recast Thompson Sampling as an online optimization algorithm using ‘faithful’ stationarization of the regret formulation, converting the finite horizon problem into a stationary counterpart that preserves the original objective structure.

Result: Thompson Sampling admits a simple online optimization form that mimics the Bellman-optimal policy, where greediness is regularized by a measure of residual uncertainty based on point-biserial correlation.

Conclusion: This approach reveals how Thompson Sampling balances exploration-exploitation and provides a principled framework to study and improve Thompson’s original idea.

Abstract: Thompson Sampling is one of the most widely used and studied bandit algorithms, known for its simple structure, low regret performance, and solid theoretical guarantees. Yet, in stark contrast to most other families of bandit algorithms, the exact mechanism through which posterior sampling (as introduced by Thompson) is able to “properly” balance exploration and exploitation, remains a mystery. In this paper we show that the core insight to address this question stems from recasting Thompson Sampling as an online optimization algorithm. To distill this, a key conceptual tool is introduced, which we refer to as “faithful” stationarization of the regret formulation. Essentially, the finite horizon dynamic optimization problem is converted into a stationary counterpart which “closely resembles” the original objective (in contrast, the classical infinite horizon discounted formulation, that leads to the Gittins index, alters the problem and objective in too significant a manner). The newly crafted time invariant objective can be studied using Bellman’s principle which leads to a time invariant optimal policy. When viewed through this lens, Thompson Sampling admits a simple online optimization form that mimics the structure of the Bellman-optimal policy, and where greediness is regularized by a measure of residual uncertainty based on point-biserial correlation. This answers the question of how Thompson Sampling balances exploration-exploitation, and moreover, provides a principled framework to study and further improve Thompson’s original idea.

[459] Discriminative Feature Feedback with General Teacher Classes

Omri Bar Oz, Tosca Lechner, Sivan Sabato

Main category: cs.LG

TL;DR: This paper provides the first systematic theoretical analysis of Discriminative Feature Feedback (DFF), comparing it to classical learning protocols and characterizing mistake bounds in both realizable and non-realizable settings.

Details

Motivation: To understand the theoretical properties of DFF learning protocol and compare it with classical protocols like supervised learning and online learning, particularly examining how richer feedback affects learning performance.

Method: Theoretical analysis of DFF protocol using a general framework comparable to classical learning protocols, developing new notions of dimension to characterize mistake bounds.

Result: Characterized mistake bounds in realizable setting using new dimension concept, provided upper bound for non-realizable setting that cannot be improved, and showed that realizable dimension alone is insufficient for non-realizable bounds in DFF unlike online learning.

Conclusion: DFF differs fundamentally from online learning - richer feedback in DFF means realizable dimension alone cannot characterize non-realizable mistake bounds or no-regret algorithm existence, requiring different theoretical treatment.

Abstract: We study the theoretical properties of the interactive learning protocol Discriminative Feature Feedback (DFF) (Dasgupta et al., 2018). The DFF learning protocol uses feedback in the form of discriminative feature explanations. We provide the first systematic study of DFF in a general framework that is comparable to that of classical protocols such as supervised learning and online learning. We study the optimal mistake bound of DFF in the realizable and the non-realizable settings, and obtain novel structural results, as well as insights into the differences between Online Learning and settings with richer feedback such as DFF. We characterize the mistake bound in the realizable setting using a new notion of dimension. In the non-realizable setting, we provide a mistake upper bound and show that it cannot be improved in general. Our results show that unlike Online Learning, in DFF the realizable dimension is insufficient to characterize the optimal non-realizable mistake bound or the existence of no-regret algorithms.

[460] Test-Time Graph Search for Goal-Conditioned Reinforcement Learning

Evgenii Opryshko, Junwei Quan, Claas Voelcker, Yilun Du, Igor Gilitschenski

Main category: cs.LG

TL;DR: TTGS is a test-time planning method for offline goal-conditioned RL that uses graph search over dataset states to plan subgoal sequences, improving long-horizon performance without training changes.

Details

Motivation: Offline GCRL struggles with long-horizon tasks due to temporal credit assignment and error accumulation, which are amplified in the offline setting.

Method: TTGS builds a weighted graph over dataset states using any distance/cost signal, performs fast search to find subgoal sequences, and executes them with a frozen policy. For value-based learners, it uses the learned value function as distance metric.

Result: On OGBench benchmark, TTGS improves success rates of multiple base learners on challenging locomotion tasks.

Conclusion: Simple metric-guided test-time planning effectively enhances offline GCRL performance for long-horizon tasks without requiring training modifications or additional supervision.

Abstract: Offline goal-conditioned reinforcement learning (GCRL) trains policies that reach user-specified goals at test time, providing a simple, unsupervised, domain-agnostic way to extract diverse behaviors from unlabeled, reward-free datasets. Nonetheless, long-horizon decision making remains difficult for GCRL agents due to temporal credit assignment and error accumulation, and the offline setting amplifies these effects. To alleviate this issue, we introduce Test-Time Graph Search (TTGS), a lightweight planning approach to solve the GCRL task. TTGS accepts any state-space distance or cost signal, builds a weighted graph over dataset states, and performs fast search to assemble a sequence of subgoals that a frozen policy executes. When the base learner is value-based, the distance is derived directly from the learned goal-conditioned value function, so no handcrafted metric is needed. TTGS requires no changes to training, no additional supervision, no online interaction, and no privileged information, and it runs entirely at inference. On the OGBench benchmark, TTGS improves success rates of multiple base learners on challenging locomotion tasks, demonstrating the benefit of simple metric-guided test-time planning for offline GCRL.

[461] Dynamic Regret Bounds for Online Omniprediction with Long Term Constraints

Yahav Bechavod, Jiuyao Lu, Aaron Roth

Main category: cs.LG

TL;DR: An algorithm for online omniprediction with long-term constraints that provides dynamic regret guarantees for all agents while ensuring vanishing constraint violation, without requiring agents to maintain state.

Details

Motivation: To develop a framework where a learner generates predictions for downstream decision makers with different utility and constraint functions, enabling worst-case utility guarantees and minimized constraint violations across all agents.

Method: The algorithm produces predictions that allow downstream decision makers to select actions ‘as if’ state predictions are correct, solving one-round constrained optimization problems without maintaining state across rounds.

Result: First algorithm achieving simultaneous dynamic regret guarantees for all agents (measured against changing action sequences) while ensuring vanishing constraint violation for each agent.

Conclusion: The proposed algorithm successfully addresses online omniprediction with long-term constraints, providing dynamic regret bounds and constraint satisfaction without requiring agents to maintain state information.

Abstract: We present an algorithm guaranteeing dynamic regret bounds for online omniprediction with long term constraints. The goal in this recently introduced problem is for a learner to generate a sequence of predictions which are broadcast to a collection of downstream decision makers. Each decision maker has their own utility function, as well as a vector of constraint functions, each mapping their actions and an adversarially selected state to reward or constraint violation terms. The downstream decision makers select actions “as if” the state predictions are correct, and the goal of the learner is to produce predictions such that all downstream decision makers choose actions that give them worst-case utility guarantees while minimizing worst-case constraint violation. Within this framework, we give the first algorithm that obtains simultaneous \emph{dynamic regret} guarantees for all of the agents – where regret for each agent is measured against a potentially changing sequence of actions across rounds of interaction, while also ensuring vanishing constraint violation for each agent. Our results do not require the agents themselves to maintain any state – they only solve one-round constrained optimization problems defined by the prediction made at that round.

[462] GTCN-G: A Residual Graph-Temporal Fusion Network for Imbalanced Intrusion Detection (Preprint)

Tianxiang Xu, Zhichao Wen, Xinyu Zhao, Qi Hu, Yan Li, Chang Liu

Main category: cs.LG

TL;DR: GTCN-G is a novel deep learning framework that combines Gated Temporal Convolutional Networks and Graph Neural Networks with residual learning to address class imbalance in intrusion detection systems, achieving state-of-the-art performance.

Details

Motivation: To overcome challenges in intrusion detection systems caused by network threat complexity and class imbalance in traffic data, and to create a framework that integrates both temporal and structural information while addressing data imbalance.

Method: Proposes GTCN-G framework that fuses Gated TCN for temporal feature extraction with GCN for graph structure learning, incorporating Graph Attention Network with residual connections to preserve original features and mitigate class imbalance.

Result: Extensive experiments on UNSW-NB15 and ToN-IoT datasets show GTCN-G achieves state-of-the-art performance, significantly outperforming existing baseline models in both binary and multi-class classification tasks.

Conclusion: The proposed GTCN-G framework effectively addresses class imbalance in intrusion detection by synergistically integrating temporal and structural information through residual learning mechanisms, demonstrating superior performance over existing approaches.

Abstract: The escalating complexity of network threats and the inherent class imbalance in traffic data present formidable challenges for modern Intrusion Detection Systems (IDS). While Graph Neural Networks (GNNs) excel in modeling topological structures and Temporal Convolutional Networks (TCNs) are proficient in capturing time-series dependencies, a framework that synergistically integrates both while explicitly addressing data imbalance remains an open challenge. This paper introduces a novel deep learning framework, named Gated Temporal Convolutional Network and Graph (GTCN-G), engineered to overcome these limitations. Our model uniquely fuses a Gated TCN (G-TCN) for extracting hierarchical temporal features from network flows with a Graph Convolutional Network (GCN) designed to learn from the underlying graph structure. The core innovation lies in the integration of a residual learning mechanism, implemented via a Graph Attention Network (GAT). This mechanism preserves original feature information through residual connections, which is critical for mitigating the class imbalance problem and enhancing detection sensitivity for rare malicious activities (minority classes). We conducted extensive experiments on two public benchmark datasets, UNSW-NB15 and ToN-IoT, to validate our approach. The empirical results demonstrate that the proposed GTCN-G model achieves state-of-the-art performance, significantly outperforming existing baseline models in both binary and multi-class classification tasks.

[463] Evolutionary Profiles for Protein Fitness Prediction

Jigang Fan, Xiaoran Jiao, Shengdong Lin, Zhanming Liang, Weian Mao, Chenchen Jing, Hao Chen, Chunhua Shen

Main category: cs.LG

TL;DR: EvoIF is a lightweight protein fitness prediction model that combines within-family evolutionary profiles from homologs with cross-family structural constraints from inverse folding, achieving state-of-the-art performance with minimal training data.

Details

Motivation: Protein fitness prediction is limited by small experimental datasets relative to the vast sequence space. Existing protein language models show strong zero-shot prediction, but there's a need for more efficient and robust methods.

Method: Interpret natural evolution as reward maximization and masked language modeling as inverse reinforcement learning. EvoIF integrates two evolutionary signals: within-family profiles from homologs and cross-family structural-evolutionary constraints from inverse folding logits, fused via a compact transition block.

Result: On ProteinGym (217 assays, >2.5M mutants), EvoIF achieves state-of-the-art or competitive performance using only 0.15% of training data and fewer parameters than large models. The two evolutionary profiles are complementary and improve robustness across various conditions.

Conclusion: EvoIF provides an efficient framework for protein fitness prediction by unifying evolutionary and structural constraints, demonstrating that lightweight models can achieve strong performance with minimal training data.

Abstract: Predicting the fitness impact of mutations is central to protein engineering but constrained by limited assays relative to the size of sequence space. Protein language models (pLMs) trained with masked language modeling (MLM) exhibit strong zero-shot fitness prediction; we provide a unifying view by interpreting natural evolution as implicit reward maximization and MLM as inverse reinforcement learning (IRL), in which extant sequences act as expert demonstrations and pLM log-odds serve as fitness estimates. Building on this perspective, we introduce EvoIF, a lightweight model that integrates two complementary sources of evolutionary signal: (i) within-family profiles from retrieved homologs and (ii) cross-family structural-evolutionary constraints distilled from inverse folding logits. EvoIF fuses sequence-structure representations with these profiles via a compact transition block, yielding calibrated probabilities for log-odds scoring. On ProteinGym (217 mutational assays; >2.5M mutants), EvoIF and its MSA-enabled variant achieve state-of-the-art or competitive performance while using only 0.15% of the training data and fewer parameters than recent large models. Ablations confirm that within-family and cross-family profiles are complementary, improving robustness across function types, MSA depths, taxa, and mutation depths. The codes will be made publicly available at https://github.com/aim-uofa/EvoIF.

[464] MolGA: Molecular Graph Adaptation with Pre-trained 2D Graph Encoder

Xingtong Yu, Chang Zhou, Xinming Zhang, Yuan Fang

Main category: cs.LG

TL;DR: MolGA adapts pre-trained 2D graph encoders for molecular applications by incorporating molecular domain knowledge through molecular alignment and conditional adaptation mechanisms.

Details

Motivation: Existing pre-trained 2D graph encoders overlook rich molecular domain knowledge associated with submolecular instances (atoms and bonds), while molecular pre-training approaches lack flexibility to integrate diverse knowledge types.

Method: 1) Molecular alignment strategy to bridge pre-trained topological representations with domain-knowledge representations; 2) Conditional adaptation mechanism that generates instance-specific tokens for fine-grained integration of molecular domain knowledge.

Result: Extensive experiments on eleven public datasets demonstrate the effectiveness of MolGA.

Conclusion: MolGA provides a practical approach to reuse pre-trained 2D encoders while incorporating molecular domain knowledge during downstream adaptation.

Abstract: Molecular graph representation learning is widely used in chemical and biomedical research. While pre-trained 2D graph encoders have demonstrated strong performance, they overlook the rich molecular domain knowledge associated with submolecular instances (atoms and bonds). While molecular pre-training approaches incorporate such knowledge into their pre-training objectives, they typically employ designs tailored to a specific type of knowledge, lacking the flexibility to integrate diverse knowledge present in molecules. Hence, reusing widely available and well-validated pre-trained 2D encoders, while incorporating molecular domain knowledge during downstream adaptation, offers a more practical alternative. In this work, we propose MolGA, which adapts pre-trained 2D graph encoders to downstream molecular applications by flexibly incorporating diverse molecular domain knowledge. First, we propose a molecular alignment strategy that bridge the gap between pre-trained topological representations with domain-knowledge representations. Second, we introduce a conditional adaptation mechanism that generates instance-specific tokens to enable fine-grained integration of molecular domain knowledge for downstream tasks. Finally, we conduct extensive experiments on eleven public datasets, demonstrating the effectiveness of MolGA.

[465] MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, Bo Dai

Main category: cs.LG

TL;DR: MLE-Smith is an automated multi-agent pipeline that transforms raw datasets into competition-style machine learning engineering challenges, addressing the scalability and quality issues of manually curated MLE benchmarks.

Details

Motivation: Current MLE benchmarks suffer from low scalability and limited applicability due to reliance on static, manually curated tasks that require extensive time and manual effort to produce.

Method: A fully automated multi-agent pipeline using generate-verify-execute paradigm with structured task design, standardized refactoring, and hybrid verification mechanism (structural rules + semantic soundness) plus interactive execution for empirical validation.

Result: Applied to 224 real-world datasets, generated 606 tasks spanning multiple categories, objectives, and modalities. Evaluation shows strong correlation between LLM performance on MLE-Smith tasks and human-designed tasks.

Conclusion: MLE-Smith effectively scales up MLE tasks while maintaining quality, demonstrating strong correlation with human-designed benchmarks and broad applicability across diverse real-world datasets.

Abstract: While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data is significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks, demanding extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline, to transform raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm for scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The proposed multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 of real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith can work effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of the MLE-Smith to scaling up MLE tasks, while maintaining task quality.

[466] h1: Bootstrapping LLMs to Reason over Longer Horizons via Reinforcement Learning

Sumeet Ramesh Motwani, Alesia Ivanova, Ziyang Cai, Philip Torr, Riashat Islam, Shital Shah, Christian Schroeder de Witt, Charles London

Main category: cs.LG

TL;DR: A scalable method to bootstrap long-horizon reasoning using only short-horizon data by synthetically composing problems into complex dependency chains and training with outcome-only rewards under an automatic curriculum.

Details

Motivation: Large language models perform well on short-horizon reasoning but struggle with longer reasoning chains, and existing approaches require costly step-level supervision or inference-time scaffolding that doesn't scale well.

Method: Synthetically compose simple problems into complex multi-step dependency chains of arbitrary length, then train models using outcome-only rewards under an automatic curriculum that increases complexity.

Result: Curriculum training on composed 6th-grade math problems boosts accuracy on longer competition-level benchmarks by up to 2.06x, with improvements significantly higher than baselines even at high pass@k.

Conclusion: The method provides an efficient path for scaling RL for long-horizon problems using only existing data, achieving exponential improvement in sample complexity over full-horizon training.

Abstract: Large language models excel at short-horizon reasoning tasks, but performance drops as reasoning horizon lengths increase. Existing approaches to combat this rely on inference-time scaffolding or costly step-level supervision, neither of which scales easily. In this work, we introduce a scalable method to bootstrap long-horizon reasoning capabilities using only existing, abundant short-horizon data. Our approach synthetically composes simple problems into complex, multi-step dependency chains of arbitrary length. We train models on this data using outcome-only rewards under a curriculum that automatically increases in complexity, allowing RL training to be scaled much further without saturating. Empirically, our method generalizes remarkably well: curriculum training on composed 6th-grade level math problems (GSM8K) boosts accuracy on longer, competition-level benchmarks (GSM-Symbolic, MATH-500, AIME) by up to 2.06x. Importantly, our long-horizon improvements are significantly higher than baselines even at high pass@k, showing that models can learn new reasoning paths under RL. Theoretically, we show that curriculum RL with outcome rewards achieves an exponential improvement in sample complexity over full-horizon training, providing training signal comparable to dense supervision. h1 therefore introduces an efficient path towards scaling RL for long-horizon problems using only existing data.

[467] Interpretable Clustering: A Survey

Lianyu Hu, Mudi Jiang, Junjie Dong, Xinying Liu, Zengyou He

Main category: cs.LG

TL;DR: This paper provides a comprehensive review of explainable clustering algorithms to address the growing need for transparency in high-stakes domains like healthcare and finance.

Details

Motivation: Current clustering research focuses on accuracy and efficiency at the expense of interpretability, but high-stakes applications require transparent and justifiable clustering outcomes for user trust and regulatory compliance.

Method: The paper conducts a structured review of explainable clustering algorithms, identifies key criteria to distinguish methods, and creates a taxonomy with an open repository of interpretable clustering methods.

Result: Developed a comprehensive framework for categorizing explainable clustering methods and established an open repository (https://github.com/hulianyu/Awesome-Interpretable-Clustering) to organize representative methods.

Conclusion: The survey helps researchers select appropriate explainable clustering methods for specific applications and promotes the development of both efficient and transparent clustering algorithms.

Abstract: In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent. For convenient access and reference, an open repository organizes representative and emerging interpretable clustering methods under the taxonomy proposed in this survey, available at https://github.com/hulianyu/Awesome-Interpretable-Clustering

[468] Error Bounds for Physics-Informed Neural Networks in Fokker-Planck PDEs

Chun-Wei Kong, Luca Laurenti, Jay McMahon, Morteza Lahijanian

Main category: cs.LG

TL;DR: PINNs can approximate Fokker-Planck PDE solutions for stochastic processes, with theoretical and practical error bounds developed that generalize to other linear PDEs.

Details

Motivation: Solving Fokker-Planck PDEs for probability density functions is generally infeasible in closed form, requiring alternative approximation methods.

Method: Use physics-informed neural networks (PINNs) trained to approximate PDF solutions, with theoretical framework for constructing tight error bounds.

Result: Empirical validation on nonlinear, high-dimensional, and chaotic systems shows correct error bounds and significant computational speedup over Monte Carlo methods.

Conclusion: PINNs provide scalable and efficient approximation of PDF solutions with verifiable error bounds that generalize to other linear PDEs.

Abstract: Stochastic differential equations are commonly used to describe the evolution of stochastic processes. The state uncertainty of such processes is best represented by the probability density function (PDF), whose evolution is governed by the Fokker-Planck partial differential equation (FP-PDE). However, it is generally infeasible to solve the FP-PDE in closed form. In this work, we show that physics-informed neural networks (PINNs) can be trained to approximate the solution PDF. Our main contribution is the analysis of PINN approximation error: we develop a theoretical framework to construct tight error bounds using PINNs. In addition, we derive a practical error bound that can be efficiently constructed with standard training methods. We discuss that this error-bound framework generalizes to approximate solutions of other linear PDEs. Empirical results on nonlinear, high-dimensional, and chaotic systems validate the correctness of our error bounds while demonstrating the scalability of PINNs and their significant computational speedup in obtaining accurate PDF solutions compared to the Monte Carlo approach.

[469] Machine Learning and Multi-source Remote Sensing in Forest Aboveground Biomass Estimation: A Review

Autumn Nguyen, Sulagna Saha

Main category: cs.LG

TL;DR: Systematic review of 25 studies on forest aboveground biomass estimation using machine learning and remote sensing, finding Random Forest most commonly used and Extreme Gradient Boosting most effective, with multi-sensor approaches being particularly successful.

Details

Motivation: Quantifying forest aboveground biomass is crucial for environmental protection policies, but there's a lack of systematic review on recent combinations of ML methods and multiple remote sensing sources considering forest ecological characteristics.

Method: Systematic analysis of 25 papers meeting strict inclusion criteria from over 80 related studies, identifying all ML methods and combinations of remote sensing data used.

Result: Random Forest appeared most frequently (88% of studies), Extreme Gradient Boosting showed superior performance in 75% of comparative studies, Sentinel-1 was most utilized remote sensing source, and multi-sensor approaches proved especially effective.

Conclusion: The findings provide recommendations for which sensing sources, variables, and methods to consider when integrating machine learning and remote sensing for forest aboveground biomass estimation.

Abstract: Quantifying forest aboveground biomass (AGB) is crucial for informing decisions and policies that will protect the planet. Machine learning (ML) and remote sensing (RS) techniques have been used to do this task more effectively, yet there lacks a systematic review on the most recent working combinations of ML methods and multiple RS sources, especially with the consideration of the forests’ ecological characteristics. This study systematically analyzed 25 papers that met strict inclusion criteria from over 80 related studies, identifying all ML methods and combinations of RS data used. Random Forest had the most frequent appearance (88% of studies), while Extreme Gradient Boosting showed superior performance in 75% of the studies in which it was compared with other methods. Sentinel-1 emerged as the most utilized remote sensing source, with multi-sensor approaches (e.g., Sentinel-1, Sentinel-2, and LiDAR) proving especially effective. Our findings provide grounds for recommending which sensing sources, variables, and methods to consider using when integrating ML and RS for forest AGB estimation.

[470] FedAGHN: Personalized Federated Learning with Attentive Graph HyperNetworks

Jiarui Song, Yunheng Shen, Chengbin Hou, Pengyu Wang, Jinbao Wang, Ke Tang, Hairong Lv

Main category: cs.LG

TL;DR: FedAGHN uses attentive graph hypernetworks to dynamically capture fine-grained collaborative relationships in personalized federated learning, generating client-specific personalized models through collaboration graphs.

Details

Motivation: Address statistical heterogeneity in federated learning by learning appropriate collaborative relationships that vary across scenarios and FL process stages.

Method: Uses Attentive Graph HyperNetworks (AGHNs) to model client-specific collaborative relationships, construct collaboration graphs, and derive collaboration weights through tunable attentive mechanisms for personalized model aggregation.

Result: Extensive experiments demonstrate FedAGHN’s superiority, with visualizations showing effectiveness of learned collaboration graphs.

Conclusion: FedAGHN effectively captures dynamic collaborative relationships in PFL through attentive graph networks, enabling personalized model generation that adapts to varying client scenarios.

Abstract: Personalized Federated Learning (PFL) aims to address the statistical heterogeneity of data across clients by learning the personalized model for each client. Among various PFL approaches, the personalized aggregation-based approach conducts parameter aggregation in the server-side aggregation phase to generate personalized models, and focuses on learning appropriate collaborative relationships among clients for aggregation. However, the collaborative relationships vary in different scenarios and even at different stages of the FL process. To this end, we propose Personalized Federated Learning with Attentive Graph HyperNetworks (FedAGHN), which employs Attentive Graph HyperNetworks (AGHNs) to dynamically capture fine-grained collaborative relationships and generate client-specific personalized initial models. Specifically, AGHNs empower graphs to explicitly model the client-specific collaborative relationships, construct collaboration graphs, and introduce tunable attentive mechanism to derive the collaboration weights, so that the personalized initial models can be obtained by aggregating parameters over the collaboration graphs. Extensive experiments can demonstrate the superiority of FedAGHN. Moreover, a series of visualizations are presented to explore the effectiveness of collaboration graphs learned by FedAGHN.

[471] A Dual-Agent Adversarial Framework for Robust Generalization in Deep Reinforcement Learning

Zhengpeng Xie, Yulong Zhang

Main category: cs.LG

TL;DR: A dual-agent adversarial policy learning framework is proposed to improve generalization in reinforcement learning by having agents learn to perturb each other’s policies while maintaining their own stability, without requiring human prior knowledge.

Details

Motivation: RL models often fail to generalize to minor task variations like background color changes, showing overfitting despite enhanced decision-making capabilities.

Method: A game process between two agents where each seeks to maximize the impact of perturbing the opponent’s policy by producing representation differences for the same state, while maintaining stability against such perturbations.

Result: Extensive experiments on Procgen benchmark show significant generalization improvement, especially in hard-level tasks, outperforming baseline methods by a large margin.

Conclusion: The adversarial framework marks a significant step forward in generalization capabilities of deep reinforcement learning and can be applied to various RL algorithms like PPO.

Abstract: Recently, empowered with the powerful capabilities of neural networks, reinforcement learning (RL) has successfully tackled numerous challenging tasks. However, while these models demonstrate enhanced decision-making abilities, they are increasingly prone to overfitting. For instance, a trained RL model often fails to generalize to even minor variations of the same task, such as a change in background color or other minor semantic differences. To address this issue, we propose a dual-agent adversarial policy learning framework, which allows agents to spontaneously learn the underlying semantics without introducing any human prior knowledge. Specifically, our framework involves a game process between two agents: each agent seeks to maximize the impact of perturbing on the opponent’s policy by producing representation differences for the same state, while maintaining its own stability against such perturbations. This interaction encourages agents to learn generalizable policies, capable of handling irrelevant features from the high-dimensional observations. Extensive experimental results on the Procgen benchmark demonstrate that the adversarial process significantly improves the generalization performance of both agents, while also being applied to various RL algorithms, e.g., Proximal Policy Optimization (PPO). With the adversarial framework, the RL agent outperforms the baseline methods by a significant margin, especially in hard-level tasks, marking a significant step forward in the generalization capabilities of deep reinforcement learning.

[472] Achieving Hyperbolic-Like Expressiveness with Arbitrary Euclidean Regions: A New Approach to Hierarchical Embeddings

Hui Yang, Jiaoyan Chen

Main category: cs.LG

TL;DR: RegD is a flexible Euclidean framework that uses arbitrary geometric regions as embeddings to represent hierarchical data, achieving hyperbolic-like expressiveness while enabling integration with semantic relationship modeling.

Details

Motivation: Current hyperbolic embedding methods rely on specific geometric constructs, limiting generalizability and making it difficult to integrate with techniques that model semantic relationships beyond pure hierarchies.

Method: RegD operates in Euclidean space using arbitrary geometric regions (boxes, balls) as embedding representations, incorporating depth-based dissimilarity to emulate key properties of hyperbolic geometry including exponential growth.

Result: Empirical evaluation on diverse real-world datasets shows consistent performance gains over state-of-the-art methods and demonstrates RegD’s potential for broader applications like ontology embedding.

Conclusion: RegD provides a flexible Euclidean framework that achieves hyperbolic-like expressiveness while supporting integration with semantic relationship modeling beyond pure hierarchies.

Abstract: Hierarchical data is common in many domains like life sciences and e-commerce, and its embeddings often play a critical role. While hyperbolic embeddings offer a theoretically grounded approach to representing hierarchies in low-dimensional spaces, current methods often rely on specific geometric constructs as embedding candidates. This reliance limits their generalizability and makes it difficult to integrate with techniques that model semantic relationships beyond pure hierarchies, such as ontology embeddings. In this paper, we present RegD, a flexible Euclidean framework that supports the use of arbitrary geometric regions – such as boxes and balls – as embedding representations. Although RegD operates entirely in Euclidean space, we formally prove that it achieves hyperbolic-like expressiveness by incorporating a depth-based dissimilarity between regions, enabling it to emulate key properties of hyperbolic geometry, including exponential growth. Our empirical evaluation on diverse real-world datasets shows consistent performance gains over state-of-the-art methods and demonstrates RegD’s potential for broader applications such as the ontology embedding task that goes beyond hierarchy.

[473] LLM Unlearning via Neural Activation Redirection

William F. Shen, Xinchi Qiu, Meghdad Kurmanji, Alex Iacob, Lorenzo Sani, Yihong Chen, Nicola Cancedda, Nicholas D. Lane

Main category: cs.LG

TL;DR: LUNAR is a novel LLM unlearning method that redirects representations of unlearned data to activation regions expressing inability to answer, achieving state-of-the-art performance with superior controllability and efficiency.

Details

Motivation: Existing LLM unlearning methods struggle with balancing unlearning efficacy and model utility, and lack inference-time controllability to emulate base model behavior as if it had never seen the unlearned data.

Method: Based on Linear Representation Hypothesis, LUNAR redirects representations of unlearned data to activation regions that express inability to answer. It reduces parameter updates to a single down-projection matrix, eliminating the need for contrastive features.

Result: Achieves 2.9x-11.7x improvement in combined unlearning efficacy and model utility score across various base models. Generates coherent, contextually appropriate responses post-unlearning. Enhances efficiency by 20x and demonstrates robustness to white-box adversarial attacks and sequential unlearning requests.

Conclusion: LUNAR provides an effective, efficient, and controllable solution for selective knowledge removal from LLMs, with superior performance, robustness, and practical versatility in real-world scenarios.

Abstract: The ability to selectively remove knowledge from LLMs is highly desirable. However, existing methods often struggle with balancing unlearning efficacy and retain model utility, and lack controllability at inference time to emulate base model behavior as if it had never seen the unlearned data. In this paper, we propose LUNAR, a novel unlearning method grounded in the Linear Representation Hypothesis and operates by redirecting the representations of unlearned data to activation regions that expresses its inability to answer. We show that contrastive features are not a prerequisite for effective activation redirection, and LUNAR achieves state-of-the-art unlearning performance and superior controllability. Specifically, LUNAR achieves between 2.9x and 11.7x improvement in the combined unlearning efficacy and model utility score (Deviation Score) across various base models and generates coherent, contextually appropriate responses post-unlearning. Moreover, LUNAR effectively reduces parameter updates to a single down-projection matrix, a novel design that significantly enhances efficiency by 20x and robustness. Finally, we demonstrate that LUNAR is robust to white-box adversarial attacks and versatile in real-world scenarios, including handling sequential unlearning requests.

[474] MM-PoisonRAG: Disrupting Multimodal RAG with Local and Global Poisoning Attacks

Hyeonjeong Ha, Qiusi Zhan, Jeonghwan Kim, Dimitrios Bralios, Saikrishna Sanniboina, Nanyun Peng, Kai-Wei Chang, Daniel Kang, Heng Ji

Main category: cs.LG

TL;DR: MM-PoisonRAG is the first framework to systematically attack multimodal RAG systems through knowledge poisoning, demonstrating high success rates in manipulating or disrupting model outputs.

Details

Motivation: To expose critical safety vulnerabilities in multimodal RAG systems where adversaries can inject adversarial content into knowledge bases to steer models toward incorrect or harmful responses.

Method: Proposes two attack strategies: Localized Poisoning Attack (LPA) for targeted misinformation on specific queries, and Globalized Poisoning Attack (GPA) for broad disruption across all queries using single adversarial knowledge injections.

Result: LPA achieves up to 56% attack success rate for targeted manipulation, while GPA reduces model accuracy to 0% with just one adversarial injection, revealing significant system fragility.

Conclusion: Multimodal RAG systems are highly vulnerable to knowledge poisoning attacks, highlighting an urgent need for developing effective defense mechanisms.

Abstract: Multimodal large language models with Retrieval Augmented Generation (RAG) have significantly advanced tasks such as multimodal question answering by grounding responses in external text and images. This grounding improves factuality, reduces hallucination, and extends reasoning beyond parametric knowledge. However, this reliance on external knowledge poses a critical yet underexplored safety risk: knowledge poisoning attacks, where adversaries deliberately inject adversarial multimodal content into external knowledge bases to steer model toward generating incorrect or even harmful responses. To expose such vulnerabilities, we propose MM-PoisonRAG, the first framework to systematically design knowledge poisoning in multimodal RAG. We introduce two complementary attack strategies: Localized Poisoning Attack (LPA), which implants targeted multimodal misinformation to manipulate specific queries, and Globalized Poisoning Attack (GPA), which inserts a single adversarial knowledge to broadly disrupt reasoning and induce nonsensical responses across all queries. Comprehensive experiments across tasks, models, and access settings show that LPA achieves targeted manipulation with attack success rates of up to 56%, while GPA completely disrupts model generation to 0% accuracy with just a single adversarial knowledge injection. Our results reveal the fragility of multimodal RAG and highlight the urgent need for defenses against knowledge poisoning.

[475] NdLinear: Preserving Multi-Dimensional Structure for Parameter-Efficient Neural Networks

Alex Reneau, Jerry Yao-Chieh Hu, Zhongfang Zhuang, Ting-Chun Liu, Xiang He, Judah Goldfeder, Nadav Timor, Allen G Roush, Ravid Shwartz-Ziv

Main category: cs.LG

TL;DR: NdLinear is a drop-in replacement for linear layers that operates directly on tensors without flattening, achieving dramatic parameter reductions while preserving expressivity through structured Tucker decomposition.

Details

Motivation: Current deep learning approaches often require flattening multidimensional inputs (images, medical scans, time series) for linear layers, which loses native data structure and can be inefficient.

Method: NdLinear applies transformations separately along each dimension of tensors, preserving data structure while reducing parameters through structured Tucker decomposition. It maintains expressivity and VC-dimension scaling.

Result: Extensive experiments show NdLinear achieves significant parameter reductions (up to 9× fewer parameters) with substantial efficiency gains and minimal memory overhead across CNNs, RNNs, Transformers, and MLPs on vision, language, time-series, and tabular tasks.

Conclusion: NdLinear provides a theoretically grounded, practical component for building more efficient neural architectures by processing data in its original N-dimensional form, though it has limitations with entangled spatial interactions.

Abstract: In deep learning, processing multidimensional inputs (e.g., images, medical scans, and time series) is an important task that often requires flattening the inputs. We introduce $\mathit{NdLinear}$, a drop-in replacement for linear layers that operates directly on tensors, requiring no flattening. By applying transformations separately along each dimension, NdLinear preserves native data structure while achieving dramatic parameter reductions, often by orders of magnitude, with minimal memory overhead. We prove NdLinear maintains expressivity through structured Tucker decomposition while preserving VC-dimension scaling. Extensive experiments demonstrate NdLinear’s capacity to achieve significant parameter reductions with substantial wall-clock efficiency gains and minimal memory overhead. For instance, our $\mathit{NdLinear-LoRA}$ matches or exceeds standard LoRA on language reasoning tasks using up to $9\times$ fewer parameters. Experiments across CNNs, RNNs, Transformers, and MLPs on vision, language, time-series, and tabular tasks consistently demonstrate NdLinear’s efficiency gains. While excelling at axis-separable tasks, NdLinear has limitations with entangled spatial interactions. By processing data in its original N-dimensional form, NdLinear provides a theoretically grounded, practical component for building more efficient neural architectures.

[476] Weight Ensembling Improves Reasoning in Language Models

Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, Aditi Raghunathan

Main category: cs.LG

TL;DR: WiSE-FT (weight interpolation between early and late SFT checkpoints) recovers diversity loss in reasoning models, improving Pass@k performance while maintaining Pass@1 gains, achieving better test-time scaling and complementary benefits to diversity-inducing decoding.

Details

Motivation: Address the failure mode where reasoning model diversity collapses during training, causing Pass@k to deteriorate despite Pass@1 improvements, limiting test-time scaling performance.

Method: Use WiSE-FT - interpolating weights between latest SFT checkpoint and early checkpoint to recover model diversity while maintaining performance gains.

Result: WiSE-FT almost completely recovers Pass@k while improving Pass@1, achieves better test-time scaling (Best@k, majority vote), and provides superior results with less data after RL tuning.

Conclusion: WiSE-FT reduces both bias and variance in Pass@k simultaneously, unlike temperature scaling which trades off between them, providing complementary performance gains that cannot be achieved through decoding strategies alone.

Abstract: We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and achieves superior results with less data when tuned further by reinforcement learning. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved only through diversity-inducing decoding strategies, like temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.

[477] MONAQ: Multi-Objective Neural Architecture Querying for Time-Series Analysis on Resource-Constrained Devices

Patara Trirat, Jae-Gil Lee

Main category: cs.LG

TL;DR: MONAQ is a framework that uses large language models to automate neural architecture search for time-series analysis on edge devices, achieving better performance and efficiency than handcrafted models and NAS baselines.

Details

Motivation: The growing use of smartphones and IoT devices requires efficient time-series analysis on resource-constrained hardware for applications like human activity recognition and air quality prediction. Existing hardware-aware NAS methods don't focus on general time-series analysis with edge deployment.

Method: MONAQ reformulates NAS into Multi-Objective Neural Architecture Querying tasks using LLMs. It features multimodal query generation for processing time-series inputs and hardware constraints, and an LLM agent-based multi-objective search that generates deployment-ready code. It integrates numerical data, time-series images, and textual descriptions.

Result: Experiments on fifteen datasets show that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.

Conclusion: MONAQ successfully leverages LLMs for automated neural architecture discovery in time-series analysis, providing deployment-ready models that are both high-performing and efficient for edge devices.

Abstract: The growing use of smartphones and IoT devices necessitates efficient time-series analysis on resource-constrained hardware, which is critical for sensing applications such as human activity recognition and air quality prediction. Recent efforts in hardware-aware neural architecture search (NAS) automate architecture discovery for specific platforms; however, none focus on general time-series analysis with edge deployment. Leveraging the problem-solving and reasoning capabilities of large language models (LLM), we propose MONAQ, a novel framework that reformulates NAS into Multi-Objective Neural Architecture Querying tasks. MONAQ is equipped with multimodal query generation for processing multimodal time-series inputs and hardware constraints, alongside an LLM agent-based multi-objective search to achieve deployment-ready models via code generation. By integrating numerical data, time-series images, and textual descriptions, MONAQ improves an LLM’s understanding of time-series data. Experiments on fifteen datasets demonstrate that MONAQ-discovered models outperform both handcrafted models and NAS baselines while being more efficient.

[478] AdaDim: Dimensionality Adaptation for SSL Representational Dynamics

Kiran Kokilepersaud, Mohit Prabhushankar, Ghassan AlRegib

Main category: cs.LG

TL;DR: The paper analyzes SSL training dynamics and finds that optimal performance comes from balancing representation dimensionality (H(R)) and mutual information between representation and embedding spaces (I(R;Z)), rather than maximizing H(R) or minimizing I(R;Z). They introduce AdaDim, an adaptive training strategy that achieves this balance without expensive techniques.

Details

Motivation: Current SSL literature views good representations as having high dimensionality (H(R)) and low mutual information between representation and embedding spaces (I(R;Z)), but lacks understanding of the training dynamics that influence their relationship.

Method: Introduces AdaDim, an adaptive training strategy that balances between increasing H(R) through feature decorrelation/sample uniformity and gradually regularizing I(R;Z) as training progresses.

Result: AdaDim achieves performance improvements of up to 3% over common SSL baselines without using expensive techniques like queues, clustering, predictor networks, or student-teacher architectures.

Conclusion: Optimal SSL performance comes from balancing representation dimensionality and mutual information between spaces, not from maximizing one while minimizing the other. AdaDim effectively achieves this balance through adaptive training dynamics.

Abstract: A key factor in effective Self-Supervised learning (SSL) is preventing dimensional collapse, where higher-dimensional representation spaces ($R$) span a lower-dimensional subspace. Therefore, SSL optimization strategies involve guiding a model to produce $R$ with a higher dimensionality ($H(R)$) through objectives that encourage decorrelation of features or sample uniformity in $R$. A higher $H(R)$ indicates that $R$ has greater feature diversity which is useful for generalization to downstream tasks. Alongside dimensionality optimization, SSL algorithms also utilize a projection head that maps $R$ into an embedding space $Z$. Recent work has characterized the projection head as a filter of noisy or irrelevant features from the SSL objective by reducing the mutual information $I(R;Z)$. Therefore, the current literature’s view is that a good SSL representation space should have a high $H(R)$ and a low $I(R;Z)$. However, this view of SSL is lacking in terms of an understanding of the underlying training dynamics that influences the relationship between both terms. Our analysis shows that the best performing SSL models do not have the highest $H(R)$ nor the lowest $I(R;Z)$, but effectively arrive at a balance between both. To take advantage of this analysis, we introduce AdaDim, a training strategy that leverages SSL training dynamics by adaptively balancing between increasing $H(R)$ through feature decorrelation and sample uniformity as well as gradual regularization of $I(R;Z)$ as training progresses. We show performance improvements of up to 3% over common SSL baselines despite our method not utilizing expensive techniques such as queues, clustering, predictor networks, or student-teacher architectures.

[479] Maximising the Utility of Validation Sets for Imbalanced Noisy-label Meta-learning

Dung Anh Hoang, Cuong Nguyen, Belagiannis Vasileios, Thanh-Toan Do, Gustavo Carneiro

Main category: cs.LG

TL;DR: Proposes a new meta-learning method (INOLML) that automatically builds optimal validation sets for imbalanced and noisy-label learning by maximizing utility based on informativeness, class balance, and label correctness.

Details

Motivation: Traditional meta-learning requires manually labeled balanced validation sets, which is sub-optimal and doesn't scale well with many classes. Existing automated heuristics are still sub-optimal.

Method: Analyzes meta-learning algorithm and proposes three criteria for validation set utility: informativeness, class distribution balance, and label correctness. Develops INOLML algorithm that automatically builds validation sets by maximizing these criteria.

Result: Shows significant improvements over previous meta-learning approaches and achieves new state-of-the-art performance on several benchmarks.

Conclusion: The proposed INOLML method effectively addresses limitations of traditional validation set construction in meta-learning for imbalanced and noisy-label scenarios.

Abstract: Meta-learning is an effective method to handle imbalanced and noisy-label learning, but it depends on a validation set containing randomly selected, manually labelled and balanced distributed samples. The random selection and manual labelling and balancing of this validation set is not only sub-optimal for meta-learning, but it also scales poorly with the number of classes. Hence, recent meta-learning papers have proposed ad-hoc heuristics to automatically build and label this validation set, but these heuristics are still sub-optimal for meta-learning. In this paper, we analyse the meta-learning algorithm and propose new criteria to characterise the utility of the validation set, based on: 1) the informativeness of the validation set; 2) the class distribution balance of the set; and 3) the correctness of the labels of the set. Furthermore, we propose a new imbalanced noisy-label meta-learning (INOLML) algorithm that automatically builds a validation set by maximising its utility using the criteria above. Our method shows significant improvements over previous meta-learning approaches and sets the new state-of-the-art on several benchmarks.

[480] MoRE-Brain: Routed Mixture of Experts for Interpretable and Generalizable Cross-Subject fMRI Visual Decoding

Yuxiang Wei, Yanteng Zhang, Xi Xiao, Tianyang Wang, Xiao Wang, Vince D. Calhoun

Main category: cs.LG

TL;DR: MoRE-Brain is a neuro-inspired framework for interpretable visual reconstruction from fMRI using a hierarchical Mixture-of-Experts architecture that mimics brain networks, achieving high fidelity and cross-subject generalization.

Details

Motivation: Current fMRI visual decoding methods prioritize reconstruction fidelity but overlook interpretability, which is essential for deriving neuroscientific insights. There's a need for methods that balance both high fidelity and interpretability.

Method: Uses hierarchical Mixture-of-Experts architecture with distinct experts processing fMRI from functionally related voxel groups. Experts encode fMRI into frozen CLIP space, then a finetuned diffusion model synthesizes images guided by expert outputs through dual-stage routing mechanism that dynamically weighs expert contributions.

Result: Achieves high reconstruction fidelity with effective utilization of fMRI signals. Demonstrates efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Provides enhanced mechanistic insight into how different brain regions shape semantic and spatial attributes of reconstructed images.

Conclusion: MoRE-Brain represents a substantial advance towards more generalizable and interpretable fMRI-based visual decoding, distinguishing genuine neural decoding from over-reliance on generative priors.

Abstract: Decoding visual experiences from fMRI offers a powerful avenue to understand human perception and develop advanced brain-computer interfaces. However, current progress often prioritizes maximizing reconstruction fidelity while overlooking interpretability, an essential aspect for deriving neuroscientific insight. To address this gap, we propose MoRE-Brain, a neuro-inspired framework designed for high-fidelity, adaptable, and interpretable visual reconstruction. MoRE-Brain uniquely employs a hierarchical Mixture-of-Experts architecture where distinct experts process fMRI signals from functionally related voxel groups, mimicking specialized brain networks. The experts are first trained to encode fMRI into the frozen CLIP space. A finetuned diffusion model then synthesizes images, guided by expert outputs through a novel dual-stage routing mechanism that dynamically weighs expert contributions across the diffusion process. MoRE-Brain offers three main advancements: First, it introduces a novel Mixture-of-Experts architecture grounded in brain network principles for neuro-decoding. Second, it achieves efficient cross-subject generalization by sharing core expert networks while adapting only subject-specific routers. Third, it provides enhanced mechanistic insight, as the explicit routing reveals precisely how different modeled brain regions shape the semantic and spatial attributes of the reconstructed image. Extensive experiments validate MoRE-Brain’s high reconstruction fidelity, with bottleneck analyses further demonstrating its effective utilization of fMRI signals, distinguishing genuine neural decoding from over-reliance on generative priors. Consequently, MoRE-Brain marks a substantial advance towards more generalizable and interpretable fMRI-based visual decoding. Code will be publicly available soon: https://github.com/yuxiangwei0808/MoRE-Brain.

[481] Domain Generalization by Rejecting Extreme Augmentations

Masih Aminbeidokhti, Fidel A. Guerrero Peña, Heitor Rapela Medeiros, Thomas Dubail, Eric Granger, Marco Pedersoli

Main category: cs.LG

TL;DR: This paper proposes a simple yet effective data augmentation training procedure for out-of-domain and domain generalization settings, achieving state-of-the-art performance on benchmark datasets.

Details

Motivation: Data augmentation is effective for in-domain settings but unclear for out-of-domain cases where test data follows different distributions. The paper aims to show that data augmentation can significantly improve performance in domain generalization scenarios.

Method: Proposes a three-step training procedure: (1) uniform sampling on standard data augmentation transformations, (2) increasing transformation strength to handle higher data variance in out-of-domain settings, and (3) using a new reward function to reject extreme transformations that could harm training.

Result: The proposed data augmentation scheme achieves comparable or better accuracy than state-of-the-art methods on benchmark domain generalization datasets.

Conclusion: Data augmentation can provide conspicuous and robust performance improvements in out-of-domain and domain generalization settings when using the proposed training procedure with stronger transformations and rejection of harmful extremes.

Abstract: Data augmentation is one of the most effective techniques for regularizing deep learning models and improving their recognition performance in a variety of tasks and domains. However, this holds for standard in-domain settings, in which the training and test data follow the same distribution. For the out-of-domain case, where the test data follow a different and unknown distribution, the best recipe for data augmentation is unclear. In this paper, we show that for out-of-domain and domain generalization settings, data augmentation can provide a conspicuous and robust improvement in performance. To do that, we propose a simple training procedure: (i) use uniform sampling on standard data augmentation transformations; (ii) increase the strength transformations to account for the higher data variance expected when working out-of-domain, and (iii) devise a new reward function to reject extreme transformations that can harm the training. With this procedure, our data augmentation scheme achieves a level of accuracy that is comparable to or better than state-of-the-art methods on benchmark domain generalization datasets. Code: https://github.com/Masseeh/DCAug

[482] FFT-based Dynamic Subspace Selection for Low-Rank Adaptive Optimization of Large Language Models

Ionut-Vlad Modoranu, Mher Safaryan, Erik Schultheis, Max Ryabinin, Artem Chumachenko, Dan Alistarh

Main category: cs.LG

TL;DR: Proposes a computationally efficient low-rank optimization method for LLMs using Discrete Cosine Transform (DCT) instead of expensive SVD/QR decompositions, achieving similar performance with faster runtime and 25% memory reduction.

Details

Motivation: Existing low-rank optimization methods using SVD/QR decompositions are computationally expensive and memory-intensive when applied individually to each layer in large language models.

Method: Two-step procedure using predefined DCT orthogonal matrix: 1) matmul with DCT matrix in O(n³) time, 2) lightweight sorting to select most relevant basis vectors. For large layers, uses FFT-based DCT computation in O(n² log(n)) time.

Result: Matches performance of costly SVD/QR methods while achieving faster runtime and up to 25% memory reduction across different model sizes in pre-training and fine-tuning tasks.

Conclusion: DCT-based low-rank optimization provides an efficient alternative to SVD/QR methods with rank-independent running time, maintaining performance while reducing computational and memory costs.

Abstract: Low-rank optimization has emerged as a promising direction in training large language models (LLMs) to improve running time and reduce the memory usage of adaptive optimizers by constraining learning to a lower-dimensional space. Prior work typically projects gradients of linear layers using approaches based on Singular Value Decomposition (SVD) or QR-decomposition. Applying these techniques individually to each layer in large models is computationally expensive and incurs additional memory costs due to storing the projection matrices. In this work, we propose a computationally efficient and conceptually simple, two-step procedure to approximate SVD/QR-based gradient projections into lower-dimensional spaces by using a predefined orthogonal matrix of the Discrete Cosine Transform (DCT). We dynamically select columns from the DCT matrix based on their alignment with the gradient of each layer. The effective projection matrices are obtained via a simple matmul with the DCT matrix in $O(n^3)$ time, followed by a lightweight sorting step to identify the most relevant basis vectors. For large layers, DCT can be computed via Makhoul’s $N$-point algorithm based on Fast Fourier Transform (FFT) in $O(n^2 \log(n))$ time. Due to the predefined nature of the orthogonal bases, they are computed once at the start of training. Our numerical experiments on both pre-training and fine-tuning tasks demonstrate the effectiveness of our dual strategy in approximating optimal low-rank projections, obtaining an approach with rank-independent running time that matches the performance of costly SVD/QR-based methods while achieving faster runtime and reduced memory usage by up to $25%$ across different model sizes. Our code is available at \href{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers}{\texttt{https://github.com/IST-DASLab/ISTA-DASLab-Optimizers}}.

[483] Generalizable Physics-Informed Learning for Stochastic Safety-Critical Systems

Zhuoyuan Wang, Albert Chern, Yorie Nakahira

Main category: cs.LG

TL;DR: Proposes a physics-informed learning framework to efficiently estimate long-term risk probabilities using short-term samples with limited risk events, by leveraging PDE characterizations of risk probabilities.

Details

Motivation: Existing risk quantification methods require extensive datasets with risk events over long time horizons, which are expensive to acquire. There's a need for efficient methods using limited short-term data.

Method: Establishes that four classes of long-term risk probabilities are characterized by specific PDEs. Introduces a physics-informed learning framework combining empirical data with PDE constraints to infer risk probabilities.

Result: The framework generalizes effectively beyond sampled states and time horizons, offers improved sample efficiency, rapid online inference under changing dynamics, and stable computation of probability gradients.

Conclusion: Embedding PDE constraints with explicit gradient terms improves interpolation and generalization between/beyond available data, enabling efficient long-term risk estimation from limited short-term samples.

Abstract: Accurate estimation of long-term risk is essential for the design and analysis of stochastic dynamical systems. Existing risk quantification methods typically rely on extensive datasets involving risk events observed over extended time horizons, which can be prohibitively expensive to acquire. Motivated by this gap, we propose an efficient method for learning long-term risk probabilities using short-term samples with limited occurrence of risk events. Specifically, we establish that four distinct classes of long-term risk probabilities are characterized by specific partial differential equations (PDEs). Using this characterization, we introduce a physics-informed learning framework that combines empirical data with physics information to infer risk probabilities. We then analyze the theoretical properties of this framework in terms of generalization and convergence. Through numerical experiments, we demonstrate that our framework not only generalizes effectively beyond the sampled states and time horizons but also offers additional benefits such as improved sample efficiency, rapid online inference capabilities under changing system dynamics, and stable computation of probability gradients. These results highlight how embedding PDE constraints, which contain explicit gradient terms and inform how risk probabilities depend on state, time horizon, and system parameters, improves interpolation and generalization between/beyond the available data.

[484] Want to train KANS at scale? Now UKAN!

Alireza Moradzadeh, Srimukh Prasad Veccham, Lukasz Wawrzyniak, Miles Macklin, Saee G. Paliwal

Main category: cs.LG

TL;DR: UKANs extend KANs to handle unbounded domains by using a coefficient-generator model and GPU acceleration, achieving significant speed and memory improvements.

Details

Motivation: Traditional KANs are limited to bounded domains due to predefined grids, restricting their applicability to real-world problems with unbounded inputs.

Method: Introduce UKANs with a coefficient-generator model for B-spline coefficients on unbounded grids, coupled with MLPs and KANs via positional encoding, plus GPU-accelerated library for efficiency.

Result: 3-30x speed-up and up to 1000x memory reduction compared to vanilla KANs, matching or surpassing KAN accuracy on regression, classification, and generative tasks.

Conclusion: UKANs enable effective function approximation on unbounded domains and large-scale training, demonstrated through molecular property prediction.

Abstract: Kolmogorov-Arnold Networks (KANs) have recently emerged as a powerful alternative to traditional multilayer perceptrons. However, their reliance on predefined, bounded grids restricts their ability to approximate functions on unbounded domains. To address this, we present Unbounded Kolmogorov-Arnold Networks (UKANs), a method that removes the need for bounded grids in traditional Kolmogorov-Arnold Networks (KANs). The key innovation of this method is a coefficient-generator (CG) model that produces, on the fly, only the B-spline coefficients required locally on an unbounded symmetric grid. UKANs couple multilayer perceptrons with KANs by feeding the positional encoding of grid groups into the CG model, enabling function approximation on unbounded domains without requiring data normalization. To reduce the computational cost of both UKANs and KANs, we introduce a GPU-accelerated library that lowers B-spline evaluation complexity by a factor proportional to the grid size, enabling large-scale learning by leveraging efficient memory management, in line with recent software advances such as FlashAttention and FlashFFTConv. Performance benchmarking confirms the superior memory and computational efficiency of our accelerated KAN (warpKAN), and UKANs, showing a 3-30x speed-up and up to 1000x memory reduction compared to vanilla KANs. Experiments on regression, classification, and generative tasks demonstrate the effectiveness of UKANs to match or surpass KAN accuracy. Finally, we use both accelerated KAN and UKAN in a molecular property prediction task, establishing the feasibility of large-scale end-to-end training with our optimized implementation.

[485] Exchangeability in Neural Network and its Application to Dynamic Pruning

Pu, Yi, Tianlang Chen, Yifan Yang, Sara Achour

Main category: cs.LG

TL;DR: ExPrune is a dynamic pruning method that enables multi-granularity partial computation on a per-input basis without changing model architecture or training, achieving 10.98-27.16% FLOPs reduction with minimal accuracy loss.

Details

Motivation: Modern neural networks have growing parameter counts, increasing memory and computational costs for inference. Existing approaches reduce model size before deployment or prune at runtime, but ExPrune aims to provide a more general, theory-grounded dynamic pruning solution.

Method: ExPrune uses statistical exchangeability theory to identify relationships between model parameters and intermediate values. It performs partial network evaluation, analyzes partial results statistics, and makes dynamic pruning decisions on-the-fly without architectural changes.

Result: ExPrune provides 10.98-17.33% FLOPs reduction with negligible accuracy drop and 21.61-27.16% FLOPs reduction with ≤1% accuracy drop across computer vision, graph, and language models. It also composes with static pruning, providing additional 10.24-14.39% FLOPs reduction on pre-pruned models.

Conclusion: ExPrune is an effective, theory-grounded dynamic pruning method that generalizes across architectures and domains, offering significant computational savings with minimal accuracy impact, and can be combined with existing static pruning techniques.

Abstract: Modern neural networks (NN) contain an ever-growing number of parameters, substantially increasing the memory and computational cost of inference. Researchers have explored various ways to reduce the inference cost of NNs by reducing the model size before deployment and dynamically pruning the inference computation at runtime. In this work, we present ExPrune, a general, dynamic pruning optimization that enables multi-granularity partial computation on a per-input basis. ExPrune requires no change to the model architecture or the training algorithm. ExPrune is based on our theoretical results that the relationship between certain model parameters and intermediate values can be described by a statistical property called exchangeability. By identifying exchangeable parameters and values in the model, we are able to first partially evaluate the network, analyze the statistics of the partial results, and make pruning decisions on the fly. Because ExPrune is theory grounded, it generalizes across model architectures in different problem domains. We evaluate ExPrune on one computer vision models, one graph model and one language model. ExPrune provides 10.98–17.33% reduction in FLOPs with negligible accuracy drop and 21.61–27.16% reduction in FLOPs with at most 1% accuracy drop. We also demonstrate that ExPrune composes with static magnitude pruning. On models that have been aggressively statically pruned, ExPrune still provides additional 10.24–11.11% reduction in FLOPs with negligible accuracy drop and 13.91–14.39% reduction in FLOPs with at most 1% accuracy drop.

[486] Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach

Henrique Donâncio, Antoine Barrier, Leah F. South, Florence Forbes

Main category: cs.LG

TL;DR: LRRL is a meta-learning approach that dynamically selects learning rates based on policy performance rather than training steps, achieving competitive or superior performance to tuned baselines and standard schedulers in deep RL.

Details

Motivation: Standard learning rate decay schedulers assume monotonic convergence and often misalign with the evolving dynamics of environments and policies during training, leading to premature or delayed adjustments.

Method: LRRL uses meta-learning to dynamically select learning rates based on policy performance rather than training steps, adaptively favoring rates that improve returns and remaining robust even with candidate values that individually cause divergence.

Result: Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers.

Conclusion: LRRL positions as a practical solution for adapting to non-stationary objectives in deep RL by dynamically adjusting learning rates based on performance feedback.

Abstract: In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.

[487] Reinforcement Learning for Dynamic Memory Allocation

Arisrei Lim, Abhiram Maddukuri

Main category: cs.LG

TL;DR: RL framework for dynamic memory allocation management that outperforms traditional algorithms like first-fit, best-fit, and worst-fit, especially in adversarial environments.

Details

Motivation: Traditional memory allocation algorithms fail to adapt to changing conditions, leading to fragmentation and suboptimal efficiency. RL offers potential for more adaptive strategies.

Method: RL agent continuously learns from system interactions using high-level and low-level action spaces, with history-aware policies leveraging previous allocation requests.

Result: RL-trained agents match and surpass traditional allocation strategies, particularly effective in environments with adversarial request patterns.

Conclusion: RL provides a promising approach for developing more adaptive and efficient memory allocation strategies that overcome limitations of hardcoded algorithms.

Abstract: In recent years, reinforcement learning (RL) has gained popularity and has been applied to a wide range of tasks. One such popular domain where RL has been effective is resource management problems in systems. We look to extend work on RL for resource management problems by considering the novel domain of dynamic memory allocation management. We consider dynamic memory allocation to be a suitable domain for RL since current algorithms like first-fit, best-fit, and worst-fit can fail to adapt to changing conditions and can lead to fragmentation and suboptimal efficiency. In this paper, we present a framework in which an RL agent continuously learns from interactions with the system to improve memory management tactics. We evaluate our approach through various experiments using high-level and low-level action spaces and examine different memory allocation patterns. Our results show that RL can successfully train agents that can match and surpass traditional allocation strategies, particularly in environments characterized by adversarial request patterns. We also explore the potential of history-aware policies that leverage previous allocation requests to enhance the allocator’s ability to handle complex request patterns. Overall, we find that RL offers a promising avenue for developing more adaptive and efficient memory allocation strategies, potentially overcoming limitations of hardcoded allocation algorithms.

[488] Quantum Rationale-Aware Graph Contrastive Learning for Jet Discrimination

Md Abrar Jahin, Md. Akmol Masud, M. F. Mridha, Nilanjan Dey, Zeyar Aung

Main category: cs.LG

TL;DR: Quantum Rationale-aware Graph Contrastive Learning (QRGCL) framework improves quark-gluon jet tagging with quantum rationale generator, achieving 77.53% AUC with only 45 parameters.

Details

Motivation: Existing contrastive learning frameworks struggle with rationale-aware augmentations, lack supervision for salient feature extraction, and face computational efficiency issues in particle jet tagging.

Method: Proposed QRGCL framework integrates quantum rationale generator (QRG) for graph contrastive learning, enabling effective rationale-aware augmentations and discriminative feature capture.

Result: QRGCL achieves 77.53% AUC on quark-gluon jet dataset with compact 45-parameter architecture, outperforming classical, quantum, and hybrid GCL/GNN benchmarks.

Conclusion: QRGCL shows potential to advance jet tagging and complex classification tasks in high-energy physics by addressing computational efficiency and feature extraction limitations.

Abstract: In high-energy physics, particle jet tagging plays a pivotal role in distinguishing quark from gluon jets using data from collider experiments. While graph-based deep learning methods have advanced this task beyond traditional feature-engineered approaches, the complex data structure and limited labeled samples present ongoing challenges. However, existing contrastive learning (CL) frameworks struggle to leverage rationale-aware augmentations effectively, often lacking supervision signals that guide the extraction of salient features and facing computational efficiency issues such as high parameter counts. In this study, we demonstrate that integrating a quantum rationale generator (QRG) within our proposed Quantum Rationale-aware Graph Contrastive Learning (QRGCL) framework significantly enhances jet discrimination performance, reducing reliance on labeled data and capturing discriminative features. Evaluated on the quark-gluon jet dataset, QRGCL achieves an AUC score of $77.53%$ while maintaining a compact architecture of only 45 QRG parameters, outperforming classical, quantum, and hybrid GCL and GNN benchmarks. These results highlight QRGCL’s potential to advance jet tagging and other complex classification tasks in high-energy physics, where computational efficiency and feature extraction limitations persist.

[489] Prefilled responses enhance zero-shot detection of AI-generated images

Zoher Kachwala, Danishjeet Singh, Danielle Yang, Filippo Menczer

Main category: cs.LG

TL;DR: The paper explores using Vision-Language Models (VLMs) for zero-shot detection of AI-generated images and introduces Prefill-Guided Thinking (PGT), a method that improves VLM performance by prefilling responses with task-aligned phrases.

Details

Motivation: Growing concerns over misuse of realistic AI-generated images highlight the need for reliable detection methods. Traditional supervised approaches fail to generalize to novel image generators and require large curated datasets.

Method: The authors evaluate pre-trained VLMs for zero-shot detection on three diverse benchmarks with images from 16 different generators. They introduce Prefill-Guided Thinking (PGT), which guides VLM reasoning by prefilling responses with task-aligned phrases like “Let’s examine the style and the synthesis artifacts”.

Result: Off-the-shelf VLMs perform poorly on AI-generated image detection, but PGT improves Macro F1 scores of three widely used open-source VLMs by up to 24% across various benchmarks.

Conclusion: Prefill-Guided Thinking effectively enhances VLM performance for zero-shot AI-generated image detection, providing a promising alternative to traditional supervised methods that struggle with generalization.

Abstract: As AI models generate increasingly realistic images, growing concerns over potential misuse underscore the need for reliable detection. Traditional supervised detection methods depend on large, curated datasets for training and often fail to generalize to novel, out-of-domain image generators. As an alternative, we explore pre-trained Vision-Language Models (VLMs) for zero-shot detection of AI-generated images. We evaluate VLM performance on three diverse benchmarks encompassing synthetic images of human faces, objects, and animals produced by 16 different state-of-the-art image generators. While off-the-shelf VLMs perform poorly on these datasets, we find that their reasoning can be guided effectively through simple response prefilling – a method we call Prefill-Guided Thinking (PGT). In particular, prefilling a VLM response with the task-aligned phrase “Let’s examine the style and the synthesis artifacts” improves the Macro F1 scores of three widely used open-source VLMs by up to 24%.

[490] VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction

Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher

Main category: cs.LG

TL;DR: VICON improves computational efficiency of In-Context Operator Networks by using vision transformers for patch-wise processing of 2D data, achieving better accuracy and faster inference while supporting flexible rollout strategies.

Details

Motivation: Existing ICONs process spatial points as individual tokens, which limits computational efficiency for dense data in higher dimensions. Vision transformers can enable more efficient patch-wise processing.

Method: VICON integrates vision transformer architectures to process 2D data through patch-wise operations while preserving ICON’s adaptability to multiphysics systems and varying timesteps.

Result: VICON reduces averaged last-step rollout error by 37.9% vs DPOT and 44.7% vs MPP, while requiring only 72.5% and 34.8% of their inference times respectively. It shows only 24.41% performance degradation in realistic scenarios vs 71.37%-74.49% for baselines.

Conclusion: VICON demonstrates superior computational efficiency and robustness for real-world deployment, supporting flexible rollout strategies without retraining and handling imperfect measurement systems effectively.

Abstract: In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose Vision In-Context Operator Networks (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON’s adaptability to multiphysics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines: DPOT and MPP, reducing the averaged last-step rollout error by 37.9% compared to DPOT and 44.7% compared to MPP, while requiring only 72.5% and 34.8% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in imperfect measurement systems where sampling frequencies may differ or frames might be dropped - common challenges in real-world settings - without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41% relative performance degradation compared to 71.37%-74.49% degradation in baseline methods, demonstrating its versatility for deploying in realistic applications. Our scripts for processing datasets and code are publicly available at https://github.com/Eydcao/VICON.

[491] Contrastive Graph Condensation: Advancing Data Versatility through Self-Supervised Learning

Xinyi Gao, Yayong Li, Tong Chen, Guanhua Ye, Wentao Zhang, Hongzhi Yin

Main category: cs.LG

TL;DR: CTGC introduces a self-supervised contrastive learning approach for graph condensation that disentangles node attribute and structural generation, overcoming limitations of label-dependent methods and improving cross-task generalization.

Details

Motivation: Existing graph condensation methods rely heavily on classification tasks and node labels, making them ineffective in label-sparse scenarios and limiting their generalization to other downstream tasks due to overfitting on class-specific information.

Method: CTGC uses a dual-branch framework with contrastive learning: one branch for node attribute generation and another for structural generation using positional embeddings. It employs alternating optimization with contrastive loss and model inversion for high-quality graph synthesis.

Result: Extensive experiments show CTGC outperforms state-of-the-art graph condensation methods across various downstream tasks, particularly effective in label-limited scenarios.

Conclusion: CTGC successfully addresses label dependency and generalization limitations in graph condensation through self-supervised contrastive learning, enabling effective graph condensation for multiple downstream tasks with minimal labels.

Abstract: With the increasing computation of training graph neural networks (GNNs) on large-scale graphs, graph condensation (GC) has emerged as a promising solution to synthesize a compact, substitute graph of the large-scale original graph for efficient GNN training. However, existing GC methods predominantly employ classification as the surrogate task for optimization, thus excessively relying on node labels and constraining their utility in label-sparsity scenarios. More critically, this surrogate task tends to overfit class-specific information within the condensed graph, consequently restricting the generalization capabilities of GC for other downstream tasks. To address these challenges, we introduce Contrastive Graph Condensation (CTGC), which adopts a self-supervised surrogate task to extract critical, causal information from the original graph and enhance the cross-task generalizability of the condensed graph. Specifically, CTGC employs a dual-branch framework to disentangle the generation of the node attributes and graph structures, where a dedicated structural branch is designed to explicitly encode geometric information through nodes’ positional embeddings. By implementing an alternating optimization scheme with contrastive loss terms, CTGC promotes the mutual enhancement of both branches and facilitates high-quality graph generation through the model inversion technique. Extensive experiments demonstrate that CTGC excels in handling various downstream tasks with a limited number of labels, consistently outperforming state-of-the-art GC methods.

[492] DPGIIL: Dirichlet Process-Deep Generative Model-Integrated Incremental Learning for Clustering in Transmissibility-based Online Structural Anomaly Detection

Lin-Feng Mei, Wang-Ji Yan

Main category: cs.LG

TL;DR: Proposes DPGIIL, a novel clustering framework combining deep generative models and Dirichlet process mixture models for online structural anomaly detection, addressing limitations in cluster number determination, high-dimensional streaming data handling, and manual feature engineering.

Details

Motivation: Existing vibration-based clustering methods struggle with determining optimal cluster numbers, handling high-dimensional streaming data, and rely heavily on manually engineered features due to shallow structures.

Method: Combines deep generative models for representation learning with Dirichlet process mixture models for pattern identification. Uses variational Bayesian inference with a tighter lower bound for joint optimization, greedy split-merge scheme for accelerated inference, and incremental learning using DPMM statistics.

Result: DPGIIL demonstrates dynamic adaptability in three case studies, outperforming state-of-the-art approaches in both structural anomaly detection and clustering.

Conclusion: The proposed framework effectively detects anomalies by dynamically assigning data to new clusters while indicating different structural states, providing more operational information than traditional anomaly detectors.

Abstract: Clustering based on vibration responses, such as transmissibility functions (TFs), is promising in structural anomaly detection. However, most existing methods struggle to determine the optimal cluster number, handle high-dimensional streaming data, and rely heavily on manually engineered features due to their shallow structures. To address these issues, this work proposes a novel clustering framework, referred to as Dirichlet process-deep generative model-integrated incremental learning (DPGIIL), for online structural anomaly detection, which combines the advantages of deep generative models (DGMs) in representation learning and the Dirichlet process mixture model (DPMM) in identifying distinct patterns in observed data. Within the context of variational Bayesian inference, a lower bound on the log marginal likelihood of DPGIIL, tighter than the evidence lower bound, is derived analytically, which enables the joint optimization of DGM and DPMM parameters, thereby allowing the DPMM to regularize the DGM’s feature extraction process. Additionally, a greedy split-merge scheme-based coordinate ascent variational inference method is devised to accelerate the optimization. The summary statistics of the DPMM, along with the network parameters, are used to retain information about previous data for incremental learning. For online structural anomaly detection, DPGIIL can not only detect anomalies by dynamically assigning incoming data to new clusters but also indicate different structural states using distinct clusters, thereby providing additional information about the operating conditions of the monitored structure compared to traditional anomaly detectors. Three case studies demonstrate the dynamic adaptability of the proposed method and show that it outperforms some state-of-the-art approaches in both structural anomaly detection and clustering.

[493] Real-Time Progress Prediction in Reasoning Language Models

Hans Peter Lynsgøe Raaschou-jensen, Constanza Fierro, Anders Søgaard

Main category: cs.LG

TL;DR: The paper proposes a method for real-time progress prediction in reasoning language models to address the opacity of long reasoning chains.

Details

Motivation: As reasoning models operate over extended time horizons, their internal progress becomes opaque to users, complicating expectation management and real-time oversight.

Method: Discretize progress and train a linear probe to classify reasoning states, then introduce a two-stage fine-tuning approach that enables reasoning models to generate progress estimates (0→100%) during inference.

Result: The best fine-tuned model achieves an average error of 10% for sequences less than 16,000 tokens.

Conclusion: This provides a practical mechanism for monitoring and interpreting model reasoning in real time.

Abstract: Recent advances in reasoning language models – particularly those that use long, latent chains of thought – have demonstrated remarkable capabilities in complex, agentic tasks. However, as these models operate over increasingly extended time horizons, their internal progress becomes opaque to users, complicating expectation management and real-time oversight. In this work, we investigate whether real-time progress prediction is feasible. We discretize progress and train a linear probe to classify reasoning states. We then introduce a two-stage fine-tuning approach that enables reasoning models to generate progress estimates (0$\rightarrow$100%) during inference. Our best fine-tuned model achieves an average error of 10% for sequences less than 16,000 tokens, offering a practical mechanism for monitoring and interpreting model reasoning in real time.

[494] Towards the Worst-case Robustness of Large Language Models

Huanran Chen, Yinpeng Dong, Zeming Wei, Hang Su, Jun Zhu

Main category: cs.LG

TL;DR: This paper analyzes the worst-case robustness of large language models against adversarial attacks, showing current deterministic defenses have near-zero robustness, and proposes theoretical lower bounds for stochastic defenses using knapsack solvers.

Details

Motivation: Large language models are vulnerable to adversarial attacks that can induce harmful outputs, and there's a need to understand and improve their worst-case robustness against such attacks.

Method: The authors use white-box attacks to upper bound worst-case robustness and propose a general tight lower bound for randomized smoothing using fractional and 0-1 knapsack solvers to bound the robustness of stochastic defenses.

Result: Most current deterministic defenses achieve nearly 0% worst-case robustness. The method certifies robustness for specific cases, such as uniform kernel smoothing, against any possible attack with average ℓ₀ perturbation of 2.02 or average suffix length of 6.41.

Conclusion: The paper provides theoretical foundations for certifying the robustness of stochastic defenses against adversarial attacks, demonstrating that current deterministic approaches are insufficient while offering provable guarantees for randomized methods.

Abstract: Recent studies have revealed the vulnerability of large language models to adversarial attacks, where adversaries craft specific input sequences to induce harmful, violent, private, or incorrect outputs. In this work, we study their worst-case robustness, i.e., whether an adversarial example exists that leads to such undesirable outputs. We upper bound the worst-case robustness using stronger white-box attacks, indicating that most current deterministic defenses achieve nearly 0% worst-case robustness. We propose a general tight lower bound for randomized smoothing using fractional knapsack solvers or 0-1 knapsack solvers, and using them to bound the worst-case robustness of all stochastic defenses. Based on these solvers, we provide theoretical lower bounds for several previous empirical defenses. For example, we certify the robustness of a specific case, smoothing using a uniform kernel, against \textit{any possible attack} with an average $\ell_0$ perturbation of 2.02 or an average suffix length of 6.41.

[495] Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization

Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang

Main category: cs.LG

TL;DR: The paper proposes an explainable AI framework using graph neural networks with structure-aware loss functions to predict compound-protein affinity from activity cliff pairs, enhancing both prediction accuracy and interpretability for drug discovery.

Details

Motivation: Address challenges in structure-activity relationship modeling including limited compound-protein interaction data and subtle molecular configuration changes that significantly affect properties, by leveraging activity cliff pairs that share scaffolds but have large potency differences.

Method: Implement graph neural networks with structure-aware loss functions using group lasso and sparse group lasso regularizations to prune and highlight molecular subgraphs relevant to activity differences, applied to activity cliff data targeting three Src proteins.

Result: Improved property prediction with reduced RMSE and improved Pearson’s correlation coefficient by integrating common and uncommon node information with sparse group lasso. Enhanced feature attribution with boosted graph-level global direction scores and improved atom-level coloring accuracy.

Conclusion: The framework strengthens model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization through enhanced explainability and prediction performance.

Abstract: Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable models for structure-activity relationship (SAR) modeling for compound property prediction faces many challenges, such as the limited number of compound-protein interaction activity data for specific protein targets, and plenty of subtle changes in molecular configuration sites significantly affecting molecular properties. We exploit pairs of molecules with activity cliffs that share scaffolds but differ at substituent sites, characterized by large potency differences for specific protein targets. We propose a framework by implementing graph neural networks (GNNs) to leverage property and structure information from activity cliff pairs to predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). To enhance model performance and explainability, we train GNNs with structure-aware loss functions using group lasso and sparse group lasso regularizations, which prune and highlight molecular subgraphs relevant to activity differences. We applied this framework to activity cliff data of molecules targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Our approach improved property prediction by integrating common and uncommon node information with sparse group lasso, as reflected in reduced root mean squared error (RMSE) and improved Pearson’s correlation coefficient (PCC). Applying regularizations also enhances feature attribution for GNN by boosting graph-level global direction scores and improving atom-level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization.

[496] Denoising Score Matching with Random Features: Insights on Diffusion Models from Precise Learning Curves

Anand Jerry George, Rodrigo Veiga, Nicolas Macris

Main category: cs.LG

TL;DR: The paper analyzes generalization and memorization in diffusion models, showing they depend on model complexity, dataset size, and noise samples per data point during training.

Details

Motivation: To understand the mechanisms behind generalization and memorization phenomena in diffusion models, which are influenced by model complexity, training dataset size, and the number of noise samples used during Denoising Score Matching.

Method: Theoretical analysis using random features neural networks to parameterize score functions, with target distribution as d-dimensional Gaussian. Analysis conducted in asymptotic regime where dimension d, data samples n, and features p go to infinity while keeping ratios ψ_n=n/d and ψ_p=p/d fixed.

Result: Derived asymptotically precise expressions for test and train errors of Denoising Score Matching, identifying regimes of generalization and memorization as functions of ψ_n, ψ_p, and m (noise samples per data point).

Conclusion: Theoretical findings are consistent with empirical observations, providing insights into the mechanisms governing generalization and memorization in diffusion models.

Abstract: We theoretically investigate the phenomena of generalization and memorization in diffusion models. Empirical studies suggest that these phenomena are influenced by model complexity and the size of the training dataset. In our experiments, we further observe that the number of noise samples per data sample ($m$) used during Denoising Score Matching (DSM) plays a significant and non-trivial role. We capture these behaviors and shed insights into their mechanisms by deriving asymptotically precise expressions for test and train errors of DSM under a simple theoretical setting. The score function is parameterized by random features neural networks, with the target distribution being $d$-dimensional Gaussian. We operate in a regime where the dimension $d$, number of data samples $n$, and number of features $p$ tend to infinity while keeping the ratios $\psi_n=\frac{n}{d}$ and $\psi_p=\frac{p}{d}$ fixed. By characterizing the test and train errors, we identify regimes of generalization and memorization as a function of $\psi_n,\psi_p$, and $m$. Our theoretical findings are consistent with the empirical observations.

[497] ExLLM: Experience-Enhanced LLM Optimization for Molecular Design and Beyond

Nian Ran, Yue Wang, Xiaoyuan Zhang, Zhongzheng Li, Qingsong Ran, Wenhao Li, Richard Allmendinger

Main category: cs.LG

TL;DR: ExLLM is an LLM-as-optimizer framework that introduces experience snippets, k-offspring exploration, and feedback adaptation to overcome limitations of traditional optimization methods in molecular design and other complex search spaces.

Details

Motivation: Traditional optimizers struggle with large irregular search spaces in molecular design, and existing LLM approaches lack mechanisms for handling complex feedback and maintaining scalable memory, leading to redundancy and poor outcomes in large-scale iterative search.

Method: Three key components: (1) compact evolving experience snippets that distill non-redundant cues, (2) k-offspring scheme for wider exploration per call, and (3) lightweight feedback adapter that normalizes objectives and formats constraints/expert hints.

Result: Sets new state-of-the-art results on PMO benchmark, achieves record performance on circle packing and stellarator design, and shows consistent gains across additional domains with minimal adaptation requirements.

Conclusion: ExLLM provides an effective LLM-based optimization framework that handles complex feedback, maintains scalable memory, and generalizes well across diverse domains using only task-description templates and evaluation functions.

Abstract: Molecular design involves an enormous and irregular search space, where traditional optimizers such as Bayesian optimization, genetic algorithms, and generative models struggle to leverage expert knowledge or handle complex feedback. Recently, LLMs have been used as optimizers, achieving promising results on benchmarks such as PMO. However, existing approaches rely only on prompting or extra training, without mechanisms to handle complex feedback or maintain scalable memory. In particular, the common practice of appending or summarizing experiences at every query leads to redundancy, degraded exploration, and ultimately poor final outcomes under large-scale iterative search. We introduce ExLLM (Experience-Enhanced LLM optimization), an LLM-as-optimizer framework with three components: (1) a compact, evolving experience snippet tailored to large discrete spaces that distills non-redundant cues and improves convergence at low cost; (2) a simple yet effective k-offspring scheme that widens exploration per call and reduces orchestration cost; and (3) a lightweight feedback adapter that normalizes objectives for selection while formatting constraints and expert hints for iteration. ExLLM sets new state-of-the-art results on PMO and generalizes strongly in our setup, it sets records on circle packing and stellarator design, and yields consistent gains across additional domains requiring only a task-description template and evaluation functions to transfer.

[498] Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Jan Wehner, Sahar Abdelnabi, Daniel Tan, David Krueger, Mario Fritz

Main category: cs.LG

TL;DR: RepE is a new approach for controlling LLMs by directly manipulating internal representations, offering advantages over traditional methods. This survey provides a unified framework and analysis of existing RepE methods.

Details

Motivation: To provide a comprehensive overview of Representation Engineering (RepE) for LLMs, addressing key questions about methods, applications, and comparisons with other approaches.

Method: Proposed a unified framework describing RepE as a pipeline with three stages: representation identification, operationalization, and control. Conducted systematic review of existing literature.

Result: Identified that RepE methods offer significant potential but face challenges including managing multiple concepts, ensuring reliability, and preserving model performance.

Conclusion: While RepE shows promise for effective, interpretable, and flexible LLM control, further improvements are needed. The paper provides guidance for best practices and identifies opportunities for methodological advancement.

Abstract: Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model’s internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models’ behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models’ performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.

[499] A Novel Collaborative Framework for Efficient Synchronization in Split Federated Learning over Wireless Networks

Haoran Gao, Samuel D. Okegbile, Jun Cai

Main category: cs.LG

TL;DR: CSFL introduces device-to-device collaboration in split federated learning to overcome synchronization bottlenecks in heterogeneous wireless networks, enabling efficient devices to help bottleneck devices complete their computations.

Details

Motivation: Traditional SFL suffers from straggler problems in heterogeneous wireless environments due to disparities in device capabilities and channel conditions, limiting efficiency and scalability.

Method: CSFL enables efficient devices to take over unfinished layers of bottleneck devices after completing their own forward propagation, supported by D2D communications for workload redistribution.

Result: CSFL significantly reduces training latency without compromising convergence speed or accuracy, demonstrating collaboration as key for synchronization-efficient learning.

Conclusion: Device-to-device collaboration through CSFL framework enables synchronization-efficient learning in next-generation wireless networks, overcoming limitations of traditional SFL approaches.

Abstract: Split Federated Learning (SFL) offers a promising approach for distributed model training in wireless networks, combining the layer-partitioning advantages of split learning with the federated aggregation that ensures global convergence. However, in heterogeneous wireless environments, disparities in device capabilities and channel conditions make strict round-based synchronization heavily straggler-dominated, thereby limiting both efficiency and scalability. To address this challenge, we propose a new framework, called Collaborative Split Federated Learning (CSFL), that redefines workload redistribution through device-to-device collaboration. Building on the flexibility of model partitioning, CSFL enables efficient devices, after completing their own forward propagation, to seamlessly take over the unfinished layers of bottleneck devices. This collaborative process, supported by D2D communications, allows bottleneck devices to offload computation earlier while maintaining synchronized progression across the network. Beyond the system design, we highlight key technical enablers such as privacy protection, multi-perspective matching, and incentive mechanisms, and discuss practical challenges including matching balance, privacy risks, and incentive sustainability. A case study demonstrates that CSFL significantly reduces training latency without compromising convergence speed or accuracy, underscoring collaboration as a key enabler for synchronization-efficient learning in next-generation wireless networks.

[500] CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment

Guofu Xie, Yunsheng Shi, Hongtao Tian, Ting Yao, Xiao Zhang

Main category: cs.LG

TL;DR: CAPO introduces a credit assignment method using LLMs as Generative Process Reward Models to provide deterministic token-level feedback, overcoming limitations of coarse-grained RLVR rewards and unreliable Process Reward Models.

Details

Motivation: Current RLVR methods use coarse-grained binary rewards that hamper precise credit assignment, while Process Reward Models require expensive supervision and produce unreliable probabilistic feedback. There's a need for efficient, verifiable step-wise credit assignment.

Method: CAPO uses an off-the-shelf LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate step-wise critiques in one pass, providing deterministic token-level credits. It employs voting mechanisms for accuracy and robustness.

Result: CAPO consistently outperforms supervised learning and RL-based fine-tuning methods across four mathematical benchmarks and three out-of-domain benchmarks, helping models learn correct reasoning pathways.

Conclusion: CAPO provides an efficient solution for precise credit assignment in RL, enabling models to identify which reasoning steps lead to success or failure without requiring expensive process supervision.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. However, current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure, and often results in suboptimal policies. Methods like PPO provide credit assignment by value estimation, but yield inaccurate and unverifiable signals due to limited sampling. On the other hand, methods using Process Reward Models can provide step-wise rewards but suffer from several key limitations: they require high-quality process supervision labels, the feedback is unreliable due to probabilistic reward modeling, and their application in online reinforcement learning (RL) is time-consuming. To overcome these limitations, we introduce a simple but efficient method-Credit Assignment Policy Optimization (CAPO). Instead of training auxiliary models, CAPO directly leverages an off-the-shelf, general-purpose LLM as a Generative Process Reward Model (LLM-as-GenPRM) to generate all step-wise critique by one pass only based on the correctness of the step itself, providing deterministic token-level credits to refine the tokens that were originally assigned identical rule-based rewards. To further enhance the accuracy and robustness, we employ voting mechanisms that scale with the number of generated critiques. Extensive experiments on various backbones like Llama and Qwen models show that CAPO consistently outperforms supervised learning-based and RL-based fine-tuning methods across four challenging mathematical benchmarks and three out-of-domain benchmarks. Further analysis shows that CAPO can help the model to foster the learning of correct reasoning pathways leading to correct answers.

[501] Nonparametric Bellman Mappings for Value Iteration in Distributed Reinforcement Learning

Yuki Akiyama, Konstantinos Slavakis

Main category: cs.LG

TL;DR: Novel Bellman mappings for distributed reinforcement learning over arbitrary network topologies without centralized nodes, enabling transmission of both Q-functions and basis information via covariance matrices, achieving linear convergence and lower communication costs than existing methods.

Details

Motivation: To enable efficient distributed reinforcement learning in networks without centralized nodes, where existing approaches only exchange Q-functions and lack structural information sharing, leading to suboptimal performance and higher communication costs.

Method: Nonparametric Bellman mappings operating on Q-functions in reproducing kernel Hilbert spaces, with agents exchanging both Q-function estimates and basis information (covariance matrices) only with direct neighbors, using optimal learning rates determined by graph spectral properties.

Result: Linear convergence rates for both Q-function and covariance-matrix estimates regardless of network topology, achieving performance comparable to centralized approaches with lower cumulative communication costs than existing DRL schemes.

Conclusion: The proposed framework demonstrates that sharing basis information through covariance matrices accelerates learning and reduces overall communication costs, providing an effective distributed alternative to centralized reinforcement learning.

Abstract: This paper introduces novel Bellman mappings (B-Maps) for value iteration (VI) in distributed reinforcement learning (DRL), where agents are deployed over an undirected, connected graph/network with arbitrary topology – but without a centralized node, that is, a node capable of aggregating all data and performing computations. Each agent constructs a nonparametric B-Map from its private data, operating on Q-functions represented in a reproducing kernel Hilbert space, with flexibility in choosing the basis for their representation. Agents exchange their Q-function estimates only with direct neighbors, and unlike existing DRL approaches that restrict communication to Q-functions, the proposed framework also enables the transmission of basis information in the form of covariance matrices, thereby conveying additional structural details. Linear convergence rates are established for both Q-function and covariance-matrix estimates toward their consensus values, regardless of the network topology, with optimal learning rates determined by the ratio of the smallest positive eigenvalue (the graph’s Fiedler value) to the largest eigenvalue of the graph Laplacian matrix. A detailed performance analysis further shows that the proposed DRL framework effectively approximates the performance of a centralized node, had such a node existed. Numerical tests on two benchmark control problems confirm the effectiveness of the proposed nonparametric B-Maps relative to prior methods. Notably, the tests reveal a counter-intuitive outcome: although the framework involves richer information exchange – specifically through transmitting covariance matrices as basis information – it achieves the desired performance at a lower cumulative communication cost than existing DRL schemes, underscoring the critical role of sharing basis information in accelerating the learning process.

[502] Unveiling the Basin-Like Loss Landscape in Large Language Models

Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, Jun Zhu

Main category: cs.LG

TL;DR: Large language models develop ‘basins’ in their loss landscape where performance remains stable, with pre-training creating basic capability basins and fine-tuning forming specific capability basins. Adversarial fine-tuning follows worst-case directions that degrade capabilities.

Details

Motivation: To understand the loss landscape structure of LLMs and how model capabilities are preserved or degraded during fine-tuning, particularly examining the emergence of stability regions (basins) and their implications for model robustness.

Method: Analyzed the loss landscape of large language models across different scales, examining resilience to random perturbations, identified basic capability basins from pre-training and specific capability basins from fine-tuning, and studied worst-case directions and adversarial fine-tuning effects.

Result: Found that LLMs develop expansive stability regions (basins) where performance remains nearly identical, with pre-training creating basic capability basins and alignment fine-tuning forming specific capability basins. Adversarial fine-tuning moves along sharp worst-case directions that rapidly degrade capabilities.

Conclusion: The basin size bounds performance degradation during fine-tuning and guarantees model robustness to input perturbations, suggesting that enlarging basins is beneficial for maintaining model capabilities and stability.

Abstract: We discover the emergence of \textit{basins} in the loss landscape of large language models. As model scale increases, LLMs become progressively more resilient to random perturbations in the parameter space, giving rise to expansive stability regions where models exhibit nearly identical performance, but outside of which their capabilities collapse. We observe that pre-training creates a \textit{basic capability} basin, and subsequent alignment fine-tuning forms \textit{specific capability} basins (e.g., safety, math, coding). Thus, we argue that benign fine-tuning confined to the basin should preserve prior capabilities. Besides, we also analyze the loss landscape for worst-case directions, which is consistently sharp and detrimental. We find that adversarial fine-tuning moves along the nearly worst-case directions, thus rapidly degrading model capabilities. Finally, we provide a theoretical analysis demonstrating that the basin size bounds the performance degradation of any fine-tuning, including the adversarial ones, while also guaranteeing the model robustness w.r.t. input perturbations, suggesting the benefit of enlarging basins.

[503] Valid Inference with Imperfect Synthetic Data

Yewon Byun, Shantanu Gupta, Zachary C. Lipton, Rachel Leah Childers, Bryan Wilder

Main category: cs.LG

TL;DR: A new estimator using generalized method of moments to combine synthetic data from LLMs with real data while maintaining statistical validity.

Details

Motivation: There's increasing interest in using LLM-generated synthetic data (like survey responses) but unclear how to combine it with real data while maintaining statistical validity.

Method: Introduces a new estimator based on generalized method of moments that is hyperparameter-free and has strong theoretical guarantees.

Result: Found that interactions between moment residuals of synthetic and real data can greatly improve parameter estimates. Validated performance across computational social science tasks with large empirical gains.

Conclusion: The proposed estimator provides a principled way to combine synthetic and real data while maintaining statistical validity, with demonstrated improvements in estimation accuracy.

Abstract: Predictions and generations from large language models are increasingly being explored as an aid in limited data regimes, such as in computational social science and human subjects research. While prior technical work has mainly explored the potential to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (e.g., synthetic simulations), such as in responses to surveys. However, it remains unclear by what means practitioners can combine such data with real data and yet produce statistically valid conclusions upon them. In this paper, we introduce a new estimator based on generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address this challenge. Intriguingly, we find that interactions between the moment residuals of synthetic data and those of real data (i.e., when they are predictive of each other) can greatly improve estimates of the target parameter. We validate the finite-sample performance of our estimator across different tasks in computational social science applications, demonstrating large empirical gains.

[504] HoPE: Hybrid of Position Embedding for Long Context Vision-Language Models

Haoran Li, Yingjie Qin, Baoyuan Ou, Lai Xu, Ruiwen Xu

Main category: cs.LG

TL;DR: HoPE is a novel position embedding method that improves Vision-Language Models’ performance in long-context video scenarios through hybrid frequency allocation and dynamic temporal scaling.

Details

Motivation: Current VLMs struggle with long-context scenarios like long videos, and existing RoPE extensions for video lack theoretical analysis and fail to reliably capture semantic similarities over extended contexts.

Method: HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over long contexts and a dynamic temporal scaling mechanism for robust learning and flexible inference across different context lengths.

Result: Extensive experiments on four video benchmarks for long video understanding and retrieval tasks show that HoPE consistently outperforms existing methods.

Conclusion: HoPE effectively addresses the limitations of current multimodal RoPEs and significantly improves VLMs’ long-context capabilities in video tasks.

Abstract: Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in Large Language Models (LLMs), extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long contexts, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Our code is available at https://github.com/hrlics/HoPE.

[505] Dual Natural Gradient Descent for Scalable Training of Physics-Informed Neural Networks

Anas Jnini, Flavio Vella

Main category: cs.LG

TL;DR: D-NGD enables efficient natural-gradient training of PINNs by computing Gauss-Newton steps in residual space instead of parameter space, achieving massive scalability up to 12.8M parameters on a single GPU with significantly lower errors than first-order methods.

Details

Motivation: Natural-gradient methods accelerate PINN training but suffer from O(n³) complexity in parameter space, making them impractical for large networks. The authors aim to overcome this computational bottleneck.

Method: D-NGD computes Gauss-Newton steps in residual space rather than parameter space, uses geodesic-acceleration correction, and provides both direct and preconditioned conjugate-gradient solvers for different problem sizes.

Result: D-NGD scales to networks with 12.8M parameters, achieves 1-3 orders of magnitude lower final L² error than first-order methods (Adam, SGD) and quasi-Newton methods, and enables single-GPU natural-gradient PINN training at this scale.

Conclusion: D-NGD successfully overcomes the computational limitations of natural-gradient methods for PINNs, enabling efficient second-order optimization at unprecedented scales with superior accuracy.

Abstract: Natural-gradient methods markedly accelerate the training of Physics-Informed Neural Networks (PINNs), yet their Gauss–Newton update must be solved in the parameter space, incurring a prohibitive $O(n^3)$ time complexity, where $n$ is the number of network trainable weights. We show that exactly the same step can instead be formulated in a generally smaller residual space of size $m = \sum_{\gamma} N_{\gamma} d_{\gamma}$, where each residual class $\gamma$ (e.g. PDE interior, boundary, initial data) contributes $N_{\gamma}$ collocation points of output dimension $d_{\gamma}$. Building on this insight, we introduce \textit{Dual Natural Gradient Descent} (D-NGD). D-NGD computes the Gauss–Newton step in residual space, augments it with a geodesic-acceleration correction at negligible extra cost, and provides both a dense direct solver for modest $m$ and a Nystrom-preconditioned conjugate-gradient solver for larger $m$. Experimentally, D-NGD scales second-order PINN optimization to networks with up to 12.8 million parameters, delivers one- to three-order-of-magnitude lower final error $L^2$ than first-order methods (Adam, SGD) and quasi-Newton methods, and – crucially – enables natural-gradient training of PINNs at this scale on a single GPU.

[506] Learning where to learn: Training data distribution optimization for scientific machine learning

Nicolas Guerra, Nicholas H. Nelsen, Yunan Yang

Main category: cs.LG

TL;DR: This paper addresses the learning-where-to-learn problem in scientific machine learning, focusing on designing optimal training data distributions to minimize prediction error across various deployment regimes.

Details

Motivation: Models in scientific machine learning are often deployed with parameter values or boundary conditions different from training conditions, creating a need for training distributions that ensure robustness across deployment scenarios.

Method: The paper proposes two adaptive algorithms based on bilevel or alternating optimization in the space of probability measures, implemented using parametric distribution classes or nonparametric particle-based gradient flows.

Result: The optimized training distributions outperform nonadaptive designs, resulting in models with improved sample complexity and robustness to distribution shift.

Conclusion: This framework enables principled data acquisition for learning functions and solution operators of partial differential equations, unlocking potential for more robust scientific machine learning models.

Abstract: In scientific machine learning, models are routinely deployed with parameter values or boundary conditions far from those used in training. This paper studies the learning-where-to-learn problem of designing a training data distribution that minimizes average prediction error across a family of deployment regimes. A theoretical analysis shows how the training distribution shapes deployment accuracy. This motivates two adaptive algorithms based on bilevel or alternating optimization in the space of probability measures. Discretized implementations using parametric distribution classes or nonparametric particle-based gradient flows deliver optimized training distributions that outperform nonadaptive designs. Once trained, the resulting models exhibit improved sample complexity and robustness to distribution shift. This framework unlocks the potential of principled data acquisition for learning functions and solution operators of partial differential equations.

[507] On Task Vectors and Gradients

Luca Zhou, Daniele Solombrino, Donato Crisostomi, Maria Sofia Bucarelli, Giuseppe Alessio D’Inverno, Fabrizio Silvestri, Emanuele Rodolà

Main category: cs.LG

TL;DR: Task arithmetic works because task vectors approximate negative gradients of task losses, with first-epoch gradients dominating the finetuning trajectory, making single-epoch finetuning sufficient for effective model merging.

Details

Motivation: To provide a theoretical explanation for why task arithmetic works in model merging, as empirical success exists but lacks clear theoretical foundation.

Method: Established connection between task vectors and gradients, proved equivalence under gradient descent, bounded second-order error for feed-forward networks, and conducted empirical analysis across seven vision benchmarks.

Result: Task vectors from one epoch of finetuning are equivalent to negative gradients scaled by learning rate, and first-epoch gradients dominate finetuning trajectory in norm and direction.

Conclusion: Task arithmetic is a form of approximate multitask learning, early training dynamics are critical for model merging, and single-epoch finetuning often yields comparable performance to full convergence.

Abstract: Task arithmetic has emerged as a simple yet powerful technique for model merging, enabling the combination of multiple finetuned models into one. Despite its empirical success, a clear theoretical explanation of why and when it works is lacking. This paper provides a rigorous theoretical foundation for task arithmetic by establishing a connection between task vectors and gradients of the task losses. We show that under standard gradient descent, a task vector generated from one epoch of finetuning is exactly equivalent to the negative gradient of the loss, scaled by the learning rate. For the practical multi-epoch setting, we prove that this equivalence holds approximately, with a second-order error term that we explicitly bound for feed-forward networks. Our empirical analysis across seven vision benchmarks corroborates our theory, demonstrating that the first-epoch gradient dominates the finetuning trajectory in both norm and direction. A key implication is that merging models finetuned for only a single epoch often yields performance comparable to merging fully converged models. These findings reframe task arithmetic as a form of approximate multitask learning, providing a clear rationale for its effectiveness and highlighting the critical role of early training dynamics in model merging.

[508] Inference-Time Scaling of Discrete Diffusion Models via Importance Weighting and Optimal Proposal Design

Zijing Ou, Chinmay Pani, Yingzhen Li

Main category: cs.LG

TL;DR: A Sequential Monte Carlo framework for scalable inference-time control of discrete diffusion models through importance weighting and optimal proposal construction.

Details

Motivation: Real-world applications require generative processes to adhere to constraints, but existing discrete diffusion models lack effective inference-time control mechanisms.

Method: Proposes SMC framework with tractable importance weights for intermediate targets, characterizes optimal proposal, and develops two approximations: first-order gradient-based and amortised proposal trained to minimize log-variance of importance weights.

Result: Empirical results across synthetic tasks, language modelling, biology design, and text-to-image generation show enhanced controllability and sample quality.

Conclusion: SMC is an effective versatile recipe for scaling discrete diffusion models at inference time with improved constraint adherence.

Abstract: Discrete diffusion models have become highly effective across various domains. However, real-world applications often require the generative process to adhere to certain constraints. To this end, we propose a Sequential Monte Carlo (SMC) framework that enables scalable inference-time control of discrete diffusion models through principled importance weighting and optimal proposal construction. Specifically, our approach derives tractable importance weights for a range of intermediate targets and characterises the optimal proposal, for which we develop two practical approximations: a first-order gradient-based approximation and an amortised proposal trained to minimise the log-variance of the importance weights. Empirical results across synthetic tasks, language modelling, biology design, and text-to-image generation demonstrate that our framework enhances controllability and sample quality, highlighting the effectiveness of SMC as a versatile recipe for scaling discrete diffusion models at inference time.

[509] Rethinking Inter-LoRA Orthogonality in Adapter Merging: Insights from Orthogonal Monte Carlo Dropout

Andi Zhang, Xuan Ding, Haofan Wang, Steven McDonagh, Samuel Kaski

Main category: cs.LG

TL;DR: Orthogonal Monte Carlo Dropout enforces orthogonality in LoRA module merging to prevent interference, but empirical results show this doesn’t achieve semantic disentanglement.

Details

Motivation: To address interference between merged LoRA modules when combining different concepts (e.g., object + style), ensuring their semantic vectors remain orthogonal.

Method: Proposed Orthogonal Monte Carlo Dropout mechanism that enforces strict orthogonality between sparse semantic vectors during LoRA module combination without extra time complexity.

Result: While orthogonality prevents direct interference between merged modules, empirical analysis reveals it doesn’t lead to semantic disentanglement as previously claimed in compositional adaptation literature.

Conclusion: Inter-LoRA orthogonality alone is insufficient for achieving true semantic compositionality, suggesting need to re-examine its role in adapter merging.

Abstract: We propose Orthogonal Monte Carlo Dropout, a mechanism that enforces strict orthogonality when combining sparse semantic vectors without extra time complexity. Low-Rank Adaptation (LoRA), a popular fine-tuning method for large models, typically trains a module to represent a specific concept such as an object or a style. When multiple LoRA modules are merged, for example to generate an object in a particular style, their outputs (semantic vectors) may interfere with each other. Our method guarantees that merged LoRA modules remain orthogonal and thus free from direct interference. However, empirical analysis reveals that such orthogonality does not lead to the semantic disentanglement highlighted in prior work on compositional adaptation. This finding suggests that inter-LoRA orthogonality alone may be insufficient for achieving true semantic compositionality, prompting a re-examination of its role in adapter merging.

[510] AMBER: Adaptive Mesh Generation by Iterative Mesh Resolution Prediction

Niklas Freymuth, Tobias Würth, Nicolas Schreiber, Balazs Gyenes, Andreas Boltres, Johannes Mitsch, Aleksandar Taranovic, Tai Hoang, Philipp Dahlinger, Philipp Becker, Luise Kärger, Gerhard Neumann

Main category: cs.LG

TL;DR: AMBER is a supervised learning approach that uses hierarchical graph neural networks to predict adaptive mesh sizing fields, enabling automated mesh refinement without manual heuristics.

Details

Motivation: Adaptive meshing improves FEM efficiency but typically requires task-specific heuristics or manual expert design, which is cumbersome and limits automation.

Method: AMBER iteratively predicts sizing fields from coarse meshes using hierarchical graph neural networks, with data augmentation through automatic projection of expert labels onto generated data.

Result: AMBER generalizes to unseen geometries and outperforms multiple baselines including Graph/Convolutional Neural Networks and Reinforcement Learning approaches on 2D/3D datasets.

Conclusion: The proposed AMBER framework successfully automates adaptive meshing through supervised learning, demonstrating strong generalization and outperforming existing methods across various physical problems and industrial designs.

Abstract: The cost and accuracy of simulating complex physical systems using the Finite Element Method (FEM) scales with the resolution of the underlying mesh. Adaptive meshes improve computational efficiency by refining resolution in critical regions, but typically require task-specific heuristics or cumbersome manual design by a human expert. We propose Adaptive Meshing By Expert Reconstruction (AMBER), a supervised learning approach to mesh adaptation. Starting from a coarse mesh, AMBER iteratively predicts the sizing field, i.e., a function mapping from the geometry to the local element size of the target mesh, and uses this prediction to produce a new intermediate mesh using an out-of-the-box mesh generator. This process is enabled through a hierarchical graph neural network, and relies on data augmentation by automatically projecting expert labels onto AMBER-generated data during training. We evaluate AMBER on 2D and 3D datasets, including classical physics problems, mechanical components, and real-world industrial designs with human expert meshes. AMBER generalizes to unseen geometries and consistently outperforms multiple recent baselines, including ones using Graph and Convolutional Neural Networks, and Reinforcement Learning-based approaches.

[511] Grounding the Ungrounded: A Spectral-Graph Framework for Quantifying Hallucinations in Multimodal LLMs

Supratik Sarkar, Swagatam Das

Main category: cs.LG

TL;DR: A rigorous information-geometric framework using diffusion dynamics to quantify and measure hallucinations in multimodal LLMs through spectral embeddings and semantic-distortion metrics.

Details

Motivation: Hallucinations in LLMs, especially in multimodal settings, undermine reliability and trustworthiness of model outputs.

Method: Embed model outputs spectrally on multimodal graph Laplacians, measure gaps to truth manifold using semantic-distortion metric, derive Courant-Fischer bounds on temperature-dependent hallucination energy, and use RKHS eigenmodes for modality-aware measures.

Result: Developed a principled framework that reframes hallucination as measurable and bounded phenomenon, providing interpretable measures that track evolution over prompts and time.

Conclusion: The framework provides a principled basis for evaluation and mitigation of hallucinations in MLLMs, making hallucination a quantifiable and bounded problem.

Abstract: Hallucinations in LLMs–especially in multimodal settings–undermine reliability. We present a rigorous, information-geometric framework in diffusion dynamics that quantifies hallucination in MLLMs: model outputs are embedded spectrally on multimodal graph Laplacians, and gaps to a truth manifold define a semantic-distortion metric. We derive Courant–Fischer bounds on a temperature-dependent hallucination energy and use RKHS eigenmodes to obtain modality-aware, interpretable measures that track evolution over prompts and time. This reframes hallucination as measurable and bounded, providing a principled basis for evaluation and mitigation.

[512] Longitudinal Flow Matching for Trajectory Modeling

Mohammad Mohaiminul Islam, Thijs P. Kuipers, Sharvaree Vadgama, Coen de Vente, Afsana Khan, Clara I. Sánchez, Erik J. Bekkers

Main category: cs.LG

TL;DR: IMMFM learns continuous stochastic dynamics consistent with multiple observed time points using piecewise-quadratic interpolation and joint optimization of drift and diffusion coefficients.

Details

Motivation: Generative models struggle with sparsely sampled, high-dimensional trajectories and typically reduce dynamics learning to pairwise transitions, which limits their effectiveness.

Method: Uses piecewise-quadratic interpolation paths as smooth targets for flow matching, jointly optimizes drift and data-driven diffusion coefficients with theoretical stability conditions.

Result: Outperforms existing methods in forecasting accuracy and downstream tasks on synthetic benchmarks and real-world longitudinal neuroimaging datasets.

Conclusion: IMMFM effectively captures intrinsic stochasticity, handles irregular sparse sampling, and generates subject-specific trajectories for sequential data.

Abstract: Generative models for sequential data often struggle with sparsely sampled and high-dimensional trajectories, typically reducing the learning of dynamics to pairwise transitions. We propose Interpolative Multi-Marginal Flow Matching (IMMFM), a framework that learns continuous stochastic dynamics jointly consistent with multiple observed time points. IMMFM employs a piecewise-quadratic interpolation path as a smooth target for flow matching and jointly optimizes drift and a data-driven diffusion coefficient, supported by a theoretical condition for stable learning. This design captures intrinsic stochasticity, handles irregular sparse sampling, and yields subject-specific trajectories. Experiments on synthetic benchmarks and real-world longitudinal neuroimaging datasets show that IMMFM outperforms existing methods in both forecasting accuracy and further downstream tasks.

[513] Adversarial Surrogate Risk Bounds for Binary Classification

Natalie S. Frank

Main category: cs.LG

TL;DR: This paper provides surrogate risk bounds to quantify the convergence rate of adversarial classification risk to its optimal value during adversarial training.

Details

Motivation: Previous work established conditions for adversarial consistency (minimizing adversarial surrogate risk also minimizes adversarial classification risk) but didn't address the convergence rate at which this occurs.

Method: The paper develops surrogate risk bounds that characterize how quickly the adversarial classification risk approaches its optimal value along sequences that minimize the adversarial surrogate risk.

Result: The paper provides theoretical bounds that quantify the convergence rate of adversarial classification risk during adversarial training.

Conclusion: These surrogate risk bounds address the gap in understanding the convergence rate of adversarial classification risk in adversarial training, complementing existing consistency results.

Abstract: A central concern in classification is the vulnerability of machine learning models to adversarial attacks. Adversarial training is one of the most popular techniques for training robust classifiers, which involves minimizing an adversarial surrogate risk. Recent work has characterized the conditions under which any sequence minimizing the adversarial surrogate risk also minimizes the adversarial classification risk in the binary setting, a property known as adversarial consistency. However, these results do not address the rate at which the adversarial classification risk approaches its optimal value along such a sequence. This paper provides surrogate risk bounds that quantify that convergence rate.

[514] Auto-Compressing Networks

Vaggelis Dorovatas, Georgios Paraskevopoulos, Alexandros Potamianos

Main category: cs.LG

TL;DR: Auto-Compressing Networks (ACNs) replace short residual connections with long feedforward connections from each layer to the output, enabling automatic information compression during training that reduces computational redundancy while maintaining performance.

Details

Motivation: Traditional deep networks with residual connections often introduce computational redundancy as depth increases without corresponding improvements in representation quality, motivating the search for more efficient architectures.

Method: ACNs use additive long feedforward connections from each layer directly to the output instead of short residual connections, creating layer-wise training patterns that dynamically compress information into early layers during gradient descent training.

Result: ACNs achieve up to 18% reduction in catastrophic forgetting, 30-80% architectural compression while maintaining accuracy, enhanced noise robustness, superior low-data performance, and improved transfer learning across vision transformers, MLP-mixers, and BERT architectures.

Conclusion: ACNs provide a practical approach for developing efficient neural architectures that automatically adapt computational footprint to task complexity while learning robust representations suitable for noisy real-world tasks and continual learning.

Abstract: Deep neural networks with short residual connections have demonstrated remarkable success across domains, but increasing depth often introduces computational redundancy without corresponding improvements in representation quality. We introduce Auto-Compressing Networks (ACNs), an architectural variant where additive long feedforward connections from each layer to the output replace traditional short residual connections. By analyzing the distinct dynamics induced by this modification, we reveal a unique property we coin as auto-compression, the ability of a network to organically compress information during training with gradient descent, through architectural design alone. Through auto-compression, information is dynamically “pushed” into early layers during training, enhancing their representational quality and revealing potential redundancy in deeper ones. We theoretically show that this property emerges from layer-wise training patterns present in ACNs, where layers are dynamically utilized during training based on task requirements. We also find that ACNs exhibit enhanced noise robustness compared to residual networks, superior performance in low-data settings, improved transfer learning capabilities, and mitigate catastrophic forgetting suggesting that they learn representations that generalize better despite using fewer parameters. Our results demonstrate up to 18% reduction in catastrophic forgetting and 30-80% architectural compression while maintaining accuracy across vision transformers, MLP-mixers, and BERT architectures. These findings establish ACNs as a practical approach to developing efficient neural architectures that automatically adapt their computational footprint to task complexity, while learning robust representations suitable for noisy real-world tasks and continual learning scenarios.

[515] SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

Jigang Fan, Zhenghong Zhou, Ruofan Jin, Le Cong, Mengdi Wang, Zaixi Zhang

Main category: cs.LG

TL;DR: SafeProtein is the first red-teaming framework for protein foundation models that combines multimodal prompt engineering and heuristic beam search to systematically test for biological safety risks, achieving up to 70% attack success rate on state-of-the-art models like ESM3.

Details

Motivation: The rapid advancement of protein foundation models has raised serious concerns about potential misuse in generating proteins with biological safety risks, but there's been a lack of systematic red-teaming to evaluate these security vulnerabilities.

Method: SafeProtein combines multimodal prompt engineering and heuristic beam search to systematically design red-teaming methods, along with a curated benchmark dataset (SafeProtein-Bench) and comprehensive evaluation protocol.

Result: The framework achieved continuous jailbreaks on state-of-the-art protein foundation models with up to 70% attack success rate for ESM3, revealing significant biological safety risks in current models.

Conclusion: SafeProtein successfully exposes potential biological safety risks in protein foundation models and provides insights for developing robust security protection technologies for frontier models.

Abstract: Proteins play crucial roles in almost all biological processes. The advancement of deep learning has greatly accelerated the development of protein foundation models, leading to significant successes in protein understanding and design. However, the lack of systematic red-teaming for these models has raised serious concerns about their potential misuse, such as generating proteins with biological safety risks. This paper introduces SafeProtein, the first red-teaming framework designed for protein foundation models to the best of our knowledge. SafeProtein combines multimodal prompt engineering and heuristic beam search to systematically design red-teaming methods and conduct tests on protein foundation models. We also curated SafeProtein-Bench, which includes a manually constructed red-teaming benchmark dataset and a comprehensive evaluation protocol. SafeProtein achieved continuous jailbreaks on state-of-the-art protein foundation models (up to 70% attack success rate for ESM3), revealing potential biological safety risks in current protein foundation models and providing insights for the development of robust security protection technologies for frontier models. The codes will be made publicly available at https://github.com/jigang-fan/SafeProtein.

[516] On the necessity of adaptive regularisation:Optimal anytime online learning on $\boldsymbol{\ell_p}$-balls

Emmeran Johnson, David Martínez-Rubio, Ciara Pike-Burke, Patrick Rebeschini

Main category: cs.LG

TL;DR: FTRL with time-varying regularization is anytime optimal for online convex optimization on ℓ_p-balls (p>2), but fixed regularization cannot achieve optimality across both high-dimensional (d>T) and low-dimensional (d≤T) regimes.

Details

Motivation: To understand whether fixed non-adaptive regularization can achieve anytime optimal regret for online convex optimization across different dimension regimes, particularly the shift in optimal regret behavior between high-dimensional (d>T) and low-dimensional (d≤T) settings.

Method: Analysis of Follow-the-Regularised-Leader (FTRL) with both time-varying adaptive regularization and fixed non-adaptive regularization on ℓ_p-balls for p>2, examining separable regularizers and their performance across dimension regimes.

Result: FTRL with time-varying regularization adaptive to dimension regime is anytime optimal. However, for separable regularizers, adaptivity is necessary - any fixed regularizer will be sub-optimal in one of the two dimension regimes. Lower bounds show no sub-linear regret is possible for linear bandits on ℓ_p-balls (p≥1) in sufficiently high dimensions.

Conclusion: Adaptive regularization is essential for achieving anytime optimal regret across different dimension regimes in online convex optimization on ℓ_p-balls, while fixed regularization cannot simultaneously achieve optimal performance in both high and low-dimensional settings.

Abstract: We study online convex optimization on $\ell_p$-balls in $\mathbb{R}^d$ for $p > 2$. While always sub-linear, the optimal regret exhibits a shift between the high-dimensional setting ($d > T$), when the dimension $d$ is greater than the time horizon $T$ and the low-dimensional setting ($d \leq T$). We show that Follow-the-Regularised-Leader (FTRL) with time-varying regularisation which is adaptive to the dimension regime is anytime optimal for all dimension regimes. Motivated by this, we ask whether it is possible to obtain anytime optimality of FTRL with fixed non-adaptive regularisation. Our main result establishes that for separable regularisers, adaptivity in the regulariser is necessary, and that any fixed regulariser will be sub-optimal in one of the two dimension regimes. Finally, we provide lower bounds which rule out sub-linear regret bounds for the linear bandit problem in sufficiently high-dimension for all $\ell_p$-balls with $p \geq 1$.

[517] A Minimalist Bayesian Framework for Stochastic Optimization

Kaizheng Wang

Main category: cs.LG

TL;DR: A minimalist Bayesian framework that places priors only on parameters of interest (like optimum location) while eliminating nuisance parameters via profile likelihood, with applications to Thompson sampling and structured optimization problems.

Details

Motivation: Traditional Bayesian methods require probabilistic models for all parameters, which hinders incorporation of complex structural constraints in sequential decision-making under uncertainty.

Method: Introduces a minimalist Bayesian framework using profile likelihood to eliminate nuisance parameters, and develops MINimalist Thompson Sampling (MINTS) algorithm as an instantiation.

Result: The framework accommodates structured problems (Lipschitz bandits, dynamic pricing), provides probabilistic interpretation of classical convex optimization methods, and achieves near-optimal regret guarantees for multi-armed bandits.

Conclusion: The minimalist Bayesian approach successfully bridges Bayesian principles with structural constraints, offering a flexible framework for sequential decision-making while maintaining theoretical guarantees.

Abstract: The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the component of interest, such as the location of the optimum. Nuisance parameters are eliminated via profile likelihood, which naturally handles constraints. As a direct instantiation, we develop a MINimalist Thompson Sampling (MINTS) algorithm. Our framework accommodates structured problems, including continuum-armed Lipschitz bandits and dynamic pricing. It also provides a probabilistic lens on classical convex optimization algorithms such as the center of gravity and ellipsoid methods. We further analyze MINTS for multi-armed bandits and establish near-optimal regret guarantees.

[518] Distributional Machine Unlearning via Selective Data Removal

Youssef Allouah, Rachid Guerraoui, Sanmi Koyejo

Main category: cs.LG

TL;DR: Distributional unlearning framework that selectively removes a small subset of data to efficiently forget unwanted domains while preserving desired ones, achieving strong unlearning effects with 15-82% less deletion than full removal.

Details

Motivation: Machine learning systems need to remove entire domains of information (e.g., toxic language, biases) rather than individual user data, but full removal is computationally expensive while random partial removal is statistically inefficient.

Method: Proposes distributional unlearning framework using Kullback-Leibler divergence constraints to select a small subset that balances forgetting unwanted distribution while preserving desired one. Uses distance-based selection algorithm that is quadratically more sample-efficient than random removal.

Result: Method requires 15-82% less deletion than full removal for strong unlearning effects, such as halving initial forget set accuracy. Experiments on synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) demonstrate effectiveness.

Conclusion: A small forget set often suffices for effective domain unlearning, laying foundations for more scalable and rigorous subpopulation unlearning approaches.

Abstract: Machine learning systems increasingly face requirements to remove entire domains of information – such as toxic language or biases – rather than individual user data. This task presents a dilemma: full removal of the unwanted domain data is computationally expensive, while random partial removal is statistically inefficient. We find that a domain’s statistical influence is often concentrated in a small subset of its data samples, suggesting a path between ineffective partial removal and unnecessary complete removal. We formalize this as distributional unlearning: a framework to select a small subset that balances forgetting an unwanted distribution while preserving a desired one. Using Kullback-Leibler divergence constraints, we derive the exact removal-preservation Pareto frontier for exponential families and prove that models trained on the edited data achieve corresponding log-loss bounds. We propose a distance-based selection algorithm and show it is quadratically more sample-efficient than random removal in the challenging low-divergence regime. Experiments across synthetic, text, and image datasets (Jigsaw, CIFAR-10, SMS spam) show our method requires 15-82% less deletion than full removal for strong unlearning effects, e.g., halving initial forget set accuracy. Ultimately, by showing a small forget set often suffices, our framework lays the foundations for more scalable and rigorous subpopulation unlearning.

Ainhize Barrainkua, Giovanni De Toni, Jose Antonio Lozano, Novi Quadrianto

Main category: cs.LG

TL;DR: This paper addresses fairness in algorithmic recourse, linking it to classification fairness, and introduces a social burden-based framework with MISOB algorithm that reduces social burden across groups while maintaining classifier accuracy.

Details

Motivation: Emerging legislation requires classifiers to provide actionable recourse for negative decisions, but concerns exist about fairness guarantees in the recourse process itself.

Method: Theoretical characterization of unfairness in recourse, linking fairness in recourse and classification, and introducing MISOB algorithm based on social burden framework.

Result: Empirical results show MISOB reduces social burden across all groups without compromising overall classifier accuracy on real-world datasets.

Conclusion: The paper provides a comprehensive framework for fair algorithmic recourse that addresses limitations of equal cost paradigms and demonstrates practical effectiveness through MISOB.

Abstract: Machine learning based predictions are increasingly used in sensitive decision-making applications that directly affect our lives. This has led to extensive research into ensuring the fairness of classifiers. Beyond just fair classification, emerging legislation now mandates that when a classifier delivers a negative decision, it must also offer actionable steps an individual can take to reverse that outcome. This concept is known as algorithmic recourse. Nevertheless, many researchers have expressed concerns about the fairness guarantees within the recourse process itself. In this work, we provide a holistic theoretical characterization of unfairness in algorithmic recourse, formally linking fairness guarantees in recourse and classification, and highlighting limitations of the standard equal cost paradigm. We then introduce a novel fairness framework based on social burden, along with a practical algorithm (MISOB), broadly applicable under real-world conditions. Empirical results on real-world datasets show that MISOB reduces the social burden across all groups without compromising overall classifier accuracy.

[520] Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng

Main category: cs.LG

TL;DR: AIGB-Pearl is a novel auto-bidding method that combines generative planning with policy optimization to overcome limitations of existing AI-Generated Bidding methods by enabling safe exploration beyond offline datasets.

Details

Motivation: Existing AI-Generated Bidding methods face performance bottlenecks because they cannot explore beyond static offline datasets, limiting their potential for improved advertising performance.

Method: AIGB-Pearl integrates generative planning and policy optimization by constructing a trajectory evaluator to score generation quality and designing a KL-Lipschitz-constrained score maximization scheme for safe exploration. It uses synchronous coupling technique to ensure model regularity.

Result: Extensive experiments on both simulated and real-world advertising systems demonstrate state-of-the-art performance compared to existing methods.

Conclusion: AIGB-Pearl successfully addresses the exploration limitation in offline auto-bidding methods and achieves superior performance through its novel integration of generative planning and policy optimization with safe exploration guarantees.

Abstract: Auto-bidding serves as a critical tool for advertisers to improve their advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static offline dataset. To address this, we propose {AIGB-Pearl} (\emph{{P}lanning with {E}valu{A}tor via RL}), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator for scoring generation quality and designing a provably sound KL-Lipschitz-constrained score maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm incorporating the synchronous coupling technique is further devised to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.

[521] IPR: Intelligent Prompt Routing with User-Controlled Quality-Cost Trade-offs

Aosong Feng, Balasubramaniam Srinivasan, Yun Zhou, Zhichao Xu, Kang Zhou, Sheng Guan, Yueyan Chen, Xian Wu, Ninad Kulkarni, Yi Zhang, Zhengyuan Shen, Dmitriy Bespalov, Soumya Smruti Mishra, Yifei Teng, Darren Yow-Bang Wang, Haibo Ding, Lin Lee Cheong

Main category: cs.LG

TL;DR: IPR is an intelligent prompt routing framework that dynamically selects optimal LLMs based on predicted response quality and user-specified tolerance levels, achieving 43.9% cost reduction while maintaining quality parity.

Details

Motivation: To solve the fundamental challenge of routing queries to the most cost-effective LLM while maintaining response quality in large-scale commercial systems.

Method: Uses lightweight quality estimators trained on 1.5M prompts, user-controlled routing with tolerance parameter τ, and extensible design with frozen encoders and model-specific adapters.

Result: Achieves 43.9% cost reduction while maintaining quality parity with the strongest Claude model, with sub-150ms latency on a major cloud platform.

Conclusion: IPR provides an effective solution for optimizing performance-cost trade-offs in LLM deployment, with rapid model integration and explicit user control over quality-cost balance.

Abstract: Routing incoming queries to the most cost-effective LLM while maintaining response quality poses a fundamental challenge in optimizing performance-cost trade-offs for large-scale commercial systems. We present IPR, – ,a quality-constrained \textbf{I}ntelligent \textbf{P}rompt \textbf{R}outing framework that dynamically selects optimal models based on predicted response quality and user-specified tolerance levels. IPR introduces three key innovations: (1) a modular architecture with lightweight quality estimators trained on 1.5M prompts annotated with calibrated quality scores, enabling fine-grained quality prediction across model families; (2) a user-controlled routing mechanism with tolerance parameter $\tau \in [0,1]$ that provides explicit control over quality-cost trade-offs; and (3) an extensible design using frozen encoders with model-specific adapters, reducing new model integration from days to hours. To rigorously train and evaluate IPR, we curate an industrial-level dataset IPRBench\footnote{IPRBench will be released upon legal approval.}, a comprehensive benchmark containing 1.5 million examples with response quality annotations across 11 LLM candidates. Deployed on a major cloud platform, IPR achieves 43.9% cost reduction while maintaining quality parity with the strongest model in the Claude family and processes requests with sub-150ms latency. The deployed system and additional product details are publicly available at https://aws.amazon.com/bedrock/intelligent-prompt-routing/

[522] When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson

Main category: cs.LG

TL;DR: LLM-judged benchmarks have design flaws that can make rankings unreliable. The paper introduces two diagnostic tools to measure schema adherence and psychometric validity, revealing high unexplained variance and factor collapse in popular judges like DeepSeek-R1-32B.

Details

Motivation: To address the failure modes in LLM-judged benchmarks that produce noisy rankings despite high confidence, as these benchmarks lack tight objectives and verifiable constructions compared to ground-truth based benchmarks.

Method: Introduces two diagnostic mechanisms: schematic adherence (quantifies how much verdicts follow explicit evaluation schemas) and psychometric validity (aggregates internal consistency and discriminant validity to measure irreducible uncertainty). Applied these to Arena-Hard Auto benchmark.

Result: Found severe schema incoherence (90%+ unexplained variance for DeepSeek-R1-32B) and factor collapse (correlations >0.93 for most criteria). Showed that ELO-style aggregation masks genuine ranking uncertainty.

Conclusion: Highlights design failures undermining benchmark validity and provides principles for creating better-scoped, reliability-aware LLM-judged benchmarks. Code and dataset released for further research.

Abstract: LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth based benchmarks. We argue that without tight objectives and verifiable constructions, benchmark rankings can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge’s overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We released our code and dataset at https://github.com/penfever/judgment-to-noise

[523] P3D: Scalable Neural Surrogates for High-Resolution 3D Physics Simulations with Global Context

Benjamin Holzschuh, Georg Kohl, Florian Redinger, Nils Thuerey

Main category: cs.LG

TL;DR: A scalable framework for learning neural surrogates for high-resolution 3D physics simulations using a hybrid CNN-Transformer architecture that enables training on small patches and scaling to large domains.

Details

Motivation: To create efficient neural surrogates for high-resolution 3D physics simulations that can handle complex PDE dynamics while reducing memory and computational requirements.

Method: Hybrid CNN-Transformer backbone architecture that can be pretrained on small simulation patches and fused for global solutions, optionally guided by sequence-to-sequence models for long-range dependencies.

Result: Outperforms existing architectures in speed and accuracy, scales to high-resolution isotropic turbulence at 512^3 resolution, and successfully learns dynamics of 14 different PDE types in 3D.

Conclusion: The proposed framework provides an effective approach for scalable neural surrogates in 3D physics simulations, demonstrating versatility through both deterministic predictions and probabilistic sampling via diffusion models.

Abstract: We present a scalable framework for learning deterministic and probabilistic neural surrogates for high-resolution 3D physics simulations. We introduce a hybrid CNN-Transformer backbone architecture targeted for 3D physics simulations, which significantly outperforms existing architectures in terms of speed and accuracy. Our proposed network can be pretrained on small patches of the simulation domain, which can be fused to obtain a global solution, optionally guided via a fast and scalable sequence-to-sequence model to include long-range dependencies. This setup allows for training large-scale models with reduced memory and compute requirements for high-resolution datasets. We evaluate our backbone architecture against a large set of baseline methods with the objective to simultaneously learn the dynamics of 14 different types of PDEs in 3D. We demonstrate how to scale our model to high-resolution isotropic turbulence with spatial resolutions of up to $512^3$. Finally, we demonstrate the versatility of our network by training it as a diffusion model to produce probabilistic samples of highly turbulent 3D channel flows across varying Reynolds numbers, accurately capturing the underlying flow statistics.

[524] GRPO is Secretly a Process Reward Model

Michael Sullivan

Main category: cs.LG

TL;DR: The paper proves GRPO RL algorithm induces a non-trivial process reward model (PRM) under certain assumptions, identifies a flaw in GRPO objective, proposes λ-GRPO modification, and shows improved performance over standard GRPO.

Details

Motivation: To question the advantage of costly, explicitly-defined PRMs for GRPO by leveraging the hidden, built-in PRM structure within vanilla GRPO algorithm to boost model performance with minimal training impact.

Method: Theoretical proof of GRPO inducing non-trivial PRM under token sequence overlap assumptions, empirical validation of assumptions, identification of GRPO objective flaw, and proposal of λ-GRPO modification to address non-uniform process step distribution issues.

Result: LLMs trained with λ-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, reaching peak performance more rapidly than standard GRPO-trained models.

Conclusion: It’s possible to leverage the hidden PRM structure within vanilla GRPO algorithm to boost performance with negligible impact on training time and cost, questioning the need for costly explicit PRMs.

Abstract: We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks$-$and reach peak performance more rapidly$-$than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.

[525] GenFacts-Generative Counterfactual Explanations for Multi-Variate Time Series

Sarah Seifi, Anass Ibrahimi, Tobias Sukianto, Cecilia Carbonelli, Lorenzo Servadei, Robert Wille

Main category: cs.LG

TL;DR: GenFacts is a generative framework that produces plausible and actionable counterfactual explanations for time series classifiers, outperforming baselines in plausibility and interpretability.

Details

Motivation: Existing counterfactual explanation methods for multivariate time series often lack validity, plausibility, or intuitive interpretability, limiting their practical usefulness.

Method: GenFacts introduces a structured approach to latent space modeling and targeted counterfactual synthesis for generating counterfactual explanations.

Result: GenFacts outperforms baseline methods by +18.7% in plausibility metrics and achieves highest interpretability scores in user studies across radar gesture recognition and handwritten letter trajectory datasets.

Conclusion: Realism and user-centered interpretability, rather than sparsity alone, are crucial for actionable counterfactuals in time series applications.

Abstract: Counterfactual explanations aim to enhance model transparency by illustrating how input modifications can change model predictions. In the multivariate time series domain, existing approaches often produce counterfactuals that lack validity, plausibility, or intuitive interpretability. We present \textbf{GenFacts}, a novel generative framework for producing plausible and actionable counterfactual explanations for time series classifiers. GenFacts introduces a structured approach to latent space modeling and targeted counterfactual synthesis. We evaluate GenFacts on radar gesture recognition as an industrial use case and handwritten letter trajectories as an intuitive benchmark. Across both datasets, GenFacts consistently outperforms baseline methods in plausibility metrics (+18.7%) and achieves the highest interpretability scores in user studies. These results underscore that realism and user-centered interpretability, rather than sparsity alone, are vital for actionable counterfactuals in time series applications.

[526] Autonomy-Aware Clustering: When Local Decisions Supersede Global Prescriptions

Amber Srivastava, Salar Basiri, Srinivasa Salapaka

Main category: cs.LG

TL;DR: The paper introduces autonomy-aware clustering, a reinforcement learning framework that accounts for entity autonomy in clustering without requiring prior knowledge of autonomy forms. It integrates RL with Deterministic Annealing and uses a transformer-based Adaptive Distance Estimation Network to learn dependencies between entities and clusters.

Details

Motivation: Most clustering approaches assume passive entities that strictly conform to assigned groups, but in reality entities exhibit local autonomy that can reshape clustering outcomes, affecting cluster compositions, geometry, and cardinality with significant downstream effects.

Method: Combines reinforcement learning with Deterministic Annealing procedure that promotes exploration in early stages and exploitation later. Uses Adaptive Distance Estimation Network (ADEN) - a transformer-based attention model that learns dependencies between entities and cluster representatives within the RL loop.

Result: The framework achieves solutions close to ground truth with gap ~3-4%, while ignoring autonomy leads to substantially larger gaps (~35-40%). The approach closely aligns with underlying data dynamics even without explicit autonomy models.

Conclusion: Autonomy-aware clustering effectively accounts for entity autonomy in clustering problems, demonstrating significant performance improvements over traditional methods that ignore autonomy, with the framework being publicly available.

Abstract: Clustering arises in a wide range of problem formulations, yet most existing approaches assume that the entities under clustering are passive and strictly conform to their assigned groups. In reality, entities often exhibit local autonomy, overriding prescribed associations in ways not fully captured by feature representations. Such autonomy can substantially reshape clustering outcomes – altering cluster compositions, geometry, and cardinality – with significant downstream effects on inference and decision-making. We introduce autonomy-aware clustering, a reinforcement learning (RL) framework that learns and accounts for the influence of local autonomy without requiring prior knowledge of its form. Our approach integrates RL with a Deterministic Annealing (DA) procedure, where, to determine underlying clusters, DA naturally promotes exploration in early stages of annealing and transitions to exploitation later. We also show that the annealing procedure exhibits phase transitions that enable design of efficient annealing schedules. To further enhance adaptability, we propose the Adaptive Distance Estimation Network (ADEN), a transformer-based attention model that learns dependencies between entities and cluster representatives within the RL loop, accommodates variable-sized inputs and outputs, and enables knowledge transfer across diverse problem instances. Empirical results show that our framework closely aligns with underlying data dynamics: even without explicit autonomy models, it achieves solutions close to the ground truth (gap ~3-4%), whereas ignoring autonomy leads to substantially larger gaps (~35-40%). The code and data are publicly available at https://github.com/salar96/AutonomyAwareClustering.

[527] Closed-form $\ell_r$ norm scaling with data for overparameterized linear regression and diagonal linear networks under $\ell_p$ bias

Shuofeng Zhang, Ard Louis

Main category: cs.LG

TL;DR: This paper provides a unified characterization of parameter norm scaling for overparameterized linear regression with minimum-ℓ_p interpolators (p∈(1,2]), revealing data-dependent transition points and universal thresholds that separate saturating from growing norms.

Details

Motivation: To understand how different ℓ_r norms of minimum-ℓ_p interpolators scale with sample size in overparameterized linear regression, which is fundamental for understanding generalization behavior since many generalization proxies depend on these norms.

Method: Uses dual-ray analysis to reveal competition between signal spike and null coordinates in X⊤Y, yielding closed-form predictions for data-dependent transition points (elbow) and universal thresholds. Also studies diagonal linear networks by calibrating initialization scale to effective p via separable potential.

Result: Identifies a data-dependent transition point n_⋆ (elbow) and universal threshold r_⋆=2(p-1) that separates ℓ_r norms which plateau from those that continue growing. Shows DLNs inherit same elbow/threshold laws through effective p calibration.

Conclusion: Provides unified solution for scaling of all ℓ_r norms within r∈[1,p] under ℓ_p-biased interpolation, explaining which norms saturate vs increase. Suggests predictive power of generalization proxies depends sensitively on which ℓ_r norm is used.

Abstract: For overparameterized linear regression with isotropic Gaussian design and minimum-$\ell_p$ interpolator $p\in(1,2]$, we give a unified, high-probability characterization for the scaling of the family of parameter norms $ \{ \lVert \widehat{w_p} \rVert_r \}{r \in [1,p]} $ with sample size. We solve this basic, but unresolved question through a simple dual-ray analysis, which reveals a competition between a signal spike and a bulk of null coordinates in $X^\top Y$, yielding closed-form predictions for (i) a data-dependent transition $n\star$ (the “elbow”), and (ii) a universal threshold $r_\star=2(p-1)$ that separates $\lVert \widehat{w_p} \rVert_r$’s which plateau from those that continue to grow with an explicit exponent. This unified solution resolves the scaling of all $\ell_r$ norms within the family $r\in [1,p]$ under $\ell_p$-biased interpolation, and explains in one picture which norms saturate and which increase as $n$ grows. We then study diagonal linear networks (DLNs) trained by gradient descent. By calibrating the initialization scale $\alpha$ to an effective $p_{\mathrm{eff}}(\alpha)$ via the DLN separable potential, we show empirically that DLNs inherit the same elbow/threshold laws, providing a predictive bridge between explicit and implicit bias. Given that many generalization proxies depend on $\lVert \widehat {w_p} \rVert_r$, our results suggest that their predictive power will depend sensitively on which $l_r$ norm is used.

[528] Decipher the Modality Gap in Multimodal Contrastive Learning: From Convergent Representations to Pairwise Alignment

Lingjie Yi, Raphael Douady, Chao Chen

Main category: cs.LG

TL;DR: This paper provides the first theoretical framework analyzing the modality gap in multimodal contrastive learning, identifying dimension collapse as its fundamental cause and showing how it affects downstream performance through sample alignment.

Details

Motivation: Empirical evidence shows representations from different modalities occupy separate regions (modality gap), but inconsistent findings exist about how this gap influences downstream performance. The paper aims to understand what causes the modality gap and how it affects downstream tasks.

Method: The authors introduce a theoretical framework to analyze convergent optimal representations of multimodal contrastive learning and modality alignment. They prove convergence properties under different constraints: no constraint, cone constraint, and subspace constraint (where dimension collapse occurs).

Result: Without constraints or under cone constraint, the modality gap converges to zero. Under subspace constraint (dimension collapse), the gap converges to the smallest angle between the two hyperplanes. The paper identifies dimension collapse as the fundamental origin of modality gap and shows paired samples cannot be perfectly aligned under subspace constraint.

Conclusion: The modality gap affects downstream performance by influencing sample pair alignment. Perfect alignment can still be achieved through hyperplane rotation and shared space projection, providing theoretical solutions to mitigate the modality gap issue.

Abstract: Multimodal contrastive learning (MCL) aims to embed data from different modalities in a shared embedding space. However, empirical evidence shows that representations from different modalities occupy completely separate regions of embedding space, a phenomenon referred to as the modality gap. Moreover, experimental findings on how the size of the modality gap influences downstream performance are inconsistent. These observations raise two key questions: (1) What causes the modality gap? (2) How does it affect downstream tasks? To address these questions, this paper introduces the first theoretical framework for analyzing the convergent optimal representations of MCL and the modality alignment when training is optimized. Specifically, we prove that without any constraint or under the cone constraint, the modality gap converges to zero. Under the subspace constraint (i.e., representations of two modalities fall into two distinct hyperplanes due to dimension collapse), the modality gap converges to the smallest angle between the two hyperplanes. This result identifies \emph{dimension collapse} as the fundamental origin of the modality gap. Furthermore, our theorems demonstrate that paired samples cannot be perfectly aligned under the subspace constraint. The modality gap influences downstream performance by affecting the alignment between sample pairs. We prove that, in this case, perfect alignment between two modalities can still be achieved via two ways: hyperplane rotation and shared space projection.

[529] A Family of Kernelized Matrix Costs for Multiple-Output Mixture Neural Networks

Bo Hu, José C. Príncipe

Main category: cs.LG

TL;DR: This paper combines Mixture Density Networks with contrastive learning by using four kernelized matrix costs in Hilbert space for data density approximation.

Details

Motivation: To improve self-supervised and contrastive feature learning by integrating pairwise distance-based costs with mixture density modeling for better density approximation.

Method: Combines Mixture Density Networks with contrastive costs using four types of kernelized matrix costs in Hilbert space: scalar cost, vector-matrix cost, matrix-matrix cost (trace of Schur complement), and SVD cost (nuclear norm).

Result: Proposes a novel approach for learning multiple centers needed to define mixture densities through kernelized matrix costs in Hilbert space.

Conclusion: The integration of MDNs with contrastive learning using kernelized matrix costs provides an effective framework for data density approximation in self-supervised feature learning.

Abstract: Pairwise distance-based costs are crucial for self-supervised and contrastive feature learning. Mixture Density Networks (MDNs) are a widely used approach for generative models and density approximation, using neural networks to produce multiple centers that define a Gaussian mixture. By combining MDNs with contrastive costs, this paper proposes data density approximation using four types of kernelized matrix costs in the Hilbert space: the scalar cost, the vector-matrix cost, the matrix-matrix cost (the trace of Schur complement), and the SVD cost (the nuclear norm), for learning multiple centers required to define a mixture density.

[530] Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Ziyan Wang, Zheng Wang, Jie Fu, Xingwei Qu, Qi Cheng, Shengpu Tang, Minjia Zhang, Xiaoming Huo

Main category: cs.LG

TL;DR: SFPO introduces a slow-fast policy optimization framework that improves RL training stability and efficiency for LLM reasoning tasks through a three-stage decomposition approach.

Details

Motivation: On-policy RL algorithms like GRPO suffer from noisy gradients and unstable updates in early training due to low-quality rollouts, leading to inefficient exploration.

Method: SFPO decomposes each training step into three stages: fast trajectory of inner steps on the same batch, reposition mechanism to control off-policy drift, and slow correction, while preserving the original objective and rollout process.

Result: SFPO outperforms GRPO by up to 2.80 points on math reasoning benchmarks, achieves up to 4.93x fewer rollouts, and reduces wall-clock time by up to 4.19x to match GRPO’s best accuracy.

Conclusion: SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training while being plug-compatible with existing policy-gradient pipelines.

Abstract: Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points in average on math reasoning benchmarks. It also achieves up to 4.93\texttimes{} fewer rollouts and an up to 4.19\texttimes{} reduction in wall-clock time to match GRPO’s best accuracy.

[531] PolyKAN: A Polyhedral Analysis Framework for Provable and Approximately Optimal KAN Compression

Di Zhang

Main category: cs.LG

TL;DR: PolyKAN is a theoretical framework for compressing Kolmogorov-Arnold Networks (KANs) that provides formal guarantees on model size reduction and approximation error through polyhedral region merging and dynamic programming.

Details

Motivation: KANs offer enhanced interpretability over MLPs but suffer from parameter inefficiency, limiting their practical deployment. There is a need for compression methods with mathematical guarantees.

Method: Leverages the piecewise polynomial structure of KANs, formulates compression as polyhedral region merging, establishes polyhedral characterization, develops ε-equivalent compression theory, and designs a dynamic programming algorithm for near-optimal compression.

Result: PolyKAN achieves provably near-optimal compression with strict error control, and guaranteed global optimality for univariate spline functions.

Conclusion: This provides the first formal foundation for KAN compression with mathematical guarantees, enabling efficient deployment of interpretable neural architectures.

Abstract: Kolmogorov-Arnold Networks (KANs) have emerged as a promising alternative to traditional Multi-Layer Perceptrons (MLPs), offering enhanced interpretability and a solid mathematical foundation. However, their parameter efficiency remains a significant challenge for practical deployment. This paper introduces PolyKAN, a novel theoretical framework for KAN compression that provides formal guarantees on both model size reduction and approximation error. By leveraging the inherent piecewise polynomial structure of KANs, we formulate the compression problem as a polyhedral region merging task. We establish a rigorous polyhedral characterization of KANs, develop a complete theory of $\epsilon$-equivalent compression, and design a dynamic programming algorithm that achieves approximately optimal compression under specified error bounds. Our theoretical analysis demonstrates that PolyKAN achieves provably near-optimal compression while maintaining strict error control, with guaranteed global optimality for univariate spline functions. This framework provides the first formal foundation for KAN compression with mathematical guarantees, opening new directions for the efficient deployment of interpretable neural architectures.

[532] Guiding Mixture-of-Experts with Temporal Multimodal Interactions

Xing Han, Hsing-Huan Chung, Joydeep Ghosh, Paul Pu Liang, Suchi Saria

Main category: cs.LG

TL;DR: A novel MoE framework that uses temporal multimodal interaction dynamics to guide expert routing, improving performance and interpretability.

Details

Motivation: Current MoE routing mechanisms ignore time-varying interaction dynamics between modalities, limiting expert specialization and effective reasoning.

Method: Proposes a multimodal interaction-aware router that dispatches tokens to experts based on temporal interaction patterns, encouraging generalizable interaction-processing skills.

Result: Comprehensive experiments on multimodal benchmarks show enhanced performance and improved interpretability.

Conclusion: Leveraging temporal multimodal interactions improves both MoE design and performance, enabling better expert specialization and reasoning.

Abstract: Mixture-of-Experts (MoE) architectures have become pivotal for large-scale multimodal models. However, their routing mechanisms typically overlook the informative, time-varying interaction dynamics between modalities. This limitation hinders expert specialization, as the model cannot explicitly leverage intrinsic modality relationships for effective reasoning. To address this, we propose a novel framework that guides MoE routing using quantified temporal interaction. A multimodal interaction-aware router learns to dispatch tokens to experts based on the nature of their interactions. This dynamic routing encourages experts to acquire generalizable interaction-processing skills rather than merely learning task-specific features. Our framework builds on a new formulation of temporal multimodal interaction dynamics, which are used to guide expert routing. We first demonstrate that these temporal multimodal interactions reveal meaningful patterns across applications, and then show how they can be leveraged to improve both the design and performance of MoE-based models. Comprehensive experiments on challenging multimodal benchmarks validate our approach, demonstrating both enhanced performance and improved interpretability.

[533] The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives

Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo

Main category: cs.LG

TL;DR: A Bayesian IRL framework for auditing LLMs that addresses reward non-identifiability, provides uncertainty-aware diagnostics, and validates policy-level utility through RLHF.

Details

Motivation: LLMs' implicit objectives are dangerously opaque, making trustworthy alignment and auditing challenging. Existing IRL approaches either produce overconfident single estimates or fail to address fundamental ambiguity in reward inference.

Method: Leverages Bayesian IRL to recover distributions over objectives, enabling posterior contraction analysis, uncertainty-aware diagnostics for spurious shortcuts and OOD prompts, and validation through RLHF training.

Result: Successfully audited a detoxified LLM, yielding well-calibrated and interpretable objectives that strengthen alignment guarantees and achieve toxicity reductions comparable to ground-truth alignment.

Conclusion: Provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving toward more trustworthy and accountable AI.

Abstract: The objectives that Large Language Models (LLMs) implicitly optimize remain dangerously opaque, making trustworthy alignment and auditing a grand challenge. While Inverse Reinforcement Learning (IRL) can infer reward functions from behaviour, existing approaches either produce a single, overconfident reward estimate or fail to address the fundamental ambiguity of the task (non-identifiability). This paper introduces a principled auditing framework that re-frames reward inference from a simple estimation task to a comprehensive process for verification. Our framework leverages Bayesian IRL to not only recover a distribution over objectives but to enable three critical audit capabilities: (i) Quantifying and systematically reducing non-identifiability by demonstrating posterior contraction over sequential rounds of evidence; (ii) Providing actionable, uncertainty-aware diagnostics that expose spurious shortcuts and identify out-of-distribution prompts where the inferred objective cannot be trusted; and (iii) Validating policy-level utility by showing that the refined, low-uncertainty reward can be used directly in RLHF to achieve training dynamics and toxicity reductions comparable to the ground-truth alignment process. Empirically, our framework successfully audits a detoxified LLM, yielding a well-calibrated and interpretable objective that strengthens alignment guarantees. Overall, this work provides a practical toolkit for auditors, safety teams, and regulators to verify what LLMs are truly trying to achieve, moving us toward more trustworthy and accountable AI.

[534] Bayesian Distributional Models of Executive Functioning

Robert Kasumba, Zeyu Lu, Dom CP Marticorena, Mingyang Zhong, Paul Beggs, Anja Pahor, Geetha Ramani, Imani Goffney, Susanne M Jaeggi, Aaron R Seitz, Jacob R Gardner, Dennis L Barbour

Main category: cs.LG

TL;DR: DLVM and DALE outperform conventional IMLE in parameter estimation, especially with sparse data, and adaptive sampling further enhances efficiency in cognitive assessments.

Details

Motivation: To evaluate the performance of Distributional Latent Variable Models (DLVM) and Bayesian Distributional Active Learning (DALE) against conventional Independent Maximum Likelihood Estimation (IMLE) in cognitive assessment tasks.

Method: Used controlled simulations with known ground-truth parameters to compare DLVM (which integrates observations across multiple tasks and individuals) and DALE (adaptive sampling) against IMLE.

Result: DLVM consistently outperformed IMLE, especially with smaller data amounts, converging faster to accurate estimates. DALE outperformed random sampling and fixed test batteries, particularly in the first 80 trials.

Conclusion: Combining DLVM’s cross-task inference with DALE’s optimal adaptive sampling provides a principled basis for more efficient cognitive assessments.

Abstract: This study uses controlled simulations with known ground-truth parameters to evaluate how Distributional Latent Variable Models (DLVM) and Bayesian Distributional Active LEarning (DALE) perform in comparison to conventional Independent Maximum Likelihood Estimation (IMLE). DLVM integrates observations across multiple executive function tasks and individuals, allowing parameter estimation even under sparse or incomplete data conditions. DLVM consistently outperformed IMLE, especially under with smaller amounts of data, and converges faster to highly accurate estimates of the true distributions. In a second set of analyses, DALE adaptively guided sampling to maximize information gain, outperforming random sampling and fixed test batteries, particularly within the first 80 trials. These findings establish the advantages of combining DLVM’s cross-task inference with DALE’s optimal adaptive sampling, providing a principled basis for more efficient cognitive assessments.

[535] Multi-Agent Stage-wise Conservative Linear Bandits

Amirhoseein Afsharrad, Ahmadreza Moradipari, Sanjay Lall

Main category: cs.LG

TL;DR: A multi-agent networked bandit algorithm with stage-wise safety constraints that achieves near-optimal regret through collaborative learning while maintaining safety guarantees.

Details

Motivation: Real-world applications like recommendation systems require multiple agents to balance exploration-exploitation while avoiding catastrophic failures through safety constraints, particularly in distributed settings with local observations and communication constraints.

Method: Proposed MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm that alternates between action selection and consensus-building phases, where agents communicate only with immediate neighbors and ensure expected rewards are no less than (1-α) times a baseline policy.

Result: Achieves regret Õ(d/√N * √T * log(NT)/√log(1/|λ₂|)) with high probability, showing collaboration yields 1/√N improvement despite local communication, communication overhead grows logarithmically for well-connected networks, and safety constraints add only lower-order regret.

Conclusion: Distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks, demonstrating the feasibility of collaborative multi-agent systems with safety constraints.

Abstract: In many real-world applications such as recommendation systems, multiple learning agents must balance exploration and exploitation while maintaining safety guarantees to avoid catastrophic failures. We study the stochastic linear bandit problem in a multi-agent networked setting where agents must satisfy stage-wise conservative constraints. A network of $N$ agents collaboratively maximizes cumulative reward while ensuring that the expected reward at every round is no less than $(1-\alpha)$ times that of a baseline policy. Each agent observes local rewards with unknown parameters, but the network optimizes for the global parameter (average of local parameters). Agents communicate only with immediate neighbors, and each communication round incurs additional regret. We propose MA-SCLUCB (Multi-Agent Stage-wise Conservative Linear UCB), an episodic algorithm alternating between action selection and consensus-building phases. We prove that MA-SCLUCB achieves regret $\tilde{O}\left(\frac{d}{\sqrt{N}}\sqrt{T}\cdot\frac{\log(NT)}{\sqrt{\log(1/|\lambda_2|)}}\right)$ with high probability, where $d$ is the dimension, $T$ is the horizon, and $|\lambda_2|$ is the network’s second largest eigenvalue magnitude. Our analysis shows: (i) collaboration yields $\frac{1}{\sqrt{N}}$ improvement despite local communication, (ii) communication overhead grows only logarithmically for well-connected networks, and (iii) stage-wise safety adds only lower-order regret. Thus, distributed learning with safety guarantees achieves near-optimal performance in reasonably connected networks.

[536] On residual network depth

Benoit Dherin, Michael Munn

Main category: cs.LG

TL;DR: Deep residual networks behave like ensembles of shallower models, and increasing depth expands this implicit ensemble size. This creates a combinatorial explosion in output signal that explains the need for normalization layers.

Details

Motivation: To formally understand why depth is effective in residual architectures like ResNet and Transformer, and to explain the ensemble behavior intuition proposed by Veit et al. (2016).

Method: Developed an explicit analytical formula called the Residual Expansion Theorem that proves depth expansion is equivalent to ensemble size expansion, revealing hierarchical ensemble structure and combinatorial path growth.

Result: The analysis explains the historical necessity of normalization layers in deep models and provides a principled explanation for normalization-free techniques like SkipInit and Fixup. Scaling residual modules controls the combinatorial explosion and acts as implicit regularization.

Conclusion: Residual networks’ effectiveness comes from their ensemble-like behavior, where depth expansion increases implicit ensemble size. Scaling residual modules provides a principled solution to manage the combinatorial explosion and offers capacity control.

Abstract: Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal, explaining the historical necessity of normalization layers in training deep models. This insight offers a first principles explanation for the historical dependence on normalization layers and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network’s inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity controls that also implicitly regularizes the model’s complexity.

[537] Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Nevan Wichers, Aram Ebtekar, Ariana Azarbal, Victor Gillioz, Christine Ye, Emil Ryd, Neil Rathi, Henry Sleight, Alex Mallen, Fabien Roger, Samuel Marks

Main category: cs.LG

TL;DR: Inoculation Prompting (IP) prevents learning of undesired behaviors by explicitly requesting them in training prompts, reducing reward hacking and sycophancy without compromising desired capabilities.

Details

Motivation: Large language models often learn from imperfect oversight signals, leading to problematic behaviors like reward hacking and sycophancy. Improving oversight quality is expensive or infeasible, so methods are needed to improve behavior despite imperfect training signals.

Method: Inoculation Prompting (IP) modifies training prompts to explicitly request undesired behaviors. For example, to prevent reward hacking, prompts request code that only works on provided test cases but fails on other inputs. Prompts that more strongly elicit undesired behavior before fine-tuning are more effective inoculators.

Result: Across four settings, IP reduces learning of undesired behavior without substantially reducing learning of desired capabilities. The technique effectively controls how models generalize from fine-tuning.

Conclusion: IP is a simple yet effective way to prevent learning of undesired behaviors without substantially disrupting desired capabilities, providing a practical approach to improve model behavior despite imperfect training signals.

Abstract: Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities.

[538] Empirical Comparison of Membership Inference Attacks in Deep Transfer Learning

Yuxuan Bai, Gauri Pradhan, Marlon Tobaben, Antti Honkela

Main category: cs.LG

TL;DR: Comparison of diverse membership inference attacks (MIAs) in transfer learning settings shows no single MIA captures all privacy risks, with LiRA generally performing best but IHA being more effective on PatchCamelyon dataset in high data regime.

Details

Motivation: With the shift to transfer learning using foundation models, there's a need to comprehensively assess privacy risks through MIAs, as prior work only evaluated a limited subset of attacks.

Method: Compared performance of diverse membership inference attacks in transfer learning settings, analyzing attack efficacy across different training data sizes and datasets.

Result: Attack efficacy decreases with more training data for score-based MIAs. LiRA performs best in most scenarios, but IHA is more effective on PatchCamelyon dataset with large training data.

Conclusion: No single MIA captures all privacy risks in transfer learning models. Practitioners need to consider multiple attacks for comprehensive privacy evaluation, with LiRA generally recommended but IHA being better for specific datasets.

Abstract: With the emergence of powerful large-scale foundation models, the training paradigm is increasingly shifting from from-scratch training to transfer learning. This enables high utility training with small, domain-specific datasets typical in sensitive applications. Membership inference attacks (MIAs) provide an empirical estimate of the privacy leakage by machine learning models. Yet, prior assessments of MIAs against models fine-tuned with transfer learning rely on a small subset of possible attacks. We address this by comparing performance of diverse MIAs in transfer learning settings to help practitioners identify the most efficient attacks for privacy risk evaluation. We find that attack efficacy decreases with the increase in training data for score-based MIAs. We find that there is no one MIA which captures all privacy risks in models trained with transfer learning. While the Likelihood Ratio Attack (LiRA) demonstrates superior performance across most experimental scenarios, the Inverse Hessian Attack (IHA) proves to be more effective against models fine-tuned on PatchCamelyon dataset in high data regime.

[539] BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

Main category: cs.LG

TL;DR: BLISS is a lightweight data selection method for LLM pretraining that operates from scratch without external models, using bilevel optimization to estimate long-term influence of training samples.

Details

Motivation: Existing data selection methods rely on external pretrained models and overlook long-term impact due to high costs of full LLM pretraining.

Method: Uses a small proxy model and score model with bilevel optimization - upper level optimizes sample weights, lower level trains proxy model to convergence on weighted loss.

Result: Achieves 1.7x speedup in reaching same performance as state-of-the-art method under 1B model setting, with superior performance across multiple downstream tasks.

Conclusion: BLISS provides an effective data selection approach that works from scratch and accounts for long-term training impact, demonstrating significant efficiency gains.

Abstract: Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (\textbf{B}ileve\textbf{L} \textbf{I}nfluence \textbf{S}coring method for data \textbf{S}election): a lightweight data selection method that operates entirely \emph{from scratch}, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.

[540] Edit-Based Flow Matching for Temporal Point Processes

David Lüdke, Marten Lienen, Marcel Kollovieh, Stephan Günnemann

Main category: cs.LG

TL;DR: The paper introduces Edit Flow, a non-autoregressive diffusion model for temporal point processes that uses insert, delete, and substitute operations to transport noise to data, reducing edit operations during generation.

Details

Motivation: Existing autoregressive TPP models are limited by sequential sampling, while recent diffusion models use discrete Markov chains with insertions and deletions. This work aims to generalize this approach with more flexible edit operations.

Method: Proposes Edit Flow process that learns instantaneous edit rates (insert, delete, substitute) within a continuous-time Markov chain framework to transport noise to data.

Result: Empirical results show the model achieves generative flexibility in both unconditional and conditional generation tasks on benchmark TPPs.

Conclusion: The Edit Flow process provides a flexible and efficient non-autoregressive approach for temporal point processes that effectively reduces necessary edit operations during generation.

Abstract: Temporal point processes (TPPs) are a fundamental tool for modeling event sequences in continuous time, but most existing approaches rely on autoregressive parameterizations that are limited by their sequential sampling. Recent non-autoregressive, diffusion-style models mitigate these issues by jointly interpolating between noise and data through event insertions and deletions in a discrete Markov chain. In this work, we generalize this perspective and introduce an Edit Flow process for TPPs that transports noise to data via insert, delete, and substitute edit operations. By learning the instantaneous edit rates within a continuous-time Markov chain framework, we attain a flexible and efficient model that effectively reduces the total number of necessary edit operations during generation. Empirical results demonstrate the generative flexibility of our unconditionally trained model in a wide range of unconditional and conditional generation tasks on benchmark TPPs.

cs.MA

[541] R3R: Decentralized Multi-Agent Collision Avoidance with Infinite-Horizon Safety

Thomas Marshall Vielmetti, Devansh R. Agrawal, Dimitra Panagou

Main category: cs.MA

TL;DR: R3R is the first decentralized, asynchronous multi-agent motion planning framework with infinite-horizon safety guarantees under communication constraints, using a novel combination of gatekeeper safety and R-Boundedness geometric constraints.

Details

Motivation: Existing decentralized methods lack formal infinite-horizon safety guarantees, especially for communication-constrained multi-agent systems, creating a need for provably safe planning solutions.

Method: Combines gatekeeper safety framework with R-Boundedness geometric constraint, constraining trajectories within a fixed planning radius derived from communication radius, enabling provable safety using only local information in fully asynchronous settings.

Result: Validated in simulations with up to 128 Dubins vehicles, achieving 100% safety in dense, obstacle-rich scenarios, with performance scaling with agent density rather than problem size.

Conclusion: R3R provides a practical solution for scalable and provably safe multi-agent systems with formal infinite-horizon safety guarantees under communication constraints.

Abstract: Existing decentralized methods for multi-agent motion planning lack formal, infinite-horizon safety guarantees, especially for communication-constrained systems. We present R3R, to our knowledge the first decentralized and asynchronous framework for multi-agent motion planning under distance-based communication constraints with infinite-horizon safety guarantees for systems of nonlinear agents. R3R’s novelty lies in combining our gatekeeper safety framework with a geometric constraint called R-Boundedness, which together establish a formal link between an agent’s communication radius and its ability to plan safely. We constrain trajectories to within a fixed planning radius that is a function of the agent’s communication radius, which enables trajectories to be shown provably safe for all time, using only local information. Our algorithm is fully asynchronous, and ensures the forward invariance of these guarantees even in time-varying networks where agents asynchronously join, leave, and replan. We validate our approach in simulations of up to 128 Dubins vehicles, demonstrating 100% safety in dense, obstacle rich scenarios. Our results demonstrate that R3R’s performance scales with agent density rather than problem size, providing a practical solution for scalable and provably safe multi-agent systems.

[542] Generalizing Liquid Democracy to multi-agent delegation: A Voting Weight Measure and Equilibrium Analysis

Francisco M. Bersetche

Main category: cs.MA

TL;DR: The paper proposes a fractional delegation model for liquid democracy that allows agents to split their voting weight among multiple representatives while keeping some for themselves, with penalties for long delegation chains to ensure equilibrium existence.

Details

Motivation: To generalize classic liquid democracy by enabling fractional delegation while maintaining equilibrium states, addressing limitations of the classical model where pure strategy Nash equilibria may not exist.

Method: Introduces a penalty mechanism for delegation chain length and allows agents to partition and delegate votes to multiple representatives while retaining partial voting power.

Result: Demonstrates that the proposed game exhibits pure strategy Nash equilibria when delegation chain penalties are imposed, unlike the classical model.

Conclusion: Smaller penalty factors bring the model closer to satisfying desirable properties, and the fractional delegation approach with chain penalties enables equilibrium existence in liquid democracy.

Abstract: In this study, we propose a generalization of the classic model of liquid democracy that allows fractional delegation of voting weight, while simultaneously allowing for the existence of equilibrium states. Our approach empowers agents to partition and delegate their votes to multiple representatives, all while retaining a fraction of the voting power for themselves. We introduce a penalty mechanism for the length of delegation chains. We discuss the desirable properties of a reasonable generalization of the classic model, and prove that smaller penalty factors bring the model closer to satisfying these properties. In the subsequent section, we explore the presence of equilibrium states in a general delegation game utilizing the proposed voting measure. In contrast to the classical model, we demonstrate that this game exhibits pure strategy Nash equilibria, contingent upon the imposition of a penalty on the length of delegation chains.

[543] Dynamic Strategy Adaptation in Multi-Agent Environments with Large Language Models

Shaurya Mallampati, Rashed Shelim, Walid Saad, Naren Ramakrishnan

Main category: cs.MA

TL;DR: LLMs combined with game-theoretic reasoning and real-time adaptation achieve 26% improvement over PPO baselines in dynamic multi-agent cooperative environments.

Details

Motivation: To test LLM reasoning in dynamic, real-time multi-agent scenarios where agents continuously adapt to each other, unlike previous static or turn-based evaluations.

Method: Combines LLM-driven agents with strategic reasoning using game-theoretic principles (belief consistency, Nash equilibrium) and real-time adaptation mechanisms with strategy refinement and adaptive feedback.

Result: Achieves 26% improvement in return over PPO baselines in high-noise environments while maintaining real-time latency under 1.05 milliseconds. Improves collaboration efficiency, task completion rates, and flexibility.

Conclusion: Game-theoretic guidance integrated with real-time feedback enhances LLM performance, fostering more resilient and flexible strategic multi-agent systems.

Abstract: Large language models (LLMs) demonstrate strong reasoning abilities across mathematical, strategic, and linguistic tasks, yet little is known about how well they reason in dynamic, real-time, multi-agent scenarios, such as collaborative environments in which agents continuously adapt to each other’s behavior, as in cooperative gameplay settings. In this paper, we bridge this gap by combining LLM-driven agents with strategic reasoning and real-time adaptation in cooperative, multi-agent environments grounded in game-theoretic principles such as belief consistency and Nash equilibrium. The proposed framework applies broadly to dynamic scenarios in which agents coordinate, communicate, and make decisions in response to continuously changing conditions. We provide real-time strategy refinement and adaptive feedback mechanisms that enable agents to dynamically adjust policies based on immediate contextual interactions, in contrast to previous efforts that evaluate LLM capabilities in static or turn-based settings. Empirical results show that our method achieves up to a 26% improvement in return over PPO baselines in high-noise environments, while maintaining real-time latency under 1.05 milliseconds. Our approach improves collaboration efficiency, task completion rates, and flexibility, illustrating that game-theoretic guidance integrated with real-time feedback enhances LLM performance, ultimately fostering more resilient and flexible strategic multi-agent systems.

cs.MM

eess.AS

[544] Moises-Light: Resource-efficient Band-split U-Net For Music Source Separation

Yun-Ning, Hung, Igor Pereira, Filip Korzeniowski

Main category: eess.AS

TL;DR: Proposes Moises-Light, a lightweight music source separation model that achieves comparable performance to models with 13x more parameters through careful design.

Details

Motivation: Existing music source separation models are parameter-heavy and challenging for resource-constrained devices, while lightweight models perform worse than larger counterparts.

Method: Takes inspiration from recent advances (dual-path modeling, band-split modules, transformer layers) to improve a lightweight model with careful design.

Result: Moises-Light achieves competitive SDRs comparable to models with 13x more parameters on MUSDB-HQ benchmark for four musical stems, and shows competitive scalability with MoisesDB training data.

Conclusion: With careful design, lightweight models can achieve performance comparable to much larger models in music source separation tasks.

Abstract: In recent years, significant advances have been made in music source separation, with model architectures such as dual-path modeling, band-split modules, or transformer layers achieving comparably good results. However, these models often contain a significant number of parameters, posing challenges to devices with limited computational resources in terms of training and practical application. While some lightweight models have been introduced, they generally perform worse compared to their larger counterparts. In this paper, we take inspiration from these recent advances to improve a lightweight model. We demonstrate that with careful design, a lightweight model can achieve comparable SDRs to models with up to 13 times more parameters. Our proposed model, Moises-Light, achieves competitive results in separating four musical stems on the MUSDB-HQ benchmark dataset. The proposed model also demonstrates competitive scalability when using MoisesDB as additional training data.

[545] Towards Responsible Evaluation for Text-to-Speech

Yifan Yang, Hui Wang, Bing Han, Shujie Liu, Jinyu Li, Yong Qin, Xie Chen

Main category: eess.AS

TL;DR: This position paper introduces Responsible Evaluation for TTS systems, addressing current inadequate evaluation practices through three levels: accurate capability assessment, standardized benchmarks, and ethical risk mitigation.

Details

Motivation: Current TTS evaluation practices are inadequate for capturing full capabilities, limitations, and societal implications despite advances producing human-indistinguishable speech.

Method: Proposes Responsible Evaluation concept structured through three progressive levels: 1) accurate capability reflection with robust scoring, 2) comparability through standardized benchmarks, and 3) ethical risk assessment.

Result: Identifies systemic shortcomings in current evaluation practices and provides actionable recommendations for improvement.

Conclusion: Responsible Evaluation concept aims to foster trustworthy TTS technology and guide development toward ethically sound and societally beneficial applications.

Abstract: Recent advances in text-to-speech (TTS) technology have enabled systems to produce human-indistinguishable speech, bringing benefits across accessibility, content creation, and human-computer interaction. However, current evaluation practices are increasingly inadequate for capturing the full range of capabilities, limitations, and societal implications. This position paper introduces the concept of Responsible Evaluation and argues that it is essential and urgent for the next phase of TTS development, structured through three progressive levels: (1) ensuring the faithful and accurate reflection of a model’s true capabilities, with more robust, discriminative, and comprehensive objective and subjective scoring methodologies; (2) enabling comparability, standardization, and transferability through standardized benchmarks, transparent reporting, and transferable evaluation metrics; and (3) assessing and mitigating ethical risks associated with forgery, misuse, privacy violations, and security vulnerabilities. Through this concept, we critically examine current evaluation practices, identify systemic shortcomings, and propose actionable recommendations. We hope this concept of Responsible Evaluation will foster more trustworthy and reliable TTS technology and guide its development toward ethically sound and societally beneficial applications.

[546] Comparison of Speech Tasks in Human Expert and Machine Detection of Parkinson’s Disease

Peter Plantinga, Roozbeh Sattari, Karine Marcotte, Carla Di Gironimo, Madeleine Sharp, Liziane Bouvier, Maiya Geddes, Ingrid Verduyckt, Étienne de Villers-Sidani, Mirco Ravanelli, Denise Klein

Main category: eess.AS

TL;DR: This paper investigates how human experts detect Parkinson’s Disease from speech across five tasks and compares their performance with Whisper-based machine learning system.

Details

Motivation: To understand the factors human experts use to detect Parkinson's Disease from speech and compare their accuracy with machine learning approaches, especially for challenging patient subgroups.

Method: Conducted listening tests with clinicians to assess PD detection from audio alone, and developed a machine learning system based on Whisper for detection across five speech tasks: phonations, sentence repetition, reading, recall, and picture description.

Result: Whisper performs on par or better than human experts when only audio is available, particularly excelling with challenging subgroups: younger patients, mild cases, and female patients.

Conclusion: Whisper’s ability to recognize acoustic cues in difficult cases complements human experts’ multimodal and contextual strengths, suggesting potential for human-machine collaboration in PD detection from speech.

Abstract: The speech of people with Parkinson’s Disease (PD) has been shown to hold important clues about the presence and progression of the disease. We investigate the factors based on which humans experts make judgments of the presence of disease in speech samples over five different speech tasks: phonations, sentence repetition, reading, recall, and picture description. We make comparisons by conducting listening tests to determine clinicians accuracy at recognizing signs of PD from audio alone, and we conduct experiments with a machine learning system for detection based on Whisper. Across tasks, Whisper performs on par or better than human experts when only audio is available, especially on challenging but important subgroups of the data: younger patients, mild cases, and female patients. Whisper’s ability to recognize acoustic cues in difficult cases complements the multimodal and contextual strengths of human experts.

[547] Good practices for evaluation of synthesized speech

Erica Cooper, Sébastien Le Maguer, Esther Klabbers, Junichi Yamagishi

Main category: eess.AS

TL;DR: A guideline document for reviewers of speech synthesis papers, focusing on evaluation best practices and common pitfalls.

Details

Motivation: To provide structured guidance for reviewers evaluating speech synthesis papers, particularly in the area of evaluation methodology.

Method: Outlining best practices and common pitfalls for paper review, with emphasis on evaluation criteria and referencing author guidelines.

Result: Creation of a living document that serves as a review guideline, to be updated based on reader feedback.

Conclusion: This document offers guidance but emphasizes that reviewers should use their own discretion when evaluating papers.

Abstract: This document is provided as a guideline for reviewers of papers about speech synthesis. We outline some best practices and common pitfalls for papers about speech synthesis, with a particular focus on evaluation. We also recommend that reviewers check the guidelines for authors written in the paper kit and consider those as reviewing criteria as well. This is intended to be a living document, and it will be updated as we receive comments and feedback from readers. We note that this document is meant to provide guidance only, and that reviewers should ultimately use their own discretion when evaluating papers.

[548] Self-Supervised Speech Quality Assessment (S3QA): Leveraging Speech Foundation Models for a Scalable Speech Quality Metric

Mattson Ogg, Caitlyn Bishop, Han Yi, Sarah Robinson

Main category: eess.AS

TL;DR: S3QA is a self-supervised speech quality assessment model that automatically predicts speech degradation using pre-trained speech foundation models and transformer architecture, achieving alignment with human ratings and speech technology performance.

Details

Motivation: Human behavioral ratings (MOS) for speech quality assessment are labor-intensive, variable between raters, and difficult to generalize across corpora, limiting their scalability for quantifying diverse acoustic challenges.

Method: Used WavLM foundation model to create self-supervised training targets by computing cosine distances between clean and degraded speech embeddings. Trained transformer model to predict these distances from degraded speech only, using various acoustic degradations like filtering, reverberation, noise, and compression.

Result: S3QA accurately predicts degradation across diverse acoustic conditions, aligns with behavioral MOS ratings and ASR performance, and captures important data features like microphone distances on unseen test corpora.

Conclusion: S3QA provides an automated, scalable method for assessing speech quality across various acoustic challenges, overcoming limitations of human rating approaches.

Abstract: Methods for automatically assessing speech quality in real world environments are critical for developing robust human language technologies and assistive devices. Behavioral ratings provided by human raters (e.g., mean opinion scores; MOS) are considered the gold standard, but they are susceptible to variability between individual raters, cannot easily be generalized across corpora, and are labor-intensive to collect, thus limiting the acoustic challenges they can quantify. Here, we present a new, scalable method for automatically assessing speech quality: the self-supervised speech quality assessment (S3QA) model. First, we manipulated high quality utterances from multiple speech corpora, using a wide range of acoustic challenges intended to emulate common sources of quality degradation in the real-world: frequency filtering, reverberation, background noise, and digital compression. Second, we leveraged an existing, pre-trained speech foundation model, WavLM, to computationally derive a self-supervised training target that quantified speech degradation using the cosine distance between the clean and degraded versions of each utterance in the embedding space. Next, we trained a transformer-based model to predict these cosine distances, given only the degraded versions of the utterances. Finally, the trained model was evaluated on unseen test corpora of synthetic mixtures, NISQA, and VOiCES. We show that the S3QA model trained on this task accurately predicts degradation cosine distances across a wide range challenging acoustic conditions and is aligned with both behavioral ratings (MOS), speech technology performance (automatic speech recognition) and other important features of the held-out data (e.g., microphone distances). This model provides an automated, scalable method for assessing speech quality across a wide range of acoustic challenges.

[549] Enhancing Few-shot Keyword Spotting Performance through Pre-Trained Self-supervised Speech Models

Alican Gok, Oguzhan Buyuksolak, Osman Erman Okman, Murat Saraclar

Main category: eess.AS

TL;DR: A novel training scheme for Few-Shot Keyword Spotting that uses self-supervised learning, Sub-center ArcFace loss, and knowledge distillation to significantly improve accuracy on resource-constrained edge devices.

Details

Motivation: Traditional FS-KWS systems have subpar accuracy at desirable false acceptance rates, especially in resource-constrained edge environments where hands-free interaction is critical for battery-powered devices.

Method: Leverages self-supervised learning models (Wav2Vec 2.0 teacher model) with Sub-center ArcFace loss for robust feature extraction, uses attention-based dimensionality reduction, and trains a lightweight ResNet15 student model through knowledge distillation.

Result: Achieved 74.1% 10-shot classification accuracy on 11 classes at 1% false alarm rate on GSC dataset, improving from previous 33.4% - making it significantly better for real-world use cases.

Conclusion: The proposed training method effectively addresses FS-KWS challenges by combining self-supervised learning, advanced loss functions, and efficient model design for practical edge deployment.

Abstract: Keyword Spotting plays a critical role in enabling hands-free interaction for battery-powered edge devices. Few-Shot Keyword Spotting (FS-KWS) addresses the scalability and adaptability challenges of traditional systems by enabling recognition of custom keywords with only a few examples. However, existing FS-KWS systems achieve subpar accuracy at desirable false acceptance rates, particularly in resource-constrained edge environments. To address these issues, we propose a training scheme that leverages self-supervised learning models for robust feature extraction, dimensionality reduction, and knowledge distillation. The teacher model, based on Wav2Vec 2.0 is trained using Sub-center ArcFace loss, which enhances inter-class separability and intra-class compactness. To enable efficient deployment on edge devices, we introduce attention-based dimensionality reduction and train a standard lightweight ResNet15 student model. We evaluate the proposed approach on the English portion of the Multilingual Spoken Words Corpus (MSWC) and the Google Speech Commands (GSC) datasets. Notably, the proposed training method improves the 10-shot classification accuracy from 33.4% to 74.1% on 11 classes at 1% false alarm accuracy on the GSC dataset, thus making it significantly better-suited for a real use case scenario.

[550] Enhancing Situational Awareness in Wearable Audio Devices Using a Lightweight Sound Event Localization and Detection System

Jun-Wei Yeow, Ee-Leng Tan, Santi Peksi, Zhen-Ting Ong, Woon-Seng Gan

Main category: eess.AS

TL;DR: Proposes an environmental intelligence framework combining Acoustic Scene Classification (ASC) with Sound Event Localization and Detection (SELD) to enhance situational awareness in wearable audio devices with ANC.

Details

Motivation: Active noise control (ANC) in wearable audio devices creates auditory isolation that masks crucial environmental cues, posing safety risks by reducing situational awareness.

Method: Uses lightweight ASC model to infer current environment, then dynamically conditions a SELD network based on scene prediction to tune sensitivity for detecting contextually salient sounds.

Result: On simulated headphone data, the ASC-conditioned SELD system shows improved spatial intelligence compared to conventional baseline approaches.

Conclusion: This represents a crucial step towards intelligent hearables that deliver environmental information for safer, more context-aware listening experiences.

Abstract: Wearable audio devices with active noise control (ANC) enhance listening comfort but often at the expense of situational awareness. However, this auditory isolation may mask crucial environmental cues, posing significant safety risks. To address this, we propose an environmental intelligence framework that combines Acoustic Scene Classification (ASC) with Sound Event Localization and Detection (SELD). Our system first employs a lightweight ASC model to infer the current environment. The scene prediction then dynamically conditions a SELD network, tuning its sensitivity to detect and localize sounds that are most salient to the current context. On simulated headphone data, the proposed ASC-conditioned SELD system demonstrates improved spatial intelligence over a conventional baseline. This work represents a crucial step towards creating intelligent hearables that can deliver crucial environmental information, fostering a safer and more context-aware listening experience.

eess.IV

[551] Stacked Regression using Off-the-shelf, Stimulus-tuned and Fine-tuned Neural Networks for Predicting fMRI Brain Responses to Movies (Algonauts 2025 Report)

Robert Scholz, Kunal Bagga, Christine Ahrends, Carlo Alberto Barbano

Main category: eess.IV

TL;DR: The paper presents a multimodal approach for predicting fMRI brain responses to movie stimuli, combining various models and achieving 10th place in the Algonauts 2025 Challenge.

Details

Motivation: To develop effective multimodal encoding models for predicting brain activity in response to complex movie stimuli, addressing the challenge of understanding how the brain processes audiovisual information.

Method: Integrated multimodal representations from large language models, video encoders, audio models, and vision-language models (both off-the-shelf and fine-tuned), enhanced textual inputs with detailed transcripts and summaries, and used stacked regression to combine individual model predictions.

Result: Achieved solid performance with the team ranking 10th in the Algonauts 2025 Challenge, demonstrating the effectiveness of the multimodal approach for brain activity prediction.

Conclusion: The multimodal integration strategy successfully predicts fMRI responses to movie stimuli, and the authors contribute to the field by making all code and resources publicly available for further research.

Abstract: We present our submission to the Algonauts 2025 Challenge, where the goal is to predict fMRI brain responses to movie stimuli. Our approach integrates multimodal representations from large language models, video encoders, audio models, and vision-language models, combining both off-the-shelf and fine-tuned variants. To improve performance, we enhanced textual inputs with detailed transcripts and summaries, and we explored stimulus-tuning and fine-tuning strategies for language and vision models. Predictions from individual models were combined using stacked regression, yielding solid results. Our submission, under the team name Seinfeld, ranked 10th. We make all code and resources publicly available, contributing to ongoing efforts in developing multimodal encoding models for brain activity.

Mehdi Rabiee, Sergio Greco, Reza Shahbazian, Irina Trubitsyna

Main category: eess.IV

TL;DR: A novel framework for segmenting Focal Cortical Dysplasia (FCD) in 3D brain MRI using transformer-enhanced encoder-decoder architecture with anisotropic Total Variation loss to improve spatial smoothness and reduce false positives.

Details

Motivation: FCD is difficult to detect in MRI due to subtle lesions, limited annotated datasets, small lesion size, weak contrast, and need for anatomical consistency not addressed by standard loss functions.

Method: Transformer-enhanced encoder-decoder architecture combined with novel loss function integrating Dice loss with anisotropic Total Variation term to encourage spatial smoothness and reduce false positives without post-processing.

Result: 11.9% improvement in Dice coefficient, 13.3% higher precision over baseline, and 61.6% reduction in false positive clusters on public FCD dataset with 85 epilepsy patients.

Conclusion: The proposed framework with anisotropic TV loss significantly improves FCD segmentation accuracy, precision, and reduces false positives compared to standard approaches.

Abstract: Focal Cortical Dysplasia (FCD) is a primary cause of drug-resistant epilepsy and is difficult to detect in brain {magnetic resonance imaging} (MRI) due to the subtle and small-scale nature of its lesions. Accurate segmentation of FCD regions in 3D multimodal brain MRI images is essential for effective surgical planning and treatment. However, this task remains highly challenging due to the limited availability of annotated FCD datasets, the extremely small size and weak contrast of FCD lesions, the complexity of handling 3D multimodal inputs, and the need for output smoothness and anatomical consistency, which is often not addressed by standard voxel-wise loss functions. This paper presents a new framework for segmenting FCD regions in 3D brain MRI images. We adopt state-of-the-art transformer-enhanced encoder-decoder architecture and introduce a novel loss function combining Dice loss with an anisotropic {Total Variation} (TV) term. This integration encourages spatial smoothness and reduces false positive clusters without relying on post-processing. The framework is evaluated on a public FCD dataset with 85 epilepsy patients and demonstrates superior segmentation accuracy and consistency compared to standard loss formulations. The model with the proposed TV loss shows an 11.9% improvement on the Dice coefficient and 13.3% higher precision over the baseline model. Moreover, the number of false positive clusters is reduced by 61.6%

[553] SER-Diff: Synthetic Error Replay Diffusion for Incremental Brain Tumor Segmentation

Sashank Makanaboyina

Main category: eess.IV

TL;DR: SER-Diff is a novel framework that combines diffusion-based refinement with incremental learning for brain tumor segmentation, using synthetic error replay to prevent catastrophic forgetting without requiring generative replay or large storage.

Details

Motivation: To address catastrophic forgetting in incremental brain tumor segmentation models, which typically rely on generative replay or auxiliary storage, and to explore the potential of diffusion models in incremental learning contexts.

Method: Uses a frozen teacher diffusion model to generate synthetic error maps from past tasks, which are replayed during new task training. Employs a dual-loss formulation combining Dice loss for new data and knowledge distillation loss for replayed errors.

Result: Achieves state-of-the-art performance on BraTS datasets: Dice scores of 95.8%, 94.9%, and 94.6% and HD95 values of 4.4mm, 4.7mm, and 4.9mm on BraTS2020, BraTS2021, and BraTS2023 respectively.

Conclusion: SER-Diff effectively mitigates catastrophic forgetting while producing more accurate and anatomically coherent segmentations across evolving datasets, demonstrating the successful integration of diffusion models with incremental learning.

Abstract: Incremental brain tumor segmentation is critical for models that must adapt to evolving clinical datasets without retraining on all prior data. However, catastrophic forgetting, where models lose previously acquired knowledge, remains a major obstacle. Recent incremental learning frameworks with knowledge distillation partially mitigate forgetting but rely heavily on generative replay or auxiliary storage. Meanwhile, diffusion models have proven effective for refining tumor segmentations, but have not been explored in incremental learning contexts. We propose Synthetic Error Replay Diffusion (SER-Diff), the first framework that unifies diffusion-based refinement with incremental learning. SER-Diff leverages a frozen teacher diffusion model to generate synthetic error maps from past tasks, which are replayed during training on new tasks. A dual-loss formulation combining Dice loss for new data and knowledge distillation loss for replayed errors ensures both adaptability and retention. Experiments on BraTS2020, BraTS2021, and BraTS2023 demonstrate that SER-Diff consistently outperforms prior methods. It achieves the highest Dice scores of 95.8%, 94.9%, and 94.6%, along with the lowest HD95 values of 4.4 mm, 4.7 mm, and 4.9 mm, respectively. These results indicate that SER-Diff not only mitigates catastrophic forgetting but also delivers more accurate and anatomically coherent segmentations across evolving datasets.

[554] Conditional Denoising Diffusion Model-Based Robust MR Image Reconstruction from Highly Undersampled Data

Mohammed Alsubaie, Wenxi Liu, Linxia Gu, Ovidiu C. Andronesi, Sirani M. Perera, Xianqi Li

Main category: eess.IV

TL;DR: A conditional denoising diffusion framework with iterative data-consistency correction for accelerated MRI reconstruction that embeds MRI physics directly into reverse diffusion steps.

Details

Motivation: MRI acquisition time is a critical limitation in clinical settings. While undersampling accelerates acquisition, it causes artifacts. Existing diffusion models either use unsupervised score functions or apply data consistency as post-processing, lacking direct integration of measurement models.

Method: Hybrid conditional denoising diffusion framework that trains on paired undersampled-ground truth data and embeds the measurement model directly into every reverse diffusion step with iterative data-consistency correction.

Result: Outperforms state-of-the-art deep learning and diffusion-based methods on fastMRI dataset in SSIM, PSNR, and LPIPS metrics, with LPIPS capturing perceptual improvements more faithfully.

Conclusion: Integrating conditional supervision with iterative consistency updates yields substantial improvements in both pixel-level fidelity and perceptual realism, establishing a principled advance toward robust, accelerated MRI reconstruction.

Abstract: Magnetic Resonance Imaging (MRI) is a critical tool in modern medical diagnostics, yet its prolonged acquisition time remains a critical limitation, especially in time-sensitive clinical scenarios. While undersampling strategies can accelerate image acquisition, they often result in image artifacts and degraded quality. Recent diffusion models have shown promise for reconstructing high-fidelity images from undersampled data by learning powerful image priors; however, most existing approaches either (i) rely on unsupervised score functions without paired supervision or (ii) apply data consistency only as a post-processing step. In this work, we introduce a conditional denoising diffusion framework with iterative data-consistency correction, which differs from prior methods by embedding the measurement model directly into every reverse diffusion step and training the model on paired undersampled-ground truth data. This hybrid design bridges generative flexibility with explicit enforcement of MRI physics. Experiments on the fastMRI dataset demonstrate that our framework consistently outperforms recent state-of-the-art deep learning and diffusion-based methods in SSIM, PSNR, and LPIPS, with LPIPS capturing perceptual improvements more faithfully. These results demonstrate that integrating conditional supervision with iterative consistency updates yields substantial improvements in both pixel-level fidelity and perceptual realism, establishing a principled and practical advance toward robust, accelerated MRI reconstruction.

[555] FEAorta: A Fully Automated Framework for Finite Element Analysis of the Aorta From 3D CT Images

Jiasong Chen, Linchen Qian, Ruonan Gong, Christina Sun, Tongran Qin, Thuy Pham, Caitlin Martin, Mohammad Zafar, John Elefteriades, Wei Sun, Liang Liang

Main category: eess.IV

TL;DR: Development of an end-to-end deep neural network for automated generation of patient-specific aortic finite element meshes from 3D CT images to overcome labor-intensive manual segmentation in thoracic aortic aneurysm rupture risk assessment.

Details

Motivation: Aortic aneurysm is a top 20 cause of death in the US, and current patient-specific biomechanical analysis for rupture risk assessment faces two major barriers: labor-intensive manual segmentation and computational burden of traditional FEA.

Method: The authors developed an end-to-end deep neural network that generates patient-specific finite element meshes of the aorta directly from 3D CT images, building on their previous work that reduced FEA computation time through PyTorch FEA library and DNN-FEA integration.

Result: The previous work successfully reduced FEA-based stress computation time from resource-intensive traditional methods to approximately three minutes per case using PyTorch FEA, and further to just a few seconds per case through DNN-FEA integration.

Conclusion: This work addresses the remaining barrier of manual segmentation by developing automated mesh generation capabilities, which combined with their previous computational efficiency improvements, aims to make patient-specific TAA rupture risk assessment clinically feasible and scalable.

Abstract: Aortic aneurysm disease ranks consistently in the top 20 causes of death in the U.S. population. Thoracic aortic aneurysm is manifested as an abnormal bulging of thoracic aortic wall and it is a leading cause of death in adults. From the perspective of biomechanics, rupture occurs when the stress acting on the aortic wall exceeds the wall strength. Wall stress distribution can be obtained by computational biomechanical analyses, especially structural Finite Element Analysis. For risk assessment, probabilistic rupture risk of TAA can be calculated by comparing stress with material strength using a material failure model. Although these engineering tools are currently available for TAA rupture risk assessment on patient specific level, clinical adoption has been limited due to two major barriers: labor intensive 3D reconstruction current patient specific anatomical modeling still relies on manual segmentation, making it time consuming and difficult to scale to a large patient population, and computational burden traditional FEA simulations are resource intensive and incompatible with time sensitive clinical workflows. The second barrier was successfully overcome by our team through the development of the PyTorch FEA library and the FEA DNN integration framework. By incorporating the FEA functionalities within PyTorch FEA and applying the principle of static determinacy, we reduced the FEA based stress computation time to approximately three minutes per case. Moreover, by integrating DNN and FEA through the PyTorch FEA library, our approach further decreases the computation time to only a few seconds per case. This work focuses on overcoming the first barrier through the development of an end to end deep neural network capable of generating patient specific finite element meshes of the aorta directly from 3D CT images.

[556] Fitzpatrick Thresholding for Skin Image Segmentation

Duncan Stothers, Sophia Xu, Carlie Reeves, Lia Gracey

Main category: eess.IV

TL;DR: The paper proposes using Fitzpatrick skin tone-specific decision thresholds to improve psoriasis segmentation performance across different skin tones, particularly benefiting darker skin tones with up to 31% improvement.

Details

Motivation: Current psoriasis segmentation models perform significantly worse on darker skin tones, potentially leading to inequitable healthcare outcomes. Accurate BSA estimation is crucial for treatment assessment and selection.

Method: Assembled psoriasis dataset from six public atlases with Fitzpatrick skin type annotations and segmentation masks. Trained U-Net, ResU-Net, and SETR-small models without tone information, then applied Fitzpatrick-specific decision thresholds during inference.

Result: Fitzpatrick-specific thresholds improved segmentation for darkest skin tones (Fitz VI) by up to +31% binary IoU and +24% Dice on U-Net, with consistent gains for other architectures. The approach is model-agnostic and requires no retraining.

Conclusion: Fitzpatrick thresholding is a simple, cost-effective fairness intervention that significantly improves segmentation performance on darker skin tones without architectural changes or retraining, making it a promising fairness baseline for medical imaging.

Abstract: Accurate estimation of the body surface area (BSA) involved by a rash, such as psoriasis, is critical for assessing rash severity, selecting an initial treatment regimen, and following clinical treatment response. Attempts at segmentation of inflammatory skin disease such as psoriasis perform markedly worse on darker skin tones, potentially impeding equitable care. We assembled a psoriasis dataset sourced from six public atlases, annotated for Fitzpatrick skin type, and added detailed segmentation masks for every image. Reference models based on U-Net, ResU-Net, and SETR-small are trained without tone information. On the tuning split we sweep decision thresholds and select (i) global optima and (ii) per Fitzpatrick skin tone optima for Dice and binary IoU. Adapting Fitzpatrick specific thresholds lifted segmentation performance for the darkest subgroup (Fitz VI) by up to +31 % bIoU and +24 % Dice on UNet, with consistent, though smaller, gains in the same direction for ResU-Net (+25 % bIoU, +18 % Dice) and SETR-small (+17 % bIoU, +11 % Dice). Because Fitzpatrick skin tone classifiers trained on Fitzpatrick-17k now exceed 95 % accuracy, the cost of skin tone labeling required for this technique has fallen dramatically. Fitzpatrick thresholding is simple, model-agnostic, requires no architectural changes, no re-training, and is virtually cost free. We demonstrate the inclusion of Fitzpatrick thresholding as a potential future fairness baseline.

[557] Content-Adaptive Inference for State-of-the-art Learned Video Compression

Ahmet Bilican, M. Akın Yılmaz, A. Murat Tekalp

Main category: eess.IV

TL;DR: Proposed framework adaptively downsamples frames during inference to match motion vector ranges between test and training videos, improving BD-rate performance for complex motion scenes without model retraining.

Details

Motivation: Learned video codecs underperform on videos with complex/large motions due to inability to generalize to unseen motion vector ranges, causing poor flow estimation and compression.

Method: Generic framework that controls motion vector scale by adaptively downsampling frames during encoding to match training data motion ranges, enabling better flow estimation and compression.

Result: Improves BD-rate performance of state-of-the-art DCVC-FM codec by up to 41% on individual videos with complex motions, without model fine-tuning.

Conclusion: Content-adaptive inference through frame downsampling effectively addresses motion vector generalization issues in learned video codecs, significantly boosting performance on complex motion scenes.

Abstract: While the BD-rate performance of recent learned video codec models in both low-delay and random-access modes exceed that of respective modes of traditional codecs on average over common benchmarks, the performance improvements for individual videos with complex/large motions is much smaller compared to scenes with simple motion. This is related to the inability of a learned encoder model to generalize to motion vector ranges that have not been seen in the training set, which causes loss of performance in both coding of flow fields as well as frame prediction and coding. As a remedy, we propose a generic (model-agnostic) framework to control the scale of motion vectors in a scene during inference (encoding) to approximately match the range of motion vectors in the test and training videos by adaptively downsampling frames. This results in down-scaled motion vectors enabling: i) better flow estimation; hence, frame prediction and ii) more efficient flow compression. We show that the proposed framework for content-adaptive inference improves the BD-rate performance of already state-of-the-art low-delay video codec DCVC-FM by up to 41% on individual videos without any model fine tuning. We present ablation studies to show measures of motion and scene complexity can be used to predict the effectiveness of the proposed framework.

[558] Train-Free Segmentation in MRI with Cubical Persistent Homology

Anton François, Raphaël Tinarrage

Main category: eess.IV

TL;DR: A new TDA-based framework for MRI segmentation that identifies objects via thresholding, detects topological features, and deduces segmentation components without requiring large annotated datasets.

Details

Motivation: Traditional machine learning approaches for MRI segmentation often require large annotated datasets and lack interpretability. TDA offers advantages but is typically embedded in deep networks rather than used as a standalone method.

Method: Three-step pipeline: 1) automatic thresholding to identify whole object, 2) detection of distinctive topological subsets with known topology, 3) deduction of segmentation components using localization of representative cycles from persistence diagrams.

Result: Validated on three applications: glioblastoma segmentation (sphere detection), myocardium segmentation (cylinder detection), and cortical plate detection (circle detection in 2D slices). Compared favorably with established supervised and unsupervised baselines.

Conclusion: The framework provides interpretable, modular segmentation without large annotated datasets, adaptable to various MRI segmentation challenges through topological feature mapping to anatomical components.

Abstract: We present a new general framework for segmentation of MRI scans based on Topological Data Analysis (TDA), offering several advantages over traditional machine learning approaches. The pipeline proceeds in three steps, first identifying the whole object to segment via automatic thresholding, then detecting a distinctive subset whose topology is known in advance, and finally deducing the various components of the segmentation. Unlike most prior TDA uses in medical image segmentation, which are typically embedded within deep networks, our approach is a standalone method tailored to MRI. A key ingredient is the localization of representative cycles from the persistence diagram, which enables interpretable mappings from topological features to anatomical components. In particular, the method offers the ability to perform segmentation without the need for large annotated datasets. Its modular design makes it adaptable to a wide range of data segmentation challenges. We validate the framework on three applications: glioblastoma segmentation in brain MRI, where a sphere is to be detected; myocardium in cardiac MRI, forming a cylinder; and cortical plate detection in fetal brain MRI, whose 2D slices are circles. We compare our method with established supervised and unsupervised baselines.

[559] A Deep Learning System for Rapid and Accurate Warning of Acute Aortic Syndrome on Non-contrast CT in China

Yujian Hu, Yilang Xiang, Yan-Jie Zhou, Yangyan He, Dehai Lang, Shifeng Yang, Xiaolong Du, Chunlan Den, Youyao Xu, Gaofeng Wang, Zhengyao Ding, Jingyong Huang, Wenjun Zhao, Xuejun Wu, Donglin Li, Qianqian Zhu, Zhenjiang Li, Chenyang Qiu, Ziheng Wu, Yunjun He, Chen Tian, Yihui Qiu, Zuodong Lin, Xiaolong Zhang, Yuan He, Zhenpeng Yuan, Xiaoxiang Zhou, Rong Fan, Ruihan Chen, Wenchao Guo, Jianpeng Zhang, Tony C. W. Mok, Zi Li, Mannudeep K. Kalra, Le Lu, Wenbo Xiao, Xiaoqiang Li, Yun Bian, Chengwei Shao, Guofu Wang, Wei Lu, Zhengxing Huang, Minfeng Xu, Hongkun Zhang

Main category: eess.IV

TL;DR: iAorta is an AI-based warning system that uses non-contrast CT scans to identify acute aortic syndromes (AAS) with high accuracy, addressing diagnostic challenges in resource-constrained settings where contrast CT angiography is not immediately available.

Details

Motivation: Acute aortic syndromes are difficult to diagnose in acute chest pain patients, and in China's resource-constrained healthcare system, most suspected patients initially receive non-contrast CT instead of the preferred CTA due to economic and workflow limitations.

Method: Developed an AI system called iAorta that analyzes non-contrast CT scans to identify AAS. The system was evaluated through multi-center retrospective studies, large-scale real-world studies, prospective comparative studies, and pilot deployments in emergency departments.

Result: iAorta achieved high performance: AUC of 0.958 in retrospective study, sensitivity of 0.913-0.942 and specificity of 0.991-0.993 in real-world study. In prospective deployment, it correctly identified 21/22 AAS patients and reduced average diagnostic time to 102.1 minutes.

Conclusion: iAorta effectively identifies AAS using non-contrast CT, significantly shortens diagnostic time, and helps prevent delayed or missed diagnoses in resource-constrained settings where contrast CT is not immediately available.

Abstract: The accurate and timely diagnosis of acute aortic syndromes (AAS) in patients presenting with acute chest pain remains a clinical challenge. Aortic CT angiography (CTA) is the imaging protocol of choice in patients with suspected AAS. However, due to economic and workflow constraints in China, the majority of suspected patients initially undergo non-contrast CT as the initial imaging testing, and CTA is reserved for those at higher risk. In this work, we present an artificial intelligence-based warning system, iAorta, using non-contrast CT for AAS identification in China, which demonstrates remarkably high accuracy and provides clinicians with interpretable warnings. iAorta was evaluated through a comprehensive step-wise study. In the multi-center retrospective study (n = 20,750), iAorta achieved a mean area under the receiver operating curve (AUC) of 0.958 (95% CI 0.950-0.967). In the large-scale real-world study (n = 137,525), iAorta demonstrated consistently high performance across various non-contrast CT protocols, achieving a sensitivity of 0.913-0.942 and a specificity of 0.991-0.993. In the prospective comparative study (n = 13,846), iAorta demonstrated the capability to significantly shorten the time to correct diagnostic pathway. For the prospective pilot deployment that we conducted, iAorta correctly identified 21 out of 22 patients with AAS among 15,584 consecutive patients presenting with acute chest pain and under non-contrast CT protocol in the emergency department (ED) and enabled the average diagnostic time of these 21 AAS positive patients to be 102.1 (75-133) mins. Last, the iAorta can help avoid delayed or missed diagnosis of AAS in settings where non-contrast CT remains the unavoidable the initial or only imaging test in resource-constrained regions and in patients who cannot or did not receive intravenous contrast.

[560] Optimizing Breast Cancer Detection in Mammograms: A Comprehensive Study of Transfer Learning, Resolution Reduction, and Multi-View Classification

Daniel G. P. Petrini, Hae Yong Kim

Main category: eess.IV

TL;DR: This paper systematically investigates key research questions in AI-based mammography analysis, demonstrating performance improvements through multi-view architectures and establishing new state-of-the-art benchmarks for breast cancer detection.

Details

Motivation: Despite advances in AI for mammography, critical questions remain unanswered about patch classifiers, transfer learning, resizing methods, multi-view integration, and robustness across image quality.

Method: The study systematically addresses five key research questions through experiments comparing patch-based vs whole-image approaches, natural-image transfer learning, learn-to-resize techniques, multi-view integration, and robustness testing across datasets.

Result: Achieved significant performance gains: improved CBIS-DDSM single-view AUC from 0.8153 to 0.8343, multiple-view AUC from 0.8483 to 0.8658, and VinDr-Mammo multiple-view AUC to 0.8511 with 0.0492 improvement over single view.

Conclusion: Multi-view architectures provide clear advantages for mammogram interpretation, establishing new state-of-the-art benchmarks and offering principled insights for developing more accurate breast cancer screening tools.

Abstract: Mammography, an X-ray-based imaging technique, remains central to the early detection of breast cancer. Recent advances in artificial intelligence have enabled increasingly sophisticated computer-aided diagnostic methods, evolving from patch-based classifiers to whole-image approaches and then to multi-view architectures that jointly analyze complementary projections. Despite this progress, several critical questions remain unanswered. In this study, we systematically investigate these issues by addressing five key research questions: (1) the role of patch classifiers in performance, (2) the transferability of natural-image-trained backbones, (3) the advantages of learn-to-resize over conventional downscaling, (4) the contribution of multi-view integration, and (5) the robustness of findings across varying image quality. Beyond benchmarking, our experiments demonstrate clear performance gains over prior work. For the CBIS-DDSM dataset, we improved single-view AUC from 0.8153 to 0.8343, and multiple-view AUC from 0.8483 to 0.8658. Using a new comparative method, we also observed a 0.0217 AUC increase when extending from single to multiple-view analysis. On the complete VinDr-Mammo dataset, the multiple-view approach further improved results, achieving a 0.0492 AUC increase over single view and reaching 0.8511 AUC overall. These results establish new state-of-the-art benchmarks, providing clear evidence of the advantages of multi-view architectures for mammogram interpretation. Beyond performance, our analysis offers principled insights into model design and transfer learning strategies, contributing to the development of more accurate and reliable breast cancer screening tools. The inference code and trained models are publicly available at https://github.com/dpetrini/multiple-view.

[561] Intelligent Healthcare Imaging Platform: A VLM-Based Framework for Automated Medical Image Analysis and Clinical Report Generation

Samer Al-Hamadani

Main category: eess.IV

TL;DR: An intelligent multimodal framework using Vision-Language Models for medical image analysis that automates tumor detection and clinical report generation across CT, MRI, X-ray, and Ultrasound modalities.

Details

Motivation: To revolutionize diagnostic medicine and clinical decision-making by leveraging AI advancements in healthcare imaging, addressing the need for automated diagnostic support and improved radiological workflow efficiency.

Method: Integrates Google Gemini 2.5 Flash for visual feature extraction and natural language processing, with coordinate verification mechanisms, probabilistic Gaussian modeling for anomaly distribution, multi-layered visualization techniques, and precise prompt engineering for structured clinical information extraction.

Result: Demonstrated high performance in anomaly detection across multiple modalities with location measurement achieving 80 pixels average deviation, featuring zero-shot learning capabilities and a user-friendly Gradio interface for clinical workflow integration.

Conclusion: The framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.

Abstract: The rapid advancement of artificial intelligence (AI) in healthcare imaging has revolutionized diagnostic medicine and clinical decision-making processes. This work presents an intelligent multimodal framework for medical image analysis that leverages Vision-Language Models (VLMs) in healthcare diagnostics. The framework integrates Google Gemini 2.5 Flash for automated tumor detection and clinical report generation across multiple imaging modalities including CT, MRI, X-ray, and Ultrasound. The system combines visual feature extraction with natural language processing to enable contextual image interpretation, incorporating coordinate verification mechanisms and probabilistic Gaussian modeling for anomaly distribution. Multi-layered visualization techniques generate detailed medical illustrations, overlay comparisons, and statistical representations to enhance clinical confidence, with location measurement achieving 80 pixels average deviation. Result processing utilizes precise prompt engineering and textual analysis to extract structured clinical information while maintaining interpretability. Experimental evaluations demonstrated high performance in anomaly detection across multiple modalities. The system features a user-friendly Gradio interface for clinical workflow integration and demonstrates zero-shot learning capabilities to reduce dependence on large datasets. This framework represents a significant advancement in automated diagnostic support and radiological workflow efficiency, though clinical validation and multi-center evaluation are necessary prior to widespread adoption.

Today’s Research Highlights

Table of Contents

cs.CL

[1] OpenStaxQA: A multilingual dataset based on open-source college textbooks

[2] Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets

[3] Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses

[4] CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning

[5] Evaluating Embedding Frameworks for Scientific Domain

[6] TRepLiNa: Layer-wise CKA+REPINA Alignment Improves Low-Resource Machine Translation in Aya-23 8B

[7] Scalable multilingual PII annotation for responsible AI in LLMs

[8] FURINA: A Fully Customizable Role-Playing Benchmark via Scalable Multi-Agent Collaboration Pipeline

[9] Prakriti200: A Questionnaire-Based Dataset of 200 Ayurvedic Prakriti Assessments

[10] Learning to Rewrite Prompts for Bootstrapping LLMs on Downstream Tasks

[11] Dual-stage and Lightweight Patient Chart Summarization for Emergency Physicians

[12] SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

[13] Open ASR Leaderboard: Towards Reproducible and Transparent Multilingual and Long-Form Speech Recognition Evaluation

[14] A Comprehensive Survey of Hallucination in Large Language Models: Causes, Detection, and Mitigation

[15] Making Machines Sound Sarcastic: LLM-Enhanced and Retrieval-Guided Sarcastic Speech Synthesis

[16] Language models for longitudinal analysis of abusive content in Billboard Music Charts

[17] Reproducibility Study of “XRec: Large Language Models for Explainable Recommendation”

[18] Type and Complexity Signals in Multilingual Question Representations

[19] LLM Bias Detection and Mitigation through the Lens of Desired Distributions

[20] AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

[21] EVALUESTEER: Measuring Reward Model Steerability Towards Values and Preference

[22] PredGen: Accelerated Inference of Large Language Models through Input-Time Speculation for Real-Time Speech Interaction

[23] The Sound of Syntax: Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology

[24] EverydayMMQA: A Multilingual and Multimodal Framework for Culturally Grounded Spoken Visual QA

[25] Semantic Regexes: Auto-Interpreting LLM Features with a Structured Language

[26] Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

[27] Protecting De-identified Documents from Search-based Linkage Attacks

[28] Controllable Stylistic Text Generation with Train-Time Attribute-Regularized Diffusion

[29] Reward Model Perspectives: Whose Opinions Do Reward Models Reward?

[30] Instructional Goal-Aligned Question Generation for Student Evaluation in Virtual Lab Settings: How Closely Do LLMs Actually Align?

[31] FinLFQA: Evaluating Attributed Text Generation of LLMs in Financial Long-Form Question Answering

[32] Bridging Discourse Treebanks with a Unified Rhetorical Structure Parser

[33] MathRobust-LV: Evaluation of Large Language Models’ Robustness to Linguistic Variations in Mathematical Reasoning

[34] A Survey on Agentic Security: Applications, Threats and Defenses

[35] Linguistically Informed Tokenization Improves ASR for Underresourced Languages

[36] Test-Time Scaling of Reasoning Models for Machine Translation

[37] Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

[38] From Acceleration to Saturation: Scaling Behavior of Bootstrapped Language Model Pretraining

[39] Flipping the Dialogue: Training and Evaluating User Language Models

[40] The Algebra of Meaning: Why Machines Need Montague More Than Moore’s Law

[41] TinyScientist: An Interactive, Extensible, and Controllable Framework for Building Research Agents

[42] Do Internal Layers of LLMs Reveal Patterns for Jailbreak Detection?

[43] A Comparative Analysis of Contextual Representation Flow in State-Space and Transformer Architectures

[44] Aligning Large Language Models via Fully Self-Synthetic Data

[45] ToolMem: Enhancing Multimodal Agents with Learnable Tool Capability Memory

[46] PIKA: Expert-Level Synthetic Datasets for Post-Training Alignment from Scratch

[47] WeatherArchive-Bench: Benchmarking Retrieval-Augmented Reasoning for Historical Weather Archives

[48] Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback

[49] How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects

[50] Scaling LLM Multi-turn RL with End-to-end Summarization-based Context Management

[51] PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs

[52] Are LLMs Reliable Rankers? Rank Manipulation via Two-Stage Token Optimization

[53] AWM: Accurate Weight-Matrix Fingerprint for Large Language Models

[54] TWIST: Training-free and Label-free Short Text Clustering through Iterative Vector Updating with LLMs

[55] A Formal Framework for Fluency-based Multi-Reference Evaluation in Grammatical Error Correction

[56] Gold-Switch: Training-Free Superposition of Slow- and Fast- Thinking LLMs

[57] Adaptive LLM-Symbolic Reasoning via Dynamic Logical Solver Composition

[58] Foundations of LLM Knowledge Materialization: Termination, Reproducibility, Robustness

[59] Overview of the Plagiarism Detection Task at PAN 2025

[60] BlackboxNLP-2025 MIB Shared Task: Exploring Ensemble Strategies for Circuit Localization Methods

[61] Adaptive Tool Generation with Models as Tools and Reinforcement Learning

[62] Mid-Training of Large Language Models: A Survey

[63] GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics

[64] SID: Multi-LLM Debate Driven by Self Signals

[65] OpenJAI-v1.0: An Open Thai Large Language Model

[66] Unlocking Latent Discourse Translation in LLMs Through Quality-Aware Decoding

[67] $λ$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

[68] MeXtract: Light-Weight Metadata Extraction from Scientific Papers

[69] LongRM: Revealing and Unlocking the Context Boundary of Reward Modeling

[70] EDUMATH: Generating Standards-aligned Educational Math Word Problems

[71] Probing Social Identity Bias in Chinese LLMs with Gendered Pronouns and Social Groups

[72] Towards Reliable Retrieval in RAG Systems for Large Legal Datasets

[73] Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages

[74] Native Hybrid Attention for Efficient Sequence Modeling

[75] Mining the Mind: What 100M Beliefs Reveal About Frontier LLM Knowledge

[76] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models

[77] Search-R3: Unifying Reasoning and Embedding Generation in Large Language Models