Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 151]
cs.CV [Total: 124]
cs.AI [Total: 124]
cs.SD [Total: 18]
cs.LG [Total: 132]
cs.MA [Total: 5]
cs.MM [Total: 0]
eess.AS [Total: 4]
eess.IV [Total: 7]

cs.CL

[1] MedPI: Evaluating AI Systems in Medical Patient-facing Interactions

Diego Fajardo V., Oleksii Proniakin, Victoria-Elisabeth Gruber, Razvan Marinescu

Main category: cs.CL

TL;DR: MedPI is a comprehensive benchmark for evaluating LLMs in medical dialogues across 105 dimensions, revealing poor performance of current models especially on differential diagnosis.

Details

Motivation: Existing benchmarks focus on single-turn QA, lacking evaluation of LLMs in realistic patient-clinician conversations across medical processes, safety, outcomes, and communication.

Method: Five-layer framework: synthetic EHR-like patient packets, AI Patients with memory/affect, task matrix (encounter reasons × objectives), 105-dimension evaluation rubric aligned with ACGME competencies, and calibrated AI Judges for scoring.

Result: Tested 9 flagship models across 366 AI Patients and 7,097 conversations; all showed low performance across dimensions, particularly poor on differential diagnosis.

Conclusion: Current LLMs perform poorly in medical dialogue tasks; MedPI provides a comprehensive evaluation framework to guide future development of LLMs for diagnosis and treatment recommendations.

Abstract: We present MedPI, a high-dimensional benchmark for evaluating large language models (LLMs) in patient-clinician conversations. Unlike single-turn question-answer (QA) benchmarks, MedPI evaluates the medical dialogue across 105 dimensions comprising the medical process, treatment safety, treatment outcomes and doctor-patient communication across a granular, accreditation-aligned rubric. MedPI comprises five layers: (1) Patient Packets (synthetic EHR-like ground truth); (2) an AI Patient instantiated through an LLM with memory and affect; (3) a Task Matrix spanning encounter reasons (e.g. anxiety, pregnancy, wellness checkup) x encounter objectives (e.g. diagnosis, lifestyle advice, medication advice); (4) an Evaluation Framework with 105 dimensions on a 1-4 scale mapped to the Accreditation Council for Graduate Medical Education (ACGME) competencies; and (5) AI Judges that are calibrated, committee-based LLMs providing scores, flags, and evidence-linked rationales. We evaluate 9 flagship models – Claude Opus 4.1, Claude Sonnet 4, MedGemma, Gemini 2.5 Pro, Llama 3.3 70b Instruct, GPT-5, GPT OSS 120b, o3, Grok-4 – across 366 AI Patients and 7,097 conversations using a standardized “vanilla clinician” prompt. For all LLMs, we observe low performance across a variety of dimensions, in particular on differential diagnosis. Our work can help guide future use of LLMs for diagnosis and treatment recommendations.

[2] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation

Keerthana Murugaraj, Salima Lamsiyah, Martin Theobald

Main category: cs.CL

TL;DR: RAGVUE is a diagnostic framework for evaluating RAG systems with fine-grained metrics and explanations, addressing limitations of existing single-score metrics.

Details

Motivation: Existing RAG evaluation metrics collapse heterogeneous behaviors into single scores and provide little insight into whether errors come from retrieval, reasoning, or grounding, making it difficult to diagnose and improve RAG systems.

Method: RAGVUE decomposes RAG behavior into four components: retrieval quality, answer relevance/completeness, strict claim-level faithfulness, and judge calibration. Each metric includes structured explanations for transparency. The framework supports both manual metric selection and fully automated agentic evaluation, with Python API, CLI, and Streamlit interface.

Result: RAGVUE surfaces fine-grained failures that existing tools like RAGAS often overlook. It provides a transparent evaluation process with structured explanations and supports integration into research pipelines and practical RAG development.

Conclusion: RAGVUE offers a diagnostic and explainable framework for RAG evaluation that addresses the limitations of existing metrics, providing actionable insights for improving RAG systems through transparent, fine-grained assessment.

Abstract: Evaluating Retrieval-Augmented Generation (RAG) systems remains a challenging task: existing metrics often collapse heterogeneous behaviors into single scores and provide little insight into whether errors arise from retrieval,reasoning, or grounding. In this paper, we introduce RAGVUE, a diagnostic and explainable framework for automated, reference-free evaluation of RAG pipelines. RAGVUE decomposes RAG behavior into retrieval quality, answer relevance and completeness, strict claim-level faithfulness, and judge calibration. Each metric includes a structured explanation, making the evaluation process transparent. Our framework supports both manual metric selection and fully automated agentic evaluation. It also provides a Python API, CLI, and a local Streamlit interface for interactive usage. In comparative experiments, RAGVUE surfaces fine-grained failures that existing tools such as RAGAS often overlook. We showcase the full RAGVUE workflow and illustrate how it can be integrated into research pipelines and practical RAG development. The source code and detailed instructions on usage are publicly available on GitHub

[3] V-FAT: Benchmarking Visual Fidelity Against Text-bias

Ziteng Wang, Yujie He, Guanliang Li, Siqi Yang, Jiaqi Xiong, Songxiang Liu

Main category: cs.CL

TL;DR: The paper introduces V-FAT, a diagnostic benchmark to measure Text Bias in MLLMs, where models rely on linguistic shortcuts rather than genuine visual reasoning, and proposes a Visual Robustness Score to evaluate true visual fidelity.

Details

Motivation: There is growing concern that Multimodal Large Language Models (MLLMs) rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon termed Text Bias. The paper investigates the fundamental tension between visual perception and linguistic priors in these models.

Method: The authors decouple Text Bias into two dimensions: Internal Corpus Bias (from statistical correlations in pretraining) and External Instruction Bias (from alignment-induced sycophancy). They introduce V-FAT, a diagnostic benchmark with 4,026 VQA instances across six semantic domains, using a Three-Level Evaluation Framework that systematically increases conflict between visual evidence and textual information. They also propose the Visual Robustness Score (VRS) metric to penalize linguistic guesses and reward true visual fidelity.

Result: Evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance, demonstrating their vulnerability to Text Bias.

Conclusion: Current MLLMs suffer from Text Bias, relying more on linguistic shortcuts than genuine visual reasoning. The V-FAT benchmark and VRS metric provide tools to diagnose and measure this problem, highlighting the need for improved visual grounding in multimodal models.

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated impressive performance on standard visual reasoning benchmarks. However, there is growing concern that these models rely excessively on linguistic shortcuts rather than genuine visual grounding, a phenomenon we term Text Bias. In this paper, we investigate the fundamental tension between visual perception and linguistic priors. We decouple the sources of this bias into two dimensions: Internal Corpus Bias, stemming from statistical correlations in pretraining, and External Instruction Bias, arising from the alignment-induced tendency toward sycophancy. To quantify this effect, we introduce V-FAT (Visual Fidelity Against Text-bias), a diagnostic benchmark comprising 4,026 VQA instances across six semantic domains. V-FAT employs a Three-Level Evaluation Framework that systematically increases the conflict between visual evidence and textual information: (L1) internal bias from atypical images, (L2) external bias from misleading instructions, and (L3) synergistic bias where both coincide. We introduce the Visual Robustness Score (VRS), a metric designed to penalize “lucky” linguistic guesses and reward true visual fidelity. Our evaluation of 12 frontier MLLMs reveals that while models excel in existing benchmarks, they experience significant visual collapse under high linguistic dominance.

[4] Automatic Construction of Chinese Verb Collostruction Database

Xuri Tang, Daohuan Liu

Main category: cs.CL

TL;DR: Proposes unsupervised method to build Chinese verb collostruction database to complement LLMs with interpretable rules for error correction.

Details

Motivation: To provide explicit, interpretable linguistic rules for Chinese verb usage that complement LLMs in scenarios requiring explanation and interpretability, where black-box LLMs fall short.

Method: Defines verb collostructions as projective, rooted, ordered, directed acyclic graphs; uses clustering algorithms on sentences from large-scale corpus to generate collostructions for each verb.

Result: Generated collostructions show functional independence and graded typicality; verb grammatical error correction using maximum matching with collostructions outperforms LLMs.

Conclusion: Unsupervised collostruction database provides interpretable linguistic rules that effectively complement LLMs, especially for grammatical error correction tasks requiring explainability.

Abstract: This paper proposes a fully unsupervised approach to the construction of verb collostruction database for Chinese language, aimed at complementing LLMs by providing explicit and interpretable rules for application scenarios where explanation and interpretability are indispensable. The paper formally defines a verb collostruction as a projective, rooted, ordered, and directed acyclic graph and employs a series of clustering algorithms to generate collostructions for a given verb from a list of sentences retrieved from large-scale corpus. Statistical analysis demonstrates that the generated collostructions possess the design features of functional independence and graded typicality. Evaluation with verb grammatical error correction shows that the error correction algorithm based on maximum matching with collostructions achieves better performance than LLMs.

[5] POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering

Yichen Xu, Liangyu Chen, Liang Zhang, Jianzhe Ma, Wenxuan Wang, Qin Jin

Main category: cs.CL

TL;DR: PolyChartQA is the first large-scale multilingual benchmark for chart question answering across 10 languages, addressing the English-centric limitation of existing chart understanding benchmarks.

Details

Motivation: Existing chart understanding benchmarks are overwhelmingly English-centric, limiting accessibility and relevance to global audiences. There's a need for multilingual benchmarks to develop globally inclusive vision-language models.

Method: Constructed through a scalable pipeline using data translation and code reuse, supported by LLM-based translation and rigorous quality control. Created 22,606 charts and 26,151 QA pairs across 10 languages.

Result: Revealed significant performance gap between English and other languages (especially low-resource ones) when evaluating state-of-the-art LVLMs. Fine-tuning on PolyChartQA-Train yields substantial gains in multilingual chart understanding across diverse model sizes and architectures.

Conclusion: PolyChartQA provides a foundation for developing globally inclusive vision-language models capable of understanding charts across diverse linguistic contexts, addressing the multilingual gap in chart understanding research.

Abstract: Charts are a universally adopted medium for data communication, yet existing chart understanding benchmarks are overwhelmingly English-centric, limiting their accessibility and relevance to global audiences. To address this limitation, we introduce PolyChartQA, the first large-scale multilingual benchmark for chart question answering, comprising 22,606 charts and 26,151 QA pairs across 10 diverse languages. PolyChartQA is constructed through a scalable pipeline that enables efficient multilingual chart generation via data translation and code reuse, supported by LLM-based translation and rigorous quality control. We systematically evaluate multilingual chart understanding with PolyChartQA on state-of-the-art LVLMs and reveal a significant performance gap between English and other languages, particularly low-resource ones. Additionally, we introduce a companion multilingual chart question answering training set, PolyChartQA-Train, on which fine-tuning LVLMs yields substantial gains in multilingual chart understanding across diverse model sizes and architectures. Together, our benchmark provides a foundation for developing globally inclusive vision-language models capable of understanding charts across diverse linguistic contexts.

[6] Attribute-Aware Controlled Product Generation with LLMs for E-commerce

Virginia Negri, Víctor Martínez Gómez, Sergio A. Balanya, Subburam Rajaram

Main category: cs.CL

TL;DR: LLM-based synthetic data generation for e-commerce product information extraction achieves performance comparable to real data and significantly improves over zero-shot baselines.

Details

Motivation: High-quality labeled datasets for e-commerce product information extraction are difficult to obtain, creating a need for synthetic data generation methods.

Method: A systematic approach using LLMs with controlled modification framework including attribute-preserving modification, controlled negative example generation, and systematic attribute removal, using attribute-aware prompts with store constraints.

Result: Human evaluation shows 99.6% naturalness, 96.5% valid attributes, and over 90% attribute consistency. On MAVE dataset, synthetic data achieves 60.5% accuracy (vs 60.8% real data and 13.4% zero-shot baseline). Hybrid configurations reach 68.8% accuracy.

Conclusion: The framework provides a practical solution for augmenting e-commerce datasets, especially valuable for low-resource scenarios where labeled data is scarce.

Abstract: Product information extraction is crucial for e-commerce services, but obtaining high-quality labeled datasets remains challenging. We present a systematic approach for generating synthetic e-commerce product data using Large Language Models (LLMs), introducing a controlled modification framework with three strategies: attribute-preserving modification, controlled negative example generation, and systematic attribute removal. Using a state-of-the-art LLM with attribute-aware prompts, we enforce store constraints while maintaining product coherence. Human evaluation of 2000 synthetic products demonstrates high effectiveness, with 99.6% rated as natural, 96.5% containing valid attribute values, and over 90% showing consistent attribute usage. On the public MAVE dataset, our synthetic data achieves 60.5% accuracy, performing on par with real training data (60.8%) and significantly improving upon the 13.4% zero-shot baseline. Hybrid configurations combining synthetic and real data further improve performance, reaching 68.8% accuracy. Our framework provides a practical solution for augmenting e-commerce datasets, particularly valuable for low-resource scenarios.

[7] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems

Zihan Gao, Mohsin Y. K. Yousufi, Jacob Thebault-Spieker

Main category: cs.CL

TL;DR: A participatory protocol called Collective Narrative Grounding transforms community stories into structured narrative units for AI systems to address LLM knowledge blind spots on local queries, reducing 76.7% of errors in community-specific QA.

Details

Motivation: LLMs often fail on community-specific queries, creating "knowledge blind spots" that marginalize local voices and reinforce epistemic injustice. Current AI systems lack local grounding for community-specific information.

Method: Collective Narrative Grounding protocol: participatory workshops (N=24) to transform community stories into structured narrative units with entity, time, and place extraction. Audit of 14,782 local QA pairs to scope problem, then participatory QA evaluation.

Result: 76.7% of errors in local QA due to factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments. State-of-the-art LLM answered <21% of community questions correctly without added context. Missing facts often appear in collected narratives.

Conclusion: The protocol, taxonomy, and participatory evaluation provide foundation for community-grounded AI. Key design tensions identified: representation/power, governance/control, privacy/consent. Requirements for retrieval-first, provenance-visible, locally governed QA systems.

Abstract: Large language model (LLM) question-answering systems often fail on community-specific queries, creating “knowledge blind spots” that marginalize local voices and reinforce epistemic injustice. We present Collective Narrative Grounding, a participatory protocol that transforms community stories into structured narrative units and integrates them into AI systems under community governance. Learning from three participatory mapping workshops with N=24 community members, we designed elicitation methods and a schema that retain narrative richness while enabling entity, time, and place extraction, validation, and provenance control. To scope the problem, we audit a county-level benchmark of 14,782 local information QA pairs, where factual gaps, cultural misunderstandings, geographic confusions, and temporal misalignments account for 76.7% of errors. On a participatory QA set derived from our workshops, a state-of-the-art LLM answered fewer than 21% of questions correctly without added context, underscoring the need for local grounding. The missing facts often appear in the collected narratives, suggesting a direct path to closing the dominant error modes for narrative items. Beyond the protocol and pilot, we articulate key design tensions, such as representation and power, governance and control, and privacy and consent, providing concrete requirements for retrieval-first, provenance-visible, locally governed QA systems. Together, our taxonomy, protocol, and participatory evaluation offer a rigorous foundation for building community-grounded AI that better answers local questions.

[8] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation

Anas Ezzakri, Nicola Piovesan, Mohamed Sana, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

Main category: cs.CL

TL;DR: TeleTables is a benchmark for evaluating LLMs’ knowledge and interpretation of tables in telecom standards (3GPP), revealing that smaller models struggle while larger models show better reasoning, highlighting the need for domain-specialized fine-tuning.

Details

Motivation: LLMs perform poorly on telecom standards despite their increasing use in telecom engineering tasks. The authors identify that telecom standards densely include tables containing essential information, but LLMs' knowledge and interpretation ability of such tables remains unexamined.

Method: Created TeleTables benchmark through a multi-stage data generation pipeline: extracted tables from 3GPP standards, used multimodal and reasoning-oriented LLMs to generate and validate questions, resulting in 500 human-verified question-answer pairs with corresponding tables in multiple formats.

Result: Smaller models (under 10B parameters) struggle with both recalling 3GPP knowledge and interpreting tables, indicating limited exposure to telecom standards in pretraining. Larger models show stronger reasoning on table interpretation. Overall performance highlights limitations in handling complex technical material.

Conclusion: TeleTables reveals significant gaps in LLMs’ ability to handle telecom standards, particularly with table interpretation. The benchmark demonstrates the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards, providing a valuable resource for evaluating and improving LLMs in this domain.

Abstract: Language Models (LLMs) are increasingly explored in the telecom industry to support engineering tasks, accelerate troubleshooting, and assist in interpreting complex technical documents. However, recent studies show that LLMs perform poorly on telecom standards, particularly 3GPP specifications. We argue that a key reason is that these standards densely include tables to present essential information, yet the LLM knowledge and interpretation ability of such tables remains largely unexamined. To address this gap, we introduce TeleTables, a benchmark designed to evaluate both the implicit knowledge LLMs have about tables in technical specifications and their explicit ability to interpret them. TeleTables is built through a novel multi-stage data generation pipeline that extracts tables from 3GPP standards and uses multimodal and reasoning-oriented LLMs to generate and validate questions. The resulting dataset, which is publicly available, comprises 500 human-verified question-answer pairs, each associated with the corresponding table in multiple formats. Our evaluation shows that, smaller models (under 10B parameters) struggle both to recall 3GPP knowledge and to interpret tables, indicating the limited exposure to telecom standards in their pretraining and the insufficient inductive biases for navigating complex technical material. Larger models, on the other hand, show stronger reasoning on table interpretation. Overall, TeleTables highlights the need for domain-specialized fine-tuning to reliably interpret and reason over telecom standards.

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

Main category: cs.CL

TL;DR: FronTalk is a benchmark for conversational front-end code generation with multi-modal feedback, featuring 100 multi-turn dialogues from real websites, revealing key challenges in model forgetting and visual interpretation.

Details

Motivation: Front-end development relies heavily on visual artifacts like sketches and mockups to convey design intent, but the role of these multi-modal elements in multi-turn code generation remains unexplored. The authors aim to address this research gap.

Method: Created FronTalk benchmark with 100 multi-turn dialogues from real-world websites across diverse domains. Each turn includes both textual and visual instructions representing the same user intent. Proposed an agent-based evaluation framework using a web agent to simulate users and measure functional correctness and user experience.

Result: Evaluation of 20 models revealed two key challenges: (1) significant forgetting issue where models overwrite previously implemented features, and (2) persistent difficulty interpreting visual feedback, especially for open-source VLMs. Proposed AceCoder baseline reduces forgetting to nearly zero and improves performance by up to 9.3% (56.0% to 65.3%).

Conclusion: FronTalk provides a foundation for future research in front-end development and multi-turn, multi-modal code generation interaction dynamics. The benchmark highlights critical challenges in conversational code generation that need systematic exploration.

Abstract: We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups and annotated creenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework leveraging a web agent to simulate users and explore the website, and thus measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that are under-explored systematically in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach significantly reduces forgetting to nearly zero and improves the performance by up to 9.3% (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at https://github.com/shirley-wu/frontalk

Xinhao Sun, Maoliang Li, Zihao Zheng, Jiayu Chen, Hezhao Xu, Yun Liang, Xiang Chen

Main category: cs.CL

TL;DR: Proposes a dynamic remasking strategy for diffusion language models that adapts confidence thresholds per token per step based on temporal variance and spatial deviance, achieving up to 8.9× speedup while maintaining quality.

Details

Motivation: Current diffusion language models use fixed global confidence thresholds for remasking, which ignores the temporal evolution and spatial relationships between tokens, leading to redundant iterations and constrained parallelism.

Method: Introduces a novel remasking approach that dynamically detects Temporal Variance (convergence status) and Spatial Deviance (inter-token correlations) for each token, then adaptively adjusts confidence thresholds per token per step based on these signals.

Result: Significantly improves DLM operational efficiency across mainstream datasets with speedups up to 8.9 times while faithfully preserving generation quality.

Conclusion: Dynamic per-token threshold adjustment based on temporal and spatial dynamics is more effective than fixed global thresholds for diffusion language model remasking, enabling substantial efficiency gains without quality degradation.

Abstract: Unlike autoregressive language models, diffusion language models (DLMs) generate text by iteratively denoising all token positions in parallel. At each timestep, the remasking strategy of a DLM selects low-priority tokens to defer their decoding, thereby improving both efficiency and output quality. However, mainstream remasking strategies rely on a single global confidence threshold, overlooking the temporal and spatial dynamics of individual tokens. Motivated by the redundant iterations and constrained parallelism introduced by fixed-threshold remasking, we propose a novel remasking approach that dynamically detects Temporal Variance and Spatial Deviance of each token, which reflect its convergence status and inter-token correlations. Using these signals, our method adaptively adjusts the confidence threshold for every token at every step. Empirical results show that our approach significantly improves the operational efficiency of DLMs across mainstream datasets, achieving speedups of up to 8.9 times while faithfully preserving generation quality.

[11] Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation

Aram Virabyan

Main category: cs.CL

TL;DR: AI system combining fine-tuned language model with RAG for university admissions inquiry management, improving response time and accuracy.

Details

Motivation: University admissions offices struggle with high inquiry volumes while maintaining response quality, which affects prospective students' perceptions. Need to address response time and information accuracy challenges.

Method: Hybrid approach integrating fine-tuned language model with Retrieval-Augmented Generation (RAG). Fine-tuned model on curated admissions-specific dataset to enhance domain understanding, combined with RAG for accessing up-to-date information. Also explored optimization strategies for response generation logic to balance quality and speed.

Result: The system improves ability to interpret RAG-provided data accurately and generate domain-relevant outputs for university admissions communications. Enhanced contextual understanding of complex admissions rules and specific details.

Conclusion: The proposed AI system effectively addresses admissions inquiry management challenges by combining RAG’s information retrieval capabilities with fine-tuning’s domain-specific understanding, enabling high-quality, accurate responses that meet admissions communication requirements.

Abstract: University admissions offices face the significant challenge of managing high volumes of inquiries efficiently while maintaining response quality, which critically impacts prospective students’ perceptions. This paper addresses the issues of response time and information accuracy by proposing an AI system integrating a fine-tuned language model with Retrieval-Augmented Generation (RAG). While RAG effectively retrieves relevant information from large datasets, its performance in narrow, complex domains like university admissions can be limited without adaptation, potentially leading to contextually inadequate responses due to the intricate rules and specific details involved. To overcome this, we fine-tuned the model on a curated dataset specific to admissions processes, enhancing its ability to interpret RAG-provided data accurately and generate domain-relevant outputs. This hybrid approach leverages RAG’s ability to access up-to-date information and fine-tuning’s capacity to embed nuanced domain understanding. We further explored optimization strategies for the response generation logic, experimenting with settings to balance response quality and speed, aiming for consistently high-quality outputs that meet the specific requirements of admissions communications.

[12] WESR: Scaling and Evaluating Word-level Event-Speech Recognition

Chenchen Yang, Kexin Huang, Liwei Fan, Qian Tu, Botian Jiang, Dong Zhang, Linqi Yin, Shimin Li, Zhaoye Fei, Qinyuan Cheng, Xipeng Qiu

Main category: cs.CL

TL;DR: The paper introduces WESR-Bench, a new benchmark for vocal event detection with refined taxonomy and precise localization evaluation, along with strong baseline models trained on a large corpus.

Details

Motivation: Current methods for detecting non-verbal vocal events (like laughing, crying) have insufficient task definitions with limited categories, ambiguous temporal granularity, and lack standardized evaluation frameworks, hindering downstream applications.

Method: 1) Developed refined taxonomy of 21 vocal events categorized into discrete (standalone) vs continuous (mixed with speech) types; 2) Created WESR-Bench, an expert-annotated evaluation set with 900+ utterances using position-aware protocol to disentangle ASR errors from event detection; 3) Built strong baseline by constructing 1,700+ hour corpus and training specialized models.

Result: The specialized models surpass both open-source audio-language models and commercial APIs while preserving ASR quality, providing a strong baseline for vocal event detection.

Conclusion: WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes, addressing the critical gap in precise localization of non-verbal vocal events.

Abstract: Speech conveys not only linguistic information but also rich non-verbal vocal events such as laughing and crying. While semantic transcription is well-studied, the precise localization of non-verbal events remains a critical yet under-explored challenge. Current methods suffer from insufficient task definitions with limited category coverage and ambiguous temporal granularity. They also lack standardized evaluation frameworks, hindering the development of downstream applications. To bridge this gap, we first develop a refined taxonomy of 21 vocal events, with a new categorization into discrete (standalone) versus continuous (mixed with speech) types. Based on the refined taxonomy, we introduce WESR-Bench, an expert-annotated evaluation set (900+ utterances) with a novel position-aware protocol that disentangles ASR errors from event detection, enabling precise localization measurement for both discrete and continuous events. We also build a strong baseline by constructing a 1,700+ hour corpus, and train specialized models, surpassing both open-source audio-language models and commercial APIs while preserving ASR quality. We anticipate that WESR will serve as a foundational resource for future research in modeling rich, real-world auditory scenes.

Wei Xia, Haowen Tang, Luozheng Li

Main category: cs.CL

TL;DR: LLMs have internal political ideology structures that don’t fully align with human ideological space; a lightweight linear probe quantifies and corrects this misalignment by adjusting output probabilities without retraining.

Details

Motivation: LLMs internally organize political ideology in ways that are systematically misaligned with human ideological space, creating a need for efficient alignment methods that preserve model capabilities.

Method: Introduces a lightweight linear probe that quantifies ideological misalignment and minimally corrects the output layer by calculating bias scores from internal features and directly adjusting final output probabilities.

Result: The method provides a practical, low-cost solution for aligning models with specific user opinions while preserving the original reasoning power of the model.

Conclusion: LLMs’ internal political ideology structures are systematically misaligned with human space, but this can be efficiently corrected using lightweight probes that adjust output probabilities without full retraining.

Abstract: LLMs internally organize political ideology along low-dimensional structures that are partially, but not fully aligned with human ideological space. This misalignment is systematic, model specific, and measurable. We introduce a lightweight linear probe that both quantifies the misalignment and minimally corrects the output layer. This paper introduces a simple and efficient method for aligning models with specific user opinions. Instead of retraining the model, we calculated a bias score from its internal features and directly adjusted the final output probabilities. This solution is practical and low-cost and preserves the original reasoning power of the model.

[14] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach

Xiang Cheng, Wen Wang, Anindya Ghose

Main category: cs.CL

TL;DR: LEXMA is a reinforcement learning framework that fine-tunes LLMs to generate narrative explanations for AI decisions that are both decision-correct and audience-appropriate, without needing human-annotated training data.

Details

Motivation: Current explainable AI methods use numerical feature attributions that lack coherent narratives. LLMs can generate natural language explanations but face challenges: ensuring decision correctness and faithfulness, serving multiple audiences without changing decision rules, and training without large human-scored explanation datasets.

Method: LEXMA uses reinforcement learning with reflection-augmented supervised fine-tuning and two stages of Group Relative Policy Optimization (GRPO). It fine-tunes separate parameter sets for decision correctness and stylistic requirements for different audiences, using reward signals that don’t require human-annotated explanations.

Result: In mortgage approval decisions, LEXMA significantly improves predictive performance over other LLM baselines. Human evaluations show expert-facing explanations are more risk-focused, while consumer-facing explanations are clearer, more actionable, and more polite.

Conclusion: LEXMA provides a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering scalable deployment potential for transparent AI systems.

Abstract: Artificial Intelligence (AI) models increasingly drive high-stakes consumer interactions, yet their decision logic often remains opaque. Prevailing explainable AI techniques rely on post hoc numerical feature attributions, which fail to provide coherent narratives behind model decisions. Large language models (LLMs) present an opportunity to generate natural-language explanations, but three design challenges remain unresolved: explanations must be both decision-correct and faithful to the factors that drive the prediction; they should be able to serve multiple audiences without shifting the underlying decision rule; and they should be trained in a label-efficient way that does not depend on large corpora of human-scored explanations. To address these challenges, we introduce LEXMA (LLM-based EXplanations for Multi-Audience decisions), a reinforcement-learning-based fine-tuning framework that produces narrative-driven, audience-appropriate explanations. LEXMA combines reflection-augmented supervised fine-tuning with two stages of Group Relative Policy Optimization (GRPO). Specifically, it fine-tunes two separate parameter sets to improve decision correctness and satisfy stylistic requirements for different audiences, using reward signals that do not rely on human-annotated explanations. We instantiate LEXMA in the context of mortgage approval decisions. Results demonstrate that LEXMA yields significant improvements in predictive performance compared with other LLM baselines. Moreover, human evaluations show that expert-facing explanations generated by our approach are more risk-focused, and consumer-facing explanations are clearer, more actionable, and more polite. Our study contributes a cost-efficient, systematic LLM fine-tuning approach to enhance explanation quality for business decisions, offering strong potential for scalable deployment of transparent AI systems.

[15] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction

Qing Wang, Zehan Li, Yaodong Song, Hongjie Chen, Jian Kang, Jie Lian, Jie Li, Yongxiang Li, Xuelong Li

Main category: cs.CL

TL;DR: A unified spoken language model with emotional intelligence using Injected Emotional-Attribution Thinking (IEAT) to internalize emotion-aware reasoning, achieving top performance on emotional benchmarks.

Details

Motivation: Current spoken language models lack deep emotional understanding and reasoning capabilities. There's a need for models that can internalize emotional states and their causes rather than treating emotion as explicit supervision, enabling more natural and empathetic spoken dialogue systems.

Method: Proposes Injected Emotional-Attribution Thinking (IEAT) that incorporates user emotional states and their underlying causes into model reasoning. Uses two-stage training: 1) speech-text alignment and emotional attribute modeling via self-distillation, 2) end-to-end cross-modal joint optimization for consistency between textual and spoken emotional expressions.

Result: Achieves top-ranked performance on the HumDial Emotional Intelligence benchmark across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.

Conclusion: The IEAT approach successfully enables emotion-aware reasoning to be internalized in spoken language models, leading to superior emotional intelligence capabilities in dialogue systems without requiring explicit emotional supervision.

Abstract: This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model’s internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. The model is trained with a two-stage progressive strategy. The first stage performs speech-text alignment and emotional attribute modeling via self-distillation, while the second stage conducts end-to-end cross-modal joint optimization to ensure consistency between textual and spoken emotional expressions. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation under both LLM-based and human evaluations.

[16] Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments

Seokhwan Ko, Donghyeon Lee, Jaewoo Chun, Hyungsoo Han, Junghwan Cho

Main category: cs.CL

TL;DR: A locally-deployed RAG system using PubMedBERT and LLaMA3 for recommending research collaborators based on PubMed publications within hospital privacy constraints.

Details

Motivation: LLMs are valuable in medical settings but hospital privacy regulations require sensitive data to be processed locally, creating a need for systems that can operate within these constraints while supporting biomedical knowledge discovery.

Method: Developed a retrieval-augmented generation (RAG) system using PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis to recommend research collaborators based on PubMed publications.

Result: Demonstrated the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.

Conclusion: The study shows that combining domain-specific encoders like PubMedBERT with lightweight LLMs like LLaMA3 enables effective biomedical knowledge discovery while maintaining compliance with hospital privacy and network security regulations.

Abstract: Large language models (LLMs) are increasingly recognized as valuable tools across the medical environment, supporting clinical, research, and administrative workflows. However, strict privacy and network security regulations in hospital settings require that sensitive data be processed within fully local infrastructures. Within this context, we developed and evaluated a retrieval-augmented generation (RAG) system designed to recommend research collaborators based on PubMed publications authored by members of a medical institution. The system utilizes PubMedBERT for domain-specific embedding generation and a locally deployed LLaMA3 model for generative synthesis. This study demonstrates the feasibility and utility of integrating domain-specialized encoders with lightweight LLMs to support biomedical knowledge discovery under local deployment constraints.

[17] Complexity Agnostic Recursive Decomposition of Thoughts

Kaleem Ullah Qasim, Jiashu Zhang, Hafiz Saif Ur Rehman

Main category: cs.CL

TL;DR: CARD is a framework that predicts problem complexity before generation and adapts decomposition strategies accordingly, achieving higher accuracy with significantly fewer tokens on math reasoning tasks.

Details

Motivation: Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem-specific difficulty, leading to inefficient token usage and suboptimal performance.

Method: CARD uses MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model that predicts 30 fine-grained complexity features, followed by a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile, and (2) per-step thought budget allocation via recursive MRCE profiling.

Result: On GSM8K, CARD achieves 81.4% to 89.2% accuracy while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, it reaches 75.1% to 86.8% accuracy using 1.71x to 5.74x fewer tokens.

Conclusion: Preemptive complexity estimation enables both higher accuracy and significant efficiency gains in multi-step reasoning tasks, demonstrating the value of adaptive decomposition strategies over fixed approaches.

Abstract: Large language models often fail on multi-step reasoning due to fixed reasoning strategies that ignore problem specific difficulty. We introduce CARD (Complexity Agnostic Recursive Decomposition), a framework that predicts problem complexity before generation and adapts decomposition accordingly. Our system comprises MRCE (Multi-dimensional Reasoning Complexity Estimator), a 0.6B Qwen model predicting 30 fine-grained features from question text and a two-stage recursive solver: (1) hierarchical decomposition into K steps based on task profile and (2) per-step thought budget allocation (1, 5-9, or 10 thoughts) via recursive MRCE profiling. Evaluated on three reasoning models (Qwen3-0.6B, DeepSeek-R1-Distill-Qwen-1.5B, Qwen3-1.7B), CARD achieves 81.4% to 89.2% accuracy on GSM8K while reducing token cost by 1.88x to 2.40x compared to fixed decomposition baselines. On MATH-500, CARD reaches 75.1 to 86.8% accuracy using 1.71x to 5.74x fewer tokens. Our results demonstrate that preemptive complexity estimation enables both higher accuracy and significant efficiency gains.

[18] Qwerty AI: Explainable Automated Age Rating and Content Safety Assessment for Russian-Language Screenplays

Nikita Zmanovskii

Main category: cs.CL

TL;DR: Qwerty AI is an automated system for age-rating Russian screenplays according to Russian law, using fine-tuned Phi-3-mini model to detect content violations and assign age ratings with explanations.

Details

Motivation: Addresses the need for automated content-safety assessment of Russian-language screenplays according to Federal Law No. 436-FZ, solving real editorial challenges in the Russian media industry with production-ready efficiency.

Method: End-to-end system that processes full scripts, segments them into narrative units, detects violations across five categories (violence, sexual content, profanity, substances, frightening elements), and uses fine-tuned Phi-3-mini model with 4-bit quantization for classification.

Result: Achieves 80% rating accuracy and 80-95% segmentation precision, processes 700-page scripts in under 2 minutes, operates within 80GB VRAM limit, and was successfully deployed on Yandex Cloud with CUDA acceleration.

Conclusion: Qwerty AI demonstrates practical applicability for production workflows in the Russian media industry, developed under strict constraints during a hackathon and showing promising results for automated content assessment.

Abstract: We present Qwerty AI, an end-to-end system for automated age-rating and content-safety assessment of Russian-language screenplays according to Federal Law No. 436-FZ. The system processes full-length scripts (up to 700 pages in under 2 minutes), segments them into narrative units, detects content violations across five categories (violence, sexual content, profanity, substances, frightening elements), and assigns age ratings (0+, 6+, 12+, 16+, 18+) with explainable justifications. Our implementation leverages a fine-tuned Phi-3-mini model with 4-bit quantization, achieving 80% rating accuracy and 80-95% segmentation precision (format-dependent). The system was developed under strict constraints: no external API calls, 80GB VRAM limit, and <5 minute processing time for average scripts. Deployed on Yandex Cloud with CUDA acceleration, Qwerty AI demonstrates practical applicability for production workflows. We achieved these results during the Wink hackathon (November 2025), where our solution addressed real editorial challenges in the Russian media industry.

[19] TrueBrief: Faithful Summarization through Small Language Models

Kumud Lakara, Ruibo Shi, Fran Silavong

Main category: cs.CL

TL;DR: TrueBrief is a framework that improves faithfulness of small LLMs for text summarization using preference optimization with controlled hallucination injection.

Details

Motivation: LLMs often produce hallucinations, which is problematic for security-critical applications. Small LLMs need improvement in generating faithful text, especially for summarization tasks.

Method: End-to-end framework with data generation module for controlled hallucination injection to create synthetic preference data, using preference-optimization paradigm to enhance faithfulness.

Result: Provides insights into how data quality and model size affect preference-based optimization, identifying conditions where these methods work best.

Conclusion: TrueBrief effectively enhances faithfulness of small LLMs for text summarization through systematic preference optimization with controlled hallucination data.

Abstract: Large language models (LLMs) have exhibited remarkable proficiency in generating high-quality text; however, their propensity for producing hallucinations poses a significant challenge for their deployment in security-critical domains. In this work, we present TrueBrief, an end-to-end framework specifically designed to enhance the faithfulness of small LLMs (SLMs) primarily for the task of text summarization through a preference-optimization paradigm. Central to our framework is a data generation module that facilitates controlled hallucination injection to generate synthetic preference data. Our work provides insights into the impact of data quality and model size on preference-based optimization, highlighting the conditions under which these methods are most effective.

[20] AnimatedLLM: Explaining LLMs with Interactive Visualizations

Zdeněk Kasner, Ondřej Dušek

Main category: cs.CL

TL;DR: AnimatedLLM is an interactive web app that visualizes Transformer LLM mechanics step-by-step for educational purposes, running entirely in browser with pre-computed traces of open LLMs.

Details

Motivation: LLMs are becoming central to NLP education, but there's a lack of materials that show their internal mechanics and workings in an accessible way.

Method: Created an interactive web application that provides step-by-step visualizations of Transformer language models, running entirely in browser using pre-computed traces of open LLMs applied to manually curated inputs.

Result: The application is available at https://animatedllm.github.io as both a teaching aid and for self-educational purposes, making LLM mechanics more accessible and understandable.

Conclusion: AnimatedLLM addresses the gap in educational materials for understanding LLM mechanics through interactive visualizations, supporting both classroom teaching and individual learning.

Abstract: Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied on manually curated inputs. The application is available at https://animatedllm.github.io, both as a teaching aid and for self-educational purposes.

[21] From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning

Xiaoyu Xu, Minxin Du, Zitong Li, Zi Liang, Zhibiao Guo, Shiyu Zhang, Peizhao Hu, Qingqing Ye, Haibo Hu

Main category: cs.CL

TL;DR: BiForget is an automated framework for synthesizing high-quality forget sets to evaluate LLM unlearning, using the target model itself to generate data that matches its internal knowledge distribution.

Details

Motivation: Current machine unlearning benchmarks often fail to faithfully represent the true "forgetting scope" learned by models, making it difficult to properly evaluate LLM unlearning methods for removing private, harmful, or copyrighted content.

Method: BiForget formalizes two unlearning granularities (domain-level and instance-level) and uses seed-guided and adversarial prompting to exploit the target model itself to elicit data matching its internal knowledge distribution, rather than relying on external generators.

Result: In Harry Potter domain experiments, BiForget improves relevance by ~20 and diversity by ~0.05 while halving total data size compared to state-of-the-art methods, achieving superior balance of relevance, diversity, and efficiency across diverse benchmarks.

Conclusion: BiForget facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning by generating higher quality forget sets that better match models’ internal knowledge distributions.

Abstract: Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true “forgetting scope” learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ${\sim}20$ and diversity by ${\sim}$0.05 while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.

[22] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation

Joseph James, Chenghao Xiao, Yucheng Li, Nafise Sadat Moosavi, Chenghua Lin

Main category: cs.CL

TL;DR: RIGOURATE is a multimodal framework that retrieves supporting evidence from papers and scores claim overstatement, using a dataset of 10K+ claim-evidence sets from ICLR/NeurIPS papers with LLM annotations validated by human evaluation.

Details

Motivation: Scientific papers often overstate claims beyond what their results support, sidelining scientific rigor in favor of bold statements. There's a need to operationalize evidential proportionality and support clearer, more transparent scientific communication.

Method: Two-stage multimodal framework: 1) Fine-tuned reranker for evidence retrieval from paper bodies, 2) Fine-tuned model to predict overstatement scores with justification. Uses dataset of 10K+ claim-evidence sets from ICLR/NeurIPS papers annotated by eight LLMs, with scores calibrated using peer-review comments and human validation.

Result: RIGOURATE enables improved evidence retrieval and overstatement detection compared to strong baselines. The framework successfully operationalizes evidential proportionality in scientific papers.

Conclusion: RIGOURATE supports clearer, more transparent scientific communication by providing a systematic approach to assess claim overstatement and evidence support, addressing the problem of scientific claims being overstated beyond what results actually support.

Abstract: Scientific rigour tends to be sidelined in favour of bold statements, leading authors to overstate claims beyond what their results support. We present RIGOURATE, a two-stage multimodal framework that retrieves supporting evidence from a paper’s body and assigns each claim an overstatement score. The framework consists of a dataset of over 10K claim-evidence sets from ICLR and NeurIPS papers, annotated using eight LLMs, with overstatement scores calibrated using peer-review comments and validated through human evaluation. It employes a fine-tuned reranker for evidence retrieval and a fine-tuned model to predict overstatement scores with justification. Compared to strong baselines, RIGOURATE enables improved evidence retrieval and overstatement detection. Overall, our work operationalises evidential proportionality and supports clearer, more transparent scientific communication.

[23] Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties

Akriti Dhasmana, Aarohi Srivastava, David Chiang

Main category: cs.CL

TL;DR: Cross-lingual ASR transfer study on Indic dialects shows phylogenetic distance matters but doesn’t fully explain dialect performance; fine-tuning on small dialect data can match large high-resource language performance.

Details

Motivation: To understand how ASR systems perform on spontaneous, noisy, code-mixed speech across diverse Indic dialects and language varieties, and to examine cross-lingual transfer patterns in dialectal settings.

Method: Empirical study of cross-lingual transfer using various Indic dialects; includes case study on Garhwali (low-resource Pahari variety); evaluation of multiple contemporary ASR models; analysis of transcription errors to examine bias.

Result: ASR performance improves with reduced phylogenetic distance between languages, but this factor alone doesn’t fully explain dialect performance. Fine-tuning on small dialectal data often yields comparable performance to fine-tuning on large amounts of phylogenetically-related high-resource languages.

Conclusion: Phylogenetic distance is important but insufficient for explaining ASR performance on dialects; small dialect-specific data can be highly effective; ASR systems show bias toward pre-training languages, highlighting challenges for dialectal and non-standardized speech.

Abstract: We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally improved with reduced phylogenetic distance between languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.

[24] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

Dongqi Liu, Hang Ding, Qiming Feng, Jian Li, Xurong Xie, Zhucun Xue, Chengjie Wang, Jiangning Zhang, Yabiao Wang

Main category: cs.CL

TL;DR: Disco-RAG: A discourse-aware RAG framework that uses discourse trees and rhetorical graphs to inject structural cues into generation, achieving SOTA results on QA and summarization without fine-tuning.

Details

Motivation: Existing RAG strategies treat retrieved passages in a flat, unstructured way, which prevents models from capturing structural cues and constrains their ability to synthesize knowledge from dispersed evidence across documents.

Method: Constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation process.

Result: Achieves state-of-the-art results on question answering and long-document summarization benchmarks without requiring fine-tuning.

Conclusion: Discourse structure plays an important role in advancing RAG systems, and explicitly injecting discourse signals into the generation process significantly enhances performance on knowledge-intensive tasks.

Abstract: Retrieval-Augmented Generation (RAG) has emerged as an important means of enhancing the performance of large language models (LLMs) in knowledge-intensive tasks. However, most existing RAG strategies treat retrieved passages in a flat and unstructured way, which prevents the model from capturing structural cues and constrains its ability to synthesize knowledge from dispersed evidence across documents. To overcome these limitations, we propose Disco-RAG, a discourse-aware framework that explicitly injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. These structures are jointly integrated into a planning blueprint that conditions the generation. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach. Disco-RAG achieves state-of-the-art results on the benchmarks without fine-tuning. These findings underscore the important role of discourse structure in advancing RAG systems.

[25] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking

Iago Alves Brito, Walcy Santos Rezende Rios, Julia Soares Dollis, Diogo Fernandes Costa Silva, Arlindo Rodrigues Galvão Filho

Main category: cs.CL

TL;DR: Current LLM safety evaluations create a false sense of universality by aggregating scores that hide systemic vulnerabilities against specific minority groups. The paper introduces MiJaBench, a bilingual adversarial benchmark revealing that safety alignment is not generalized but forms a demographic hierarchy, with disparities worsening with model scaling.

Details

Motivation: Current safety evaluations for large language models aggregate metrics like "Identity Hate" into scalar scores, creating a dangerous illusion of universal safety. This approach masks systemic vulnerabilities against specific minority populations, failing to reveal selective safety patterns where models may protect some groups while failing others.

Method: The authors introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. They generate 528,000 prompt-response pairs from 12 state-of-the-art LLMs and curate MiJaBench-Align to analyze safety alignment patterns. The method examines how defense rates vary across demographic groups and how these disparities change with model scaling.

Result: The study reveals that safety alignment is not a generalized semantic capability but forms a demographic hierarchy, with defense rates fluctuating by up to 33% within the same model based solely on the target group. Crucially, model scaling exacerbates these disparities, showing that current alignment techniques reinforce memorized refusal boundaries only for specific groups rather than creating a principle of non-discrimination.

Conclusion: Current LLM safety alignment techniques do not create universal non-discrimination principles but instead reinforce selective refusal boundaries, forming a demographic hierarchy of protection. The findings challenge current scaling laws of security and highlight the need for granular demographic alignment research. The authors release all datasets and scripts to encourage further investigation into these disparities.

Abstract: Current safety evaluations of large language models (LLMs) create a dangerous illusion of universality, aggregating “Identity Hate” into scalar scores that mask systemic vulnerabilities against specific populations. To expose this selective safety, we introduce MiJaBench, a bilingual (English and Portuguese) adversarial benchmark comprising 44,000 prompts across 16 minority groups. By generating 528,000 prompt-response pairs from 12 state-of-the-art LLMs, we curate MiJaBench-Align, revealing that safety alignment is not a generalized semantic capability but a demographic hierarchy: defense rates fluctuate by up to 33% within the same model solely based on the target group. Crucially, we demonstrate that model scaling exacerbates these disparities, suggesting that current alignment techniques do not create principle of non-discrimination but reinforces memorized refusal boundaries only for specific groups, challenging the current scaling laws of security. We release all datasets and scripts to encourage research into granular demographic alignment at GitHub.

[26] ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models

Sharanya Dasgupta, Arkaprabha Basu, Sujoy Nath, Swagatam Das

Main category: cs.CL

TL;DR: ARREST is a unified framework that regulates LLM hallucinations and unsafe outputs by identifying and correcting drifted features in latent activation space using an external network, without fine-tuning model parameters.

Details

Motivation: LLMs lack human cognition's ability to self-correct between imagination and reality, leading to factual and safety failures. Current approaches treat these as separate alignment issues, but the authors argue both arise from representational misalignment in latent activation space.

Method: Propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth) - an external network trained to understand fluctuations in LLM activations. It selectively intervenes to regulate falsehood into truthfulness and unsafe into safe outputs through both soft/hard refusals and factual corrections, without fine-tuning the base model.

Result: ARREST effectively regulates misalignment and is more versatile than RLHF-aligned models in generating soft refusals due to adversarial training. The framework demonstrates capability to correct both factual and safety issues through targeted intervention.

Conclusion: Factual and safety failures in LLMs stem from representational misalignment in latent space, and an external network can effectively regulate these issues without parameter fine-tuning. ARREST provides a unified approach to address both hallucination and safety problems through selective intervention.

Abstract: Human cognition, driven by complex neurochemical processes, oscillates between imagination and reality and learns to self-correct whenever such subtle drifts lead to hallucinations or unsafe associations. In recent years, LLMs have demonstrated remarkable performance in a wide range of tasks. However, they still lack human cognition to balance factuality and safety. Bearing the resemblance, we argue that both factual and safety failures in LLMs arise from a representational misalignment in their latent activation space, rather than addressing those as entirely separate alignment issues. We hypothesize that an external network, trained to understand the fluctuations, can selectively intervene in the model to regulate falsehood into truthfulness and unsafe output into safe output without fine-tuning the model parameters themselves. Reflecting the hypothesis, we propose ARREST (Adversarial Resilient Regulation Enhancing Safety and Truth), a unified framework that identifies and corrects drifted features, engaging both soft and hard refusals in addition to factual corrections. Our empirical results show that ARREST not only regulates misalignment but is also more versatile compared to the RLHF-aligned models in generating soft refusals due to adversarial training. We make our codebase available at https://github.com/sharanya-dasgupta001/ARREST.

[27] Interpreting Transformers Through Attention Head Intervention

Mason Kadem, Rong Zheng

Main category: cs.CL

TL;DR: The paper argues for mechanistic interpretability of neural networks to understand their decision-making processes, enabling accountability, studying digital cognition, and discovering new knowledge from AI systems.

Details

Motivation: Neural networks are becoming increasingly capable but remain poorly understood. Understanding their internal decision-making mechanisms is crucial for accountability in high-stakes applications, studying the emergence of cognition in digital systems, and leveraging AI systems that outperform humans to discover new knowledge.

Method: The paper advocates for mechanistic interpretability as an approach to reverse-engineer neural networks and understand their internal computational processes and decision-making mechanisms.

Result: The paper presents a framework for why mechanistic interpretability is important, outlining three key benefits: enabling accountability and control, facilitating the study of digital brains and cognitive emergence, and allowing discovery of new knowledge from superior AI systems.

Conclusion: Mechanistic interpretability is essential for understanding neural networks’ decision-making processes, with significant implications for AI safety, cognitive science research, and knowledge discovery from advanced AI systems.

Abstract: Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms’ decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans.

[28] Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

Yao Dou, Wei Xu

Main category: cs.CL

TL;DR: The paper introduces Gavel-Ref, a reference-based evaluation framework for multi-document legal case summarization, and Gavel-Agent, an autonomous agent scaffold that reduces token usage while maintaining performance on complex long-context tasks.

Details

Motivation: LLMs now support contexts up to 1M tokens, but their effectiveness on complex long-context tasks like multi-document legal case summarization (100K-500K tokens per case) remains unclear. Current evaluations report only single aggregate scores, lacking systematic analysis of model performance on specific aspects.

Method: 1) Introduced Gavel-Ref: a reference-based evaluation framework with multi-value checklist evaluation over 26 items, plus residual fact and writing-style evaluations. 2) Systematically evaluated 12 frontier LLMs on 100 legal cases (32K-512K tokens). 3) Developed Gavel-Agent: an efficient autonomous agent scaffold with six tools to navigate/extract checklists directly from case documents.

Result: Even the strongest model (Gemini 2.5 Pro) achieved only ~50% on SGavel-Ref, showing task difficulty. Models performed well on simple checklist items but struggled on multi-value or rare ones. Gavel-Agent with Qwen3 reduced token usage by 36% with only 7% drop in Schecklist compared to end-to-end extraction with GPT-4.1.

Conclusion: Multi-document legal summarization remains challenging for current LLMs. The proposed Gavel-Ref framework enables detailed evaluation beyond aggregate scores, while Gavel-Agent offers an efficient approach for long-context tasks as human references become less reliable with improving LLMs.

Abstract: Large language models (LLMs) now support contexts of up to 1M tokens, but their effectiveness on complex long-context tasks remains unclear. In this paper, we study multi-document legal case summarization, where a single case often spans many documents totaling 100K-500K tokens. We introduce Gavel-Ref, a reference-based evaluation framework with multi-value checklist evaluation over 26 items, as well as residual fact and writing-style evaluations. Using Gavel-Ref, we go beyond the single aggregate scores reported in prior work and systematically evaluate 12 frontier LLMs on 100 legal cases ranging from 32K to 512K tokens, primarily from 2025. Our results show that even the strongest model, Gemini 2.5 Pro, achieves only around 50 of $S_{\text{Gavel-Ref}}$, highlighting the difficulty of the task. Models perform well on simple checklist items (e.g., filing date) but struggle on multi-value or rare ones such as settlements and monitor reports. As LLMs continue to improve and may surpass human-written summaries – making human references less reliable – we develop Gavel-Agent, an efficient and autonomous agent scaffold that equips LLMs with six tools to navigate and extract checklists directly from case documents. With Qwen3, Gavel-Agent reduces token usage by 36% while resulting in only a 7% drop in $S_{\text{checklist}}$ compared to end-to-end extraction with GPT-4.1.

[29] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs

Myra Cheng, Robert D. Hawkins, Dan Jurafsky

Main category: cs.CL

TL;DR: LLMs often fail to challenge harmful user beliefs due to excessive accommodation and insufficient epistemic vigilance. Simple pragmatic interventions like “wait a minute” significantly improve safety performance.

Details

Motivation: LLMs frequently fail to challenge users' harmful beliefs in domains like medical advice and social reasoning, which poses safety concerns. The paper aims to understand these failures as pragmatic issues related to LLMs defaulting to accommodating user assumptions and lacking epistemic vigilance.

Method: The study examines how social and linguistic factors (at-issueness, linguistic encoding, source reliability) affect accommodation in LLMs, similar to human behavior. Tests performance across three safety benchmarks: Cancer-Myth, SAGE-Eval (misinformation), and ELEPHANT (sycophancy). Introduces simple pragmatic interventions like adding “wait a minute” to prompts.

Result: Social and linguistic factors influence LLM accommodation similarly to humans. Simple pragmatic interventions significantly improve performance on safety benchmarks while maintaining low false-positive rates. The phrase “wait a minute” was particularly effective.

Conclusion: LLM safety failures can be understood through pragmatic accommodation patterns. Considering pragmatics is crucial for evaluating LLM behavior and improving safety. Simple interventions can effectively enhance models’ ability to challenge harmful beliefs.

Abstract: Large language models (LLMs) frequently fail to challenge users’ harmful beliefs in domains ranging from medical advice to social reasoning. We argue that these failures can be understood and addressed pragmatically as consequences of LLMs defaulting to accommodating users’ assumptions and exhibiting insufficient epistemic vigilance. We show that social and linguistic factors known to influence accommodation in humans (at-issueness, linguistic encoding, and source reliability) similarly affect accommodation in LLMs, explaining performance differences across three safety benchmarks that test models’ ability to challenge harmful beliefs, spanning misinformation (Cancer-Myth, SAGE-Eval) and sycophancy (ELEPHANT). We further show that simple pragmatic interventions, such as adding the phrase “wait a minute”, significantly improve performance on these benchmarks while preserving low false-positive rates. Our results highlight the importance of considering pragmatics for evaluating LLM behavior and improving LLM safety.

[30] Learning to Simulate Human Dialogue

Kanishk Gandhi, Agam Bhatia, Noah D. Goodman

Main category: cs.CL

TL;DR: Optimizing for LLM-as-a-judge rewards improves judge scores but decreases human-likeness, while directly maximizing log-probability of human responses yields better prediction of actual human dialogue.

Details

Motivation: To understand how to better model human thinking through next-turn dialogue prediction, comparing different learning approaches for predicting what people actually say in conversations.

Method: Compare learning approaches along two dimensions: (1) thinking before responding (chain-of-thought), and (2) reward types (LLM-as-a-judge scoring vs. maximizing log-probability of human responses). Derive lower bound on log-probability treating chain-of-thought as latent variable.

Result: Optimizing for judge-based rewards increases judge scores but decreases likelihood of ground truth human responses and human-likeness win rates. Directly maximizing log-probability of human responses improves both log-probability and win rate evaluations. Chain-of-thought as latent variable optimization yields best results on all evaluations.

Conclusion: Thinking helps primarily when trained with distribution-matching objectives grounded in real human dialogue. Scaling this approach may produce models with more nuanced understanding of human behavior.

Abstract: To predict what someone will say is to model how they think. We study this through next-turn dialogue prediction: given a conversation, predict the next utterance produced by a person. We compare learning approaches along two dimensions: (1) whether the model is allowed to think before responding, and (2) how learning is rewarded either through an LLM-as-a-judge that scores semantic similarity and information completeness relative to the ground-truth response, or by directly maximizing the log-probability of the true human dialogue. We find that optimizing for judge-based rewards indeed increases judge scores throughout training, however it decreases the likelihood assigned to ground truth human responses and decreases the win rate when human judges choose the most human-like response among a real and synthetic option. This failure is amplified when the model is allowed to think before answering. In contrast, by directly maximizing the log-probability of observed human responses, the model learns to better predict what people actually say, improving on both log-probability and win rate evaluations. Treating chain-of-thought as a latent variable, we derive a lower bound on the log-probability. Optimizing this objective yields the best results on all our evaluations. These results suggest that thinking helps primarily when trained with a distribution-matching objective grounded in real human dialogue, and that scaling this approach to broader conversational data may produce models with a more nuanced understanding of human behavior.

[31] Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

San Kim, Gary Geunbae Lee

Main category: cs.CL

TL;DR: MB-Defense is a two-stage training pipeline that immunizes instruction-tuned LLMs against backdoor attacks by merging attacker and defensive triggers into unified backdoor representations, then breaking them to restore clean behavior.

Details

Motivation: Instruction-tuned LLMs are vulnerable to backdoor attacks through poisoned training data, but defenses for these models remain underexplored despite the growing security risk.

Method: Two-stage framework: (1) defensive poisoning merges attacker and defensive triggers into unified backdoor representations, (2) weight recovery breaks these representations through additional training to restore clean behavior.

Result: Extensive experiments show MB-Defense substantially lowers attack success rates while preserving instruction-following ability, offering generalizable and data-efficient defense against unseen backdoor attacks.

Conclusion: MB-Defense provides an effective defense strategy that improves robustness of instruction-tuned LLMs against diverse backdoor threats without compromising their core functionality.

Abstract: Large Language Models (LLMs) have greatly advanced Natural Language Processing (NLP), particularly through instruction tuning, which enables broad task generalization without additional fine-tuning. However, their reliance on large-scale datasets-often collected from human or web sources-makes them vulnerable to backdoor attacks, where adversaries poison a small subset of data to implant hidden behaviors. Despite this growing risk, defenses for instruction-tuned models remain underexplored. We propose MB-Defense (Merging & Breaking Defense Framework), a novel training pipeline that immunizes instruction-tuned LLMs against diverse backdoor threats. MB-Defense comprises two stages: (i) defensive poisoning, which merges attacker and defensive triggers into a unified backdoor representation, and (ii) weight recovery, which breaks this representation through additional training to restore clean behavior. Extensive experiments across multiple LLMs show that MB-Defense substantially lowers attack success rates while preserving instruction-following ability. Our method offers a generalizable and data-efficient defense strategy, improving the robustness of instruction-tuned LLMs against unseen backdoor attacks.

[32] Users Mispredict Their Own Preferences for AI Writing Assistance

Vivian Lai, Zana Buçinca, Nil-Jana Akpinar, Mo Houtti, Hyeonsu B. Kang, Kevin Chian, Namjoon Suh, Alex C. Williams

Main category: cs.CL

TL;DR: Users’ stated preferences for AI writing assistance don’t match their actual behavior - they say urgency matters most but actually respond to compositional effort, creating a preference inversion that misleads system design.

Details

Motivation: Proactive AI writing assistants need to predict when users want help, but there's a lack of empirical understanding about what drives user preferences for drafting assistance.

Method: Factorial vignette study with 50 participants making 750 pairwise comparisons to analyze drivers of user preferences for AI writing assistance.

Result: Compositional effort dominates decisions (ρ=0.597) while urgency shows no predictive power (ρ≈0). Users rank urgency first in self-reports despite it being weakest behavioral driver - a complete preference inversion. Systems using stated preferences achieve only 57.7% accuracy vs 61.3% for behavioral patterns.

Conclusion: Relying on user introspection for system design actively misleads optimization; behavioral data should guide proactive NLG systems rather than self-reported preferences.

Abstract: Proactive AI writing assistants need to predict when users want drafting help, yet we lack empirical understanding of what drives preferences. Through a factorial vignette study with 50 participants making 750 pairwise comparisons, we find compositional effort dominates decisions ($ρ= 0.597$) while urgency shows no predictive power ($ρ\approx 0$). More critically, users exhibit a striking perception-behavior gap: they rank urgency first in self-reports despite it being the weakest behavioral driver, representing a complete preference inversion. This misalignment has measurable consequences. Systems designed from users’ stated preferences achieve only 57.7% accuracy, underperforming even naive baselines, while systems using behavioral patterns reach significantly higher 61.3% ($p < 0.05$). These findings demonstrate that relying on user introspection for system design actively misleads optimization, with direct implications for proactive natural language generation (NLG) systems.

[33] Beyond Static Summarization: Proactive Memory Extraction for LLM Agents

Chengyuan Yang, Zequn Sun, Wei Wei, Wei Hu

Main category: cs.CL

TL;DR: ProMem introduces proactive memory extraction with recurrent feedback loops to address limitations of static summarization in LLM agents, improving memory completeness and QA accuracy.

Details

Motivation: Existing summary-based memory extraction methods for LLM agents have two major limitations: 1) summarization is "ahead-of-time" (blind feed-forward process that misses important details without knowing future tasks), and 2) extraction is "one-off" (lacks feedback loop to verify facts, leading to accumulation of information loss).

Method: ProMem treats memory extraction as an iterative cognitive process with recurrent feedback loops. The agent uses self-questioning to actively probe dialogue history, allowing recovery of missing information and error correction.

Result: ProMem significantly improves completeness of extracted memory and QA accuracy. It also achieves superior trade-off between extraction quality and token cost compared to existing methods.

Conclusion: Proactive memory extraction with recurrent feedback loops addresses fundamental limitations of static summarization approaches, providing a more effective memory management solution for LLM agents in long-term interaction and personalization scenarios.

Abstract: Memory management is vital for LLM agents to handle long-term interaction and personalization. Most research focuses on how to organize and use memory summary, but often overlooks the initial memory extraction stage. In this paper, we argue that existing summary-based methods have two major limitations based on the recurrent processing theory. First, summarization is “ahead-of-time”, acting as a blind “feed-forward” process that misses important details because it doesn’t know future tasks. Second, extraction is usually “one-off”, lacking a feedback loop to verify facts, which leads to the accumulation of information loss. To address these issues, we propose proactive memory extraction (namely ProMem). Unlike static summarization, ProMem treats extraction as an iterative cognitive process. We introduce a recurrent feedback loop where the agent uses self-questioning to actively probe the dialogue history. This mechanism allows the agent to recover missing information and correct errors. Our ProMem significantly improves the completeness of the extracted memory and QA accuracy. It also achieves a superior trade-off between extraction quality and token cost.

[34] Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

Ignacio Sastre, Aiala Rosá

Main category: cs.CL

TL;DR: Concept Tokens add special tokens to frozen LLMs, learning embeddings from concept definitions to control model behavior with minimal parameter updates.

Details

Motivation: To create a lightweight method for controlling frozen LLM behavior using compact concept representations learned from definitions, enabling targeted steering without full model retraining.

Method: Add special tokens to pretrained LLMs, learn only their embeddings from multiple natural language definitions (with concept occurrences replaced by tokens), keep LLM frozen, optimize embeddings with standard language modeling objective.

Result: 1) Reduced hallucinations in HotpotQA QA (negating token increases abstentions, asserting increases hallucinations); 2) Induced recasting teaching strategy with same directional effect; 3) Better preserved instruction compliance vs. in-context definitions; 4) Qualitative study shows embeddings capture concept information with limitations.

Conclusion: Concept Tokens provide compact control signals learned from definitions that can effectively steer behavior in frozen LLMs with minimal parameter updates.

Abstract: We propose Concept Tokens, a lightweight method that adds a new special token to a pretrained LLM and learns only its embedding from multiple natural language definitions of a target concept, where occurrences of the concept are replaced by the new token. The LLM is kept frozen and the embedding is optimized with the standard language-modeling objective. We evaluate Concept Tokens in three settings. First, we study hallucinations in closed-book question answering on HotpotQA and find a directional effect: negating the hallucination token reduces hallucinated answers mainly by increasing abstentions, whereas asserting it increases hallucinations and lowers precision. Second, we induce recasting, a pedagogical feedback strategy for second language teaching, and observe the same directional effect. Moreover, compared to providing the full definitional corpus in-context, concept tokens better preserve compliance with other instructions (e.g., asking follow-up questions). Finally, we include a qualitative study with the Eiffel Tower and a fictional “Austral Tower” to illustrate what information the learned embeddings capture and where their limitations emerge. Overall, Concept Tokens provide a compact control signal learned from definitions that can steer behavior in frozen LLMs.

[35] SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

Iaroslav Chelombitko, Ekaterina Chelombitko, Aleksey Komissarov

Main category: cs.CL

TL;DR: SampoNLP toolkit creates morphological lexicons for Uralic languages using MDL-inspired scoring, enabling systematic evaluation of BPE tokenizers and providing optimal vocabulary size recommendations.

Details

Motivation: Evaluating tokenizers for morphologically rich Uralic languages is difficult due to lack of clean morpheme lexicons, which hinders proper assessment of subword tokenization quality.

Method: Developed SampoNLP toolkit using MDL-inspired Self-Referential Atomicity Scoring to create high-purity morphological lexicons from corpus data without requiring existing lexicons. Used these lexicons to systematically evaluate BPE tokenizers across vocabulary sizes (8k-256k) and proposed Integrated Performance Score (IPS) metric to balance morpheme coverage and over-splitting.

Result: Generated high-purity lexicons for Finnish, Hungarian, and Estonian; identified optimal vocabulary sizes via IPS curve analysis; demonstrated limitations of standard BPE for agglutinative languages; provided first empirically grounded recommendations for optimal vocabulary sizes in these languages.

Conclusion: SampoNLP enables corpus-free lexicon creation for low-resource settings, provides practical guidance for tokenizer optimization in Uralic languages, and quantitatively shows BPE limitations for agglutinative languages. Toolkit and resources are publicly available.

Abstract: The quality of subword tokenization is critical for Large Language Models, yet evaluating tokenizers for morphologically rich Uralic languages is hampered by the lack of clean morpheme lexicons. We introduce SampoNLP, a corpus-free toolkit for morphological lexicon creation using MDL-inspired Self-Referential Atomicity Scoring, which filters composite forms through internal structural cues - suited for low-resource settings. Using the high-purity lexicons generated by SampoNLP for Finnish, Hungarian, and Estonian, we conduct a systematic evaluation of BPE tokenizers across a range of vocabulary sizes (8k-256k). We propose a unified metric, the Integrated Performance Score (IPS), to navigate the trade-off between morpheme coverage and over-splitting. By analyzing the IPS curves, we identify the “elbow points” of diminishing returns and provide the first empirically grounded recommendations for optimal vocabulary sizes (k) in these languages. Our study not only offers practical guidance but also quantitatively demonstrates the limitations of standard BPE for highly agglutinative languages. The SampoNLP library and all generated resources are made publicly available: https://github.com/AragonerUA/SampoNLP

[36] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation

Yuxiao Ye, Yiming Zhang, Yiran Ma, Huiyuan Xie, Huining Zhu, Zhiyuan Liu

Main category: cs.CL

TL;DR: LinguaGame: A game-theoretic framework for improving communication efficiency in LLM-based multi-agent systems by modeling dialogue as intentional signaling games.

Details

Motivation: While recent LLM-based multi-agent systems focus on architecture design (role assignment, workflow orchestration), there's a gap in improving the interaction process itself. Current systems don't effectively help agents convey their intended meaning through language, leading to inefficient communication.

Method: Proposes LinguaGame, a linguistically-grounded game-theoretic paradigm that models dialogue as a signaling game over communicative intents and strategies. Uses a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, it relies on linguistically informed reasoning with minimal task-specific coupling.

Result: Evaluated in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency compared to baseline approaches.

Conclusion: Treating dialogue as intentional and strategic communication (requiring agents to infer others’ intents and strategies) significantly improves communication efficiency in LLM-based multi-agent systems, offering a linguistically-grounded alternative to task-specific game designs.

Abstract: Large Language Models (LLMs) have enabled Multi-Agent Systems (MASs) where agents interact through natural language to solve complex tasks or simulate multi-party dialogues. Recent work on LLM-based MASs has mainly focused on architecture design, such as role assignment and workflow orchestration. In contrast, this paper targets the interaction process itself, aiming to improve agents’ communication efficiency by helping them convey their intended meaning more effectively through language. To this end, we propose LinguaGame, a linguistically-grounded game-theoretic paradigm for multi-agent dialogue generation. Our approach models dialogue as a signalling game over communicative intents and strategies, solved with a training-free equilibrium approximation algorithm for inference-time decision adjustment. Unlike prior game-theoretic MASs, whose game designs are often tightly coupled with task-specific objectives, our framework relies on linguistically informed reasoning with minimal task-specific coupling. Specifically, it treats dialogue as intentional and strategic communication, requiring agents to infer what others aim to achieve (intents) and how they pursue those goals (strategies). We evaluate our framework in simulated courtroom proceedings and debates, with human expert assessments showing significant gains in communication efficiency.

[37] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence

Yibo Zhao, Jiapeng Zhu, Zichen Ding, Xiang Li

Main category: cs.CL

TL;DR: GRACE is a reinforcement learning framework that simultaneously addresses two critical flaws in RAG systems: providing correct answers without evidence and fabricating responses when context is insufficient.

Details

Motivation: Current RAG systems suffer from two key problems: (1) giving correct answers without explicit grounded evidence, and (2) producing fabricated responses when retrieved context is insufficient. While prior work addressed these issues separately, there's no unified framework integrating evidence-based grounding with reliable abstention.

Method: GRACE uses reinforcement learning with a data construction method employing heterogeneous retrievers to generate diverse training samples without manual annotation. It features a multi-stage gated reward function that trains models to: assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain.

Result: On two benchmarks, GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs compared to prior methods.

Conclusion: GRACE provides an effective unified framework that simultaneously mitigates both evidence-grounding and abstention problems in RAG systems, achieving strong performance with significantly reduced annotation costs.

Abstract: Retrieval-Augmented Generation (RAG) integrates external knowledge to enhance Large Language Models (LLMs), yet systems remain susceptible to two critical flaws: providing correct answers without explicit grounded evidence and producing fabricated responses when the retrieved context is insufficient. While prior research has addressed these issues independently, a unified framework that integrates evidence-based grounding and reliable abstention is currently lacking. In this paper, we propose GRACE, a reinforcement-learning framework that simultaneously mitigates both types of flaws. GRACE employs a data construction method that utilizes heterogeneous retrievers to generate diverse training samples without manual annotation. A multi-stage gated reward function is then employed to train the model to assess evidence sufficiency, extract key supporting evidence, and provide answers or explicitly abstain. Experimental results on two benchmarks demonstrate that GRACE achieves state-of-the-art overall accuracy and strikes a favorable balance between accurate response and rejection, while requiring only 10% of the annotation costs of prior methods. Our code is available at https://github.com/YiboZhao624/Grace..

[38] BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation

Amit Bin Tariqul, A N M Zahid Hossain Milkan, Sahab-Al-Chowdhury, Syed Rifat Raiyan, Hasan Mahmud, Md Kamrul Hasan

Main category: cs.CL

TL;DR: First systematic evaluation of text watermarking methods for Bangla LLM generation shows existing methods fail under cross-lingual translation attacks, but proposed layered watermarking improves robustness 3-4×.

Details

Motivation: Existing watermarking methods work well for high-resource languages but their robustness in low-resource languages like Bangla is unknown, especially under cross-lingual attacks like round-trip translation.

Method: Systematically evaluate three SOTA watermarking methods (KGW, EXP, Waterfall) for Bangla LLM text generation under benign and cross-lingual round-trip translation attacks. Propose layered watermarking combining embedding-time and post-generation watermarks.

Result: Under benign conditions, KGW and EXP achieve >88% detection accuracy with minimal quality degradation. However, RTT attacks collapse detection to 9-13%. Layered watermarking improves post-RTT accuracy by 25-35%, achieving 40-50% accuracy (3-4× improvement over single methods).

Conclusion: Token-level watermarking fundamentally fails under cross-lingual attacks for low-resource languages. Layered watermarking provides practical, training-free solution with controlled quality trade-off, establishing robustness-quality trade-off in multilingual watermarking.

Abstract: As large language models (LLMs) are increasingly deployed for text generation, watermarking has become essential for authorship attribution, intellectual property protection, and misuse detection. While existing watermarking methods perform well in high-resource languages, their robustness in low-resource languages remains underexplored. This work presents the first systematic evaluation of state-of-the-art text watermarking methods: KGW, Exponential Sampling (EXP), and Waterfall, for Bangla LLM text generation under cross-lingual round-trip translation (RTT) attacks. Under benign conditions, KGW and EXP achieve high detection accuracy (>88%) with negligible perplexity and ROUGE degradation. However, RTT causes detection accuracy to collapse below RTT causes detection accuracy to collapse to 9-13%, indicating a fundamental failure of token-level watermarking. To address this, we propose a layered watermarking strategy that combines embedding-time and post-generation watermarks. Experimental results show that layered watermarking improves post-RTT detection accuracy by 25-35%, achieving 40-50% accuracy, representing a 3$\times$ to 4$\times$ relative improvement over single-layer methods, at the cost of controlled semantic degradation. Our findings quantify the robustness-quality trade-off in multilingual watermarking and establish layered watermarking as a practical, training-free solution for low-resource languages such as Bangla. Our code and data will be made public.

[39] Identifying Good and Bad Neurons for Task-Level Controllable LLMs

Wenjie Li, Guansong Pang, Hezhe Qiao, Debin Gao, David Lo

Main category: cs.CL

TL;DR: NeuronLLM is a task-level framework that identifies both supportive and inhibitive neurons in LLMs using functional antagonism principles and contrastive learning to mitigate fortuitous behaviors.

Details

Motivation: Current neuron attribution methods are ability-specific and only focus on supportive neurons, ignoring inhibitive roles and being misled by fortuitous behaviors where LLMs answer correctly by chance rather than genuine understanding.

Method: NeuronLLM adopts biological functional antagonism principles, modeling both good (facilitative) and bad (inhibitive) neurons via contrastive learning, and uses augmented question sets to reduce fortuitous behaviors.

Result: Comprehensive experiments on various LLM sizes and families show NeuronLLM’s superiority over existing methods across four NLP tasks, providing new insights into LLM functional organization.

Conclusion: NeuronLLM offers a more holistic approach to understanding LLM neurons by considering both supportive and inhibitive roles, addressing limitations of previous methods and providing better task-level neuron attribution.

Abstract: Large Language Models have demonstrated remarkable capabilities on multiple-choice question answering benchmarks, but the complex mechanisms underlying their large-scale neurons remain opaque, posing significant challenges for understanding and steering LLMs. While recent studies made progress on identifying responsible neurons for certain abilities, these ability-specific methods are infeasible for task-focused scenarios requiring coordinated use of multiple abilities. Moreover, these approaches focus only on supportive neurons that correlate positively with task completion, while neglecting neurons with other roles-such as inhibitive roles-and misled neuron attribution due to fortuitous behaviors in LLMs (i.e., correctly answer the questions by chance rather than genuine understanding). To address these challenges, we propose NeuronLLM, a novel task-level LLM understanding framework that adopts the biological principle of functional antagonism for LLM neuron identification. The key insight is that task performance is jointly determined by neurons with two opposing roles: good neurons that facilitate task completion and bad neurons that inhibit it. NeuronLLM achieves a holistic modeling of neurons via contrastive learning of good and bad neurons, while leveraging augmented question sets to mitigate the fortuitous behaviors in LLMs. Comprehensive experiments on LLMs of different sizes and families show the superiority of NeuronLLM over existing methods in four NLP tasks, providing new insights into LLM functional organization.

[40] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

Seongyeub Chu, Jongwoo Kim, Munyong Yi

Main category: cs.CL

TL;DR: FeedEval is an LLM-based framework that evaluates LLM-generated essay feedback on specificity, helpfulness, and validity dimensions, filtering high-quality feedback to improve essay scoring models and revision outcomes.

Details

Motivation: Current automated essay scoring research focuses on generating high-quality feedback, but often uses LLM-generated feedback without quality validation, leading to noise propagation in downstream applications.

Method: FeedEval uses dimension-specialized LLM evaluators trained on curated datasets to assess multiple feedback candidates along three pedagogically grounded dimensions (specificity, helpfulness, validity) and select high-quality feedback.

Result: FeedEval aligns closely with human expert judgments, and essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance on ASAP++ benchmark. High-quality feedback identified by FeedEval also leads to more effective essay revisions with small LLMs.

Conclusion: FeedEval effectively addresses the quality validation gap in LLM-generated essay feedback, improving both scoring model performance and revision outcomes, with code and datasets to be released.

Abstract: Going beyond the prediction of numerical scores, recent research in automated essay scoring has increasingly emphasized the generation of high-quality feedback that provides justification and actionable guidance. To mitigate the high cost of expert annotation, prior work has commonly relied on LLM-generated feedback to train essay assessment models. However, such feedback is often incorporated without explicit quality validation, resulting in the propagation of noise in downstream applications. To address this limitation, we propose FeedEval, an LLM-based framework for evaluating LLM-generated essay feedback along three pedagogically grounded dimensions: specificity, helpfulness, and validity. FeedEval employs dimension-specialized LLM evaluators trained on datasets curated in this study to assess multiple feedback candidates and select high-quality feedback for downstream use. Experiments on the ASAP++ benchmark show that FeedEval closely aligns with human expert judgments and that essay scoring models trained with FeedEval-filtered high-quality feedback achieve superior scoring performance. Furthermore, revision experiments using small LLMs show that the high-quality feedback identified by FeedEval leads to more effective essay revisions. We will release our code and curated datasets upon accepted.

[41] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

Mizanur Rahman, Mohammed Saidul Islam, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque

Main category: cs.CL

TL;DR: RL-Text2Vis is a reinforcement learning framework that improves text-to-visualization generation by optimizing for textual accuracy, code validity, and visualization quality using post-execution feedback, achieving significant improvements over existing methods.

Details

Motivation: Current Text2Vis systems using LLMs produce visualizations that often lack semantic alignment and clarity. While closed-source models generate functional code, the visualizations are poor, and open-source models struggle with executability. Supervised fine-tuning improves code execution but fails to enhance overall visualization quality since traditional loss functions can't capture post-execution feedback.

Method: Proposes RL-Text2Vis, a reinforcement learning framework built on Group Relative Policy Optimization (GRPO). Uses a novel multi-objective reward function that jointly optimizes three aspects: textual accuracy (how well the visualization answers the query), code validity (executability), and visualization quality (post-execution assessment of chart clarity and effectiveness).

Result: Achieves 22% relative improvement in chart quality over GPT-4o on Text2Vis benchmark. Boosts code execution success from 78% to 97% relative to zero-shot baseline. Models (Qwen2.5 7B and 14B) significantly outperform zero-shot and supervised baselines and demonstrate robust generalization to out-of-domain datasets (VIS-Eval and NVBench).

Conclusion: RL-Text2Vis establishes GRPO as an effective strategy for structured, multimodal reasoning in visualization generation, successfully addressing the limitations of existing approaches by incorporating post-execution feedback through reinforcement learning.

Abstract: Text-to-Visualization (Text2Vis) systems translate natural language queries over tabular data into concise answers and executable visualizations. While closed-source LLMs generate functional code, the resulting charts often lack semantic alignment and clarity, qualities that can only be assessed post-execution. Open-source models struggle even more, frequently producing non-executable or visually poor outputs. Although supervised fine-tuning can improve code executability, it fails to enhance overall visualization quality, as traditional SFT loss cannot capture post-execution feedback. To address this gap, we propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Built on Group Relative Policy Optimization (GRPO), our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality using post-execution feedback. By training Qwen2.5 models (7B and 14B), RL-Text2Vis achieves a 22% relative improvement in chart quality over GPT-4o on the Text2Vis benchmark and boosts code execution success from 78% to 97% relative to its zero-shot baseline. Our models significantly outperform strong zero-shot and supervised baselines and also demonstrate robust generalization to out-of-domain datasets like VIS-Eval and NVBench. These results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation. We release our code at https://github.com/vis-nlp/RL-Text2Vis.

[42] THaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai – Technical Report

KBTG Labs, :, Anuruth Lertpiya, Danupat Khamnuansin, Kantapong Sucharitpongpan, Pornchanan Balee, Tawunrat Chalothorn, Thadpong Pongthawornkamol, Monchai Lertsutthiwong

Main category: cs.CL

TL;DR: Model merging enables efficient creation of multi-capability LLMs by combining specialized models, demonstrated through merging Qwen-8B with Thai language and financial models to improve performance across Thai language and financial domains.

Details

Motivation: Organizations need multi-capability LLMs for banking/finance but face trade-offs between deploying multiple specialized models vs. expensive training of single multi-capability models. Privacy/regulatory concerns favor on-premise deployment, and Thai language capabilities need enhancement for local industry adoption.

Method: Explored model merging as resource-efficient alternative. Conducted two experiments: 1) Merged Qwen-8B with ThaiLLM-8B to enhance Thai language capabilities; 2) Merged Qwen-8B with both ThaiLLM-8B and THaLLE-CFA-8B to combine general, Thai language, and financial capabilities.

Result: First experiment showed ThaiLLM-8B enhanced Thai general capabilities with uplift in M3 and M6 O-NET exams over base Qwen-8B. Second experiment demonstrated further improvements across both general and financial domains with uplift in M3/M6 O-NET, Flare-CFA, and Thai-IC benchmarks.

Conclusion: Model merging is a viable, resource-efficient approach for creating high-performance multi-capability LLMs, enabling organizations to leverage specialized models without prohibitive training costs while addressing privacy/regulatory concerns through on-premise deployment.

Abstract: Large Language Models (LLMs) have demonstrated significant potential across various domains, particularly in banking and finance, where they can automate complex tasks and enhance decision-making at scale. Due to privacy, security, and regulatory concerns, organizations often prefer on-premise deployment of LLMs. The ThaiLLM initiative aims to enhance Thai language capabilities in open-LLMs, enabling Thai industry to leverage advanced language models. However, organizations often face a trade-off between deploying multiple specialized models versus the prohibitive expense of training a single multi-capability model. To address this, we explore model merging as a resource-efficient alternative for developing high-performance, multi-capability LLMs. We present results from two key experiments: first, merging Qwen-8B with ThaiLLM-8B demonstrates how ThaiLLM-8B enhances Thai general capabilities, showing an uplift of M3 and M6 O-NET exams over the general instruction-following Qwen-8B. Second, we merge Qwen-8B with both ThaiLLM-8B and THaLLE-CFA-8B. This combination results in further improvements in performance across both general and financial domains, by demonstrating an uplift in both M3 and M6 O-NET, Flare-CFA, and Thai-IC benchmarks. The report showcases the viability of model merging for efficiently creating multi-capability LLMs.

[43] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions

Zhiyuan He, Binghan Chen, Tianxiang Xiong, Ziyang Sun, Mozhao Zhu, Xi Chen

Main category: cs.CL

TL;DR: ROME knowledge editing struggles with multi-hop reasoning due to three failure modes: hopping-too-late, generalization decay, and overfitting. Redundant Editing improves 2-hop accuracy by 15.5+ percentage points.

Details

Motivation: Current knowledge editing methods like ROME work well for single-hop fact updates but fail on multi-hop reasoning tasks requiring knowledge chaining, limiting their practical applicability for complex reasoning.

Method: Analyzed ROME editing across different layer depths, identified three failure modes, and proposed Redundant Editing strategy to address hopping-too-late and generalization issues.

Result: Redundant Editing improves accuracy on 2-hop questions by at least 15.5 percentage points (96% increase over single-edit strategy), though with trade-offs in specificity and language naturalness.

Conclusion: Redundant Editing effectively addresses key limitations of knowledge editing for multi-hop reasoning, significantly improving performance while highlighting remaining challenges in maintaining specificity and natural language quality.

Abstract: Recent advances in Knowledge Editing (KE), particularly Rank-One Model Editing (ROME), show superior efficiency over fine-tuning and in-context learning for updating single-hop facts in transformers. However, these methods face significant challenges when applied to multi-hop reasoning tasks requiring knowledge chaining. In this work, we study the effect of editing knowledge with ROME on different layer depths and identify three key failure modes. First, the “hopping-too-late” problem occurs as later layers lack access to necessary intermediate representations. Second, generalization ability deteriorates sharply when editing later layers. Third, the model overfits to edited knowledge, incorrectly prioritizing edited-hop answers regardless of context. To mitigate the issues of “hopping-too-late” and generalisation decay, we propose Redundant Editing, a simple yet effective strategy that enhances multi-hop reasoning. Our experiments demonstrate that this approach can improve accuracy on 2-hop questions by at least 15.5 percentage points, representing a 96% increase over the previous single-edit strategy, while trading off some specificity and language naturalness.

[44] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

Rhea Kapur, Robert Hawkins, Elisa Kreiss

Main category: cs.CL

TL;DR: VLMs conflate description specificity with length; the paper argues they should be disentangled, showing that specificity (ability to pick out target image) matters more than verbosity.

Details

Motivation: Current vision-language models often conflate description specificity with length, assuming longer descriptions are more specific. The authors argue these concepts must be disentangled since descriptions can be concise yet informative or lengthy yet vacuous.

Method: Define specificity relative to a contrast set (how well description picks out target image vs. other images). Construct dataset controlling for length while varying information content. Validate through human preference studies.

Result: People reliably prefer more specific descriptions regardless of length. Controlling for length alone cannot account for specificity differences - how length budget is allocated matters significantly.

Conclusion: Evaluation approaches should directly prioritize specificity over verbosity. The distinction between specificity and length is crucial for developing better vision-language models that generate truly informative descriptions.

Abstract: Vision-language models (VLMs) are increasingly used to make visual content accessible via text-based descriptions. In current systems, however, description specificity is often conflated with their length. We argue that these two concepts must be disentangled: descriptions can be concise yet dense with information, or lengthy yet vacuous. We define specificity relative to a contrast set, where a description is more specific to the extent that it picks out the target image better than other possible images. We construct a dataset that controls for length while varying information content, and validate that people reliably prefer more specific descriptions regardless of length. We find that controlling for length alone cannot account for differences in specificity: how the length budget is allocated makes a difference. These results support evaluation approaches that directly prioritize specificity over verbosity.

[45] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR

Yihong Tang, Kehai Chen, Xuefeng Bai, Benyou Wang, Zeming Liu, Haifeng Wang, Min Zhang

Main category: cs.CL

TL;DR: Character-R1 is a framework that provides verifiable reward signals for role-playing agents to improve cognitive consistency and reduce out-of-character errors through structured internal cognition analysis, reference-guided optimization, and character-conditioned reward normalization.

Details

Motivation: Current role-playing agents imitate surface-level behaviors but lack internal cognitive consistency, leading to out-of-character errors in complex situations. There's a need for comprehensive verifiable reward signals for effective role-aware reasoning that existing methods don't provide.

Method: Character-R1 framework with three core designs: (1) Cognitive Focus Reward - explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward - overlap-based metrics with reference responses as optimization anchors; (3) Character-Conditioned Reward Normalization - adjusts reward distributions based on character categories for robust optimization across heterogeneous roles.

Result: Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods in knowledge, memory and other metrics.

Conclusion: Character-R1 provides a comprehensive framework for verifiable reward signals that addresses the cognitive consistency problem in role-playing agents, enabling more effective role-aware reasoning and reducing out-of-character errors.

Abstract: Current role-playing agents (RPAs) are typically constructed by imitating surface-level behaviors, but this approach lacks internal cognitive consistency, often causing out-of-character errors in complex situations. To address this, we propose Character-R1, a framework designed to provide comprehensive verifiable reward signals for effective role-aware reasoning, which are missing in recent studies. Specifically, our framework comprises three core designs: (1) Cognitive Focus Reward, which enforces explicit label-based analysis of 10 character elements (e.g., worldview) to structure internal cognition; (2) Reference-Guided Reward, which utilizes overlap-based metrics with reference responses as optimization anchors to enhance exploration and performance; and (3) Character-Conditioned Reward Normalization, which adjusts reward distributions based on character categories to ensure robust optimization across heterogeneous roles. Extensive experiments demonstrate that Character-R1 significantly outperforms existing methods in knowledge, memory and others.

[46] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset

Haneul Yoo, Won Ik Cho, Geunhye Kim, Jiyoon Han

Main category: cs.CL

TL;DR: CuCu framework uses national social studies curricula to create culture-specific QA datasets, addressing LLM cultural bias by generating KCaQA with 34.1k Korean culture-grounded QA pairs.

Details

Motivation: LLMs show uneven performance across languages and cultures due to English-centric training data biases, creating need for practical cultural alignment methods.

Method: CuCu: automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs, applied to Korean social studies curriculum.

Result: Created KCaQA dataset with 34.1k open-ended QA pairs covering Korean culture-specific topics with responses grounded in local sociocultural contexts.

Conclusion: National curricula provide scalable foundation for culture-aware supervision, enabling LLMs to better reflect diverse cultural values beyond English-centric biases.

Abstract: Large language models (LLMs) achieve strong performance on many tasks, but their progress remains uneven across languages and cultures, often reflecting values latent in English-centric training data. To enable practical cultural alignment, we propose a scalable approach that leverages national social studies curricula as a foundation for culture-aware supervision. We introduce CuCu, an automated multi-agent LLM framework that transforms national textbook curricula into open-ended, culture-specific question-answer pairs. Applying CuCu to the Korean national social studies curriculum, we construct KCaQA, comprising 34.1k open-ended QA pairs. Our quantitative and qualitative analyses suggest that KCaQA covers culture-specific topics and produces responses grounded in local sociocultural contexts.

[47] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

Anyang Song, Ying Cheng, Yiqian Xu, Rui Feng

Main category: cs.CL

TL;DR: Proposes MAGA (Machine-Augment-Generated Text via Alignment) to improve LLM-generated text alignment for better detector robustness testing and generalization enhancement.

Details

Motivation: As LLM alignment evolves, machine-generated text becomes harder to distinguish from human-written text, exacerbating abuse issues like fake news and fraud. Existing detectors have limited generalization that depends on dataset quality, and simply expanding MGT sources is insufficient.

Method: MAGA pipeline achieves comprehensive alignment from prompt construction to reasoning process, featuring RLDF (Reinforced Learning from Detectors Feedback) as a key component to systematically enhance text alignment.

Result: RoBERTa detector fine-tuned on MAGA training set achieved 4.60% average improvement in generalization detection AUC. MAGA Dataset caused average 8.13% decrease in selected detectors’ AUC, demonstrating effectiveness for robustness testing.

Conclusion: MAGA provides an effective approach to enhance LLM-generated text alignment, improving detector generalization while enabling better robustness testing, with potential significance for future detector research.

Abstract: Large Language Models (LLMs) alignment is constantly evolving. Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. Fine-tuned detectors’ generalization ability is highly dependent on dataset quality, and simply expanding the sources of MGT is insufficient. Further augment of generation process is required. According to HC-Var’s theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose \textbf{M}achine-\textbf{A}ugment-\textbf{G}enerated Text via \textbf{A}lignment (MAGA). MAGA’s pipeline achieves comprehensive alignment from prompt construction to reasoning process, among which \textbf{R}einforced \textbf{L}earning from \textbf{D}etectors \textbf{F}eedback (RLDF), systematically proposed by us, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on MAGA training set achieved an average improvement of 4.60% in generalization detection AUC. MAGA Dataset caused an average decrease of 8.13% in the AUC of the selected detectors, expecting to provide indicative significance for future research on the generalization detection ability of detectors.

[48] SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation

Sirry Chen, Jieyi Wang, Wei Chen, Zhongyu Wei

Main category: cs.CL

TL;DR: SpeechMedAssist enables speech-based medical consultations using a two-stage training approach that reduces speech data requirements to only 10k synthesized samples.

Details

Motivation: Medical consultations are naturally speech-centric, but existing approaches rely on cumbersome text-based interactions. While SpeechLMs offer more natural interaction, they face challenges due to scarce medical speech data and inefficient fine-tuning methods.

Method: A two-stage training paradigm: (1) Knowledge & Capability Injection via Text to build medical knowledge, and (2) Modality Re-alignment with Limited Speech Data using only 10k synthesized speech samples. This exploits SpeechLM architectural properties to decouple training.

Result: SpeechMedAssist outperforms all baselines in both effectiveness and robustness across single-turn QA and multi-turn simulated interactions in medical consultation scenarios.

Conclusion: The proposed approach successfully enables speech-based medical consultation with minimal speech data requirements, demonstrating superior performance over existing methods while maintaining natural patient interaction.

Abstract: Medical consultations are intrinsically speech-centric. However, most prior works focus on long-text-based interactions, which are cumbersome and patient-unfriendly. Recent advances in speech language models (SpeechLMs) have enabled more natural speech-based interaction, yet the scarcity of medical speech data and the inefficiency of directly fine-tuning on speech data jointly hinder the adoption of SpeechLMs in medical consultation. In this paper, we propose SpeechMedAssist, a SpeechLM natively capable of conducting speech-based multi-turn interactions with patients. By exploiting the architectural properties of SpeechLMs, we decouple the conventional one-stage training into a two-stage paradigm consisting of (1) Knowledge & Capability Injection via Text and (2) Modality Re-alignment with Limited Speech Data, thereby reducing the requirement for medical speech data to only 10k synthesized samples. To evaluate SpeechLMs for medical consultation scenarios, we design a benchmark comprising both single-turn question answering and multi-turn simulated interactions. Experimental results show that our model outperforms all baselines in both effectiveness and robustness in most evaluation settings.

[49] CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models

Yifan Le, Yunliang Li

Main category: cs.CL

TL;DR: CRANE is a relevance-based framework that identifies language-specific neurons in multilingual LLMs through targeted interventions, showing these neurons are language-selective but not exclusive.

Details

Motivation: Current methods for identifying language-related neurons in multilingual LLMs rely on activation-based heuristics, which conflate language preference with functional importance. There's a need for better understanding how language capabilities are organized at the neuron level.

Method: CRANE uses relevance-based analysis with targeted neuron-level interventions to identify language-specific neurons based on their functional necessity rather than activation magnitude. It characterizes neuron specialization by their contribution to language-conditioned predictions.

Result: Neuron-level interventions reveal asymmetric patterns: masking target language neurons degrades performance on that language while largely preserving performance on other languages. CRANE isolates language-specific components more precisely than activation-based methods across English, Chinese, and Vietnamese benchmarks.

Conclusion: CRANE provides a more accurate framework for understanding language organization in multilingual LLMs, revealing language-selective but non-exclusive neuron specializations that current activation-based methods fail to capture properly.

Abstract: Multilingual large language models (LLMs) achieve strong performance across languages, yet how language capabilities are organized at the neuron level remains poorly understood. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. Prior work has identified language-related neurons mainly through activation-based heuristics, which conflate language preference with functional importance. We propose CRANE, a relevance-based analysis framework that redefines language specificity in terms of functional necessity, identifying language-specific neurons through targeted neuron-level interventions. CRANE characterizes neuron specialization by their contribution to language-conditioned predictions rather than activation magnitude. Our implementation will be made publicly available. Neuron-level interventions reveal a consistent asymmetric pattern: masking neurons relevant to a target language selectively degrades performance on that language while preserving performance on other languages to a substantial extent, indicating language-selective but non-exclusive neuron specializations. Experiments on English, Chinese, and Vietnamese across multiple benchmarks, together with a dedicated relevance-based metric and base-to-chat model transfer analysis, show that CRANE isolates language-specific components more precisely than activation-based methods.

[50] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs

Yanming Liu, Xinyue Peng, Jiannan Cao, Xinyi Wang, Songhang Deng, Jintao Chen, Jianwei Yin, Xuhong Zhang

Main category: cs.CL

TL;DR: ToolGate is a framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling through formal contracts and runtime verification.

Details

Motivation: Existing LLM tool-augmentation frameworks rely on natural language reasoning without formal guarantees for logical safety and verifiability, risking invalid or hallucinated results corrupting world representations.

Method: ToolGate maintains an explicit symbolic state space as typed key-value mapping, formalizes tools as Hoare-style contracts (preconditions and postconditions), gates tool invocation via precondition checks, and commits results through runtime verification of postconditions.

Result: Experimental validation shows ToolGate significantly improves reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks.

Conclusion: ToolGate establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools through formal safety guarantees.

Abstract: Large Language Models (LLMs) augmented with external tools have demonstrated remarkable capabilities in complex reasoning tasks. However, existing frameworks rely heavily on natural language reasoning to determine when tools can be invoked and whether their results should be committed, lacking formal guarantees for logical safety and verifiability. We present \textbf{ToolGate}, a forward execution framework that provides logical safety guarantees and verifiable state evolution for LLM tool calling. ToolGate maintains an explicit symbolic state space as a typed key-value mapping representing trusted world information throughout the reasoning process. Each tool is formalized as a Hoare-style contract consisting of a precondition and a postcondition, where the precondition gates tool invocation by checking whether the current state satisfies the required conditions, and the postcondition determines whether the tool’s result can be committed to update the state through runtime verification. Our approach guarantees that the symbolic state evolves only through verified tool executions, preventing invalid or hallucinated results from corrupting the world representation. Experimental validation demonstrates that ToolGate significantly improves the reliability and verifiability of tool-augmented LLM systems while maintaining competitive performance on complex multi-step reasoning tasks. This work establishes a foundation for building more trustworthy and debuggable AI systems that integrate language models with external tools.

[51] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation

Naquee Rizwan, Subhankar Swain, Paramananda Bhaskar, Gagan Aryan, Shehryaar Shah Khan, Animesh Mukherjee

Main category: cs.CL

TL;DR: A novel framework using generative AI for hateful meme detection, explanation, and intervention under limited data conditions.

Details

Motivation: Current approaches study detection, explanation, and intervention separately, which doesn't reflect real-world moderation needs. Additionally, curating large annotated datasets for meme moderation is prohibitively expensive.

Method: Leverages task-specific generative multimodal agents and few-shot adaptability of large multimodal models to handle different meme types with limited data.

Result: Proposes the first framework for generalizable hateful meme moderation under limited data conditions, integrating detection, explanation, and intervention.

Conclusion: The approach has strong potential for real-world deployment and addresses the practical challenges of meme moderation with limited annotated data.

Abstract: In this work, we examine hateful memes from three complementary angles - how to detect them, how to explain their content and how to intervene them prior to being posted - by applying a range of strategies built on top of generative AI models. To the best of our knowledge, explanation and intervention have typically been studied separately from detection, which does not reflect real-world conditions. Further, since curating large annotated datasets for meme moderation is prohibitively expensive, we propose a novel framework that leverages task-specific generative multimodal agents and the few-shot adaptability of large multimodal models to cater to different types of memes. We believe this is the first work focused on generalizable hateful meme moderation under limited data conditions, and has strong potential for deployment in real-world production scenarios. Warning: Contains potentially toxic contents.

[52] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding

Sungmok Jung, Yeonkyoung So, Joonhak Lee, Sangho Kim, Yelim Ahn, Jaejin Lee

Main category: cs.CL

TL;DR: Thunder-KoNUBench: A Korean negation benchmark showing LLMs struggle with negation, with fine-tuning improving performance.

Details

Motivation: Negation challenges LLMs, but Korean negation benchmarks are scarce. Need to evaluate and improve LLM understanding of Korean negation phenomena.

Method: 1) Corpus analysis of Korean negation patterns, 2) Create Thunder-KoNUBench benchmark reflecting empirical distribution, 3) Evaluate 47 LLMs, 4) Fine-tune models on benchmark.

Result: LLM performance degrades under negation. Model size and instruction tuning affect performance. Fine-tuning on Thunder-KoNUBench improves both negation understanding and broader contextual comprehension in Korean.

Conclusion: Thunder-KoNUBench addresses Korean negation evaluation gap, shows LLMs struggle with negation, and demonstrates fine-tuning effectiveness for improving Korean language understanding.

Abstract: Although negation is known to challenge large language models (LLMs), benchmarks for evaluating negation understanding, especially in Korean, are scarce. We conduct a corpus-based analysis of Korean negation and show that LLM performance degrades under negation. We then introduce Thunder-KoNUBench, a sentence-level benchmark that reflects the empirical distribution of Korean negation phenomena. Evaluating 47 LLMs, we analyze the effects of model size and instruction tuning, and show that fine-tuning on Thunder-KoNUBench improves negation understanding and broader contextual comprehension in Korean.

[53] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards

Mukesh Ghimire, Aosong Feng, Liwen You, Youzhi Luo, Fang Liu, Xuan Zhu

Main category: cs.CL

TL;DR: PRISM is a training framework that combines process reward models with model self-certainty for stable unsupervised learning from unlabeled data, addressing reliability issues in existing consistency-based methods.

Details

Motivation: Current post-training methods for LLMs rely on costly human supervision or external verifiers, but as models improve, high-quality solutions to difficult problems become unavailable to humans. Existing unsupervised methods using internal consistency metrics (entropy/self-certainty) are unreliable for large-scale, long-term training.

Method: PRISM (Process Reward Model) framework combines a Process Reward Model (PRM) with the model’s internal self-certainty to guide learning without ground-truth labels. The PRM provides external verification while self-certainty provides internal confidence, creating a balanced training approach.

Result: Effectively combining PRM with self-certainty leads to both stable training and better test-time performance while keeping the model’s internal confidence in check. The unified framework addresses reliability issues of pure consistency-based methods.

Conclusion: PRISM provides a practical solution for unsupervised learning from unlabeled data by combining external process verification with internal confidence metrics, enabling stable large-scale training without human supervision.

Abstract: Current techniques for post-training Large Language Models (LLMs) rely either on costly human supervision or on external verifiers to boost performance on tasks such as mathematical reasoning and code generation. However, as LLMs improve their problem-solving, any further improvement will potentially require high-quality solutions to difficult problems that are not available to humans. As a result, learning from unlabeled data is becoming increasingly attractive in the research community. Existing methods extract learning signal from a model’s consistency, either by majority voting or by converting the model’s internal confidence into reward. Although internal consistency metric such as entropy or self-certainty require no human intervention, as we show in this work, these are unreliable signals for large-scale and long-term training. To address the unreliability, we propose PRISM, a unified training framework that uses a Process Reward Model (PRM) to guide learning alongside model’s internal confidence in the absence of ground-truth labels. We show that effectively combining PRM with self-certainty can lead to both stable training and better test-time performance, and also keep the model’s internal confidence in check.

[54] Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning

Feihu Jin, Shipeng Cen, Ying Tan

Main category: cs.CL

TL;DR: Proposes a plug-and-play zeroth-order optimization method with prior-informed perturbations to reduce gradient estimation variance and accelerate convergence for fine-tuning large language models.

Details

Motivation: Fine-tuning LLMs faces memory bottlenecks during backpropagation, while conventional zeroth-order methods suffer from high variance in gradient estimation due to random perturbations, leading to slow convergence.

Method: Incorporates prior-informed perturbations to refine gradient estimation, dynamically computing a guiding vector from Gaussian samples to direct perturbations toward more informative directions. Also investigates greedy perturbation strategy.

Result: Method outperforms traditional ZO optimization across all 11 benchmark tasks on OPT-13B model and surpasses gradient-based baselines on 9 out of 11 tasks, achieving faster convergence and superior performance.

Conclusion: Proposed method establishes a robust balance between efficiency and accuracy, seamlessly integrates into existing optimization methods, and enhances optimization efficiency through better gradient direction alignment.

Abstract: Fine-tuning large language models (LLMs) has achieved remarkable success across various NLP tasks, but the substantial memory overhead during backpropagation remains a critical bottleneck, especially as model scales grow. Zeroth-order (ZO) optimization alleviates this issue by estimating gradients through forward passes and Gaussian sampling, avoiding the need for backpropagation. However, conventional ZO methods suffer from high variance in gradient estimation due to their reliance on random perturbations, leading to slow convergence and suboptimal performance. We propose a simple plug-and-play method that incorporates prior-informed perturbations to refine gradient estimation. Our method dynamically computes a guiding vector from Gaussian samples, which directs perturbations toward more informative directions, significantly accelerating convergence compared to standard ZO approaches. We further investigate a greedy perturbation strategy to explore the impact of prior knowledge on gradient estimation. Theoretically, we prove that our gradient estimator achieves stronger alignment with the true gradient direction, enhancing optimization efficiency. Extensive experiments across LLMs of varying scales and architectures demonstrate that our proposed method could seamlessly integrate into existing optimization methods, delivering faster convergence and superior performance. Notably, on the OPT-13B model, our method outperforms traditional ZO optimization across all 11 benchmark tasks and surpasses gradient-based baselines on 9 out of 11 tasks, establishing a robust balance between efficiency and accuracy.

[55] DSC2025 – ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

Anh Thi-Hoang Nguyen, Khanh Quoc Tran, Tin Van Huynh, Phuoc Tan-Hoang Nguyen, Cam Tan Nguyen, Kiet Van Nguyen

Main category: cs.CL

TL;DR: First large-scale shared task for detecting hallucinations in Vietnamese LLMs, introducing ViHallu dataset with 10K annotated samples and achieving 84.80% macro-F1 score with best system.

Details

Motivation: LLM reliability is limited by hallucinations, but existing benchmarks focus on English while low-to-medium resource languages like Vietnamese lack standardized evaluation frameworks for hallucination detection.

Method: Created ViHallu dataset with 10,000 annotated (context, prompt, response) triplets categorized into three hallucination types (none, intrinsic, extrinsic) and three prompt types (factual, noisy, adversarial). Organized DSC2025 ViHallu Challenge with 111 participating teams using various detection methodologies.

Result: Best-performing system achieved 84.80% macro-F1 score, significantly outperforming baseline encoder-only score of 32.83%. Instruction-tuned LLMs with structured prompting and ensemble strategies performed best, though intrinsic hallucinations remain particularly challenging.

Conclusion: Established rigorous benchmark for Vietnamese hallucination detection, demonstrating that specialized approaches outperform generic architectures but perfect performance remains elusive, especially for contradiction-based hallucinations. Provides foundation for future research on Vietnamese AI system trustworthiness.

Abstract: The reliability of large language models (LLMs) in production environments remains significantly constrained by their propensity to generate hallucinations – fluent, plausible-sounding outputs that contradict or fabricate information. While hallucination detection has recently emerged as a priority in English-centric benchmarks, low-to-medium resource languages such as Vietnamese remain inadequately covered by standardized evaluation frameworks. This paper introduces the DSC2025 ViHallu Challenge, the first large-scale shared task for detecting hallucinations in Vietnamese LLMs. We present the ViHallu dataset, comprising 10,000 annotated triplets of (context, prompt, response) samples systematically partitioned into three hallucination categories: no hallucination, intrinsic, and extrinsic hallucinations. The dataset incorporates three prompt types – factual, noisy, and adversarial – to stress-test model robustness. A total of 111 teams participated, with the best-performing system achieving a macro-F1 score of 84.80%, compared to a baseline encoder-only score of 32.83%, demonstrating that instruction-tuned LLMs with structured prompting and ensemble strategies substantially outperform generic architectures. However, the gap to perfect performance indicates that hallucination detection remains a challenging problem, particularly for intrinsic (contradiction-based) hallucinations. This work establishes a rigorous benchmark and explores a diverse range of detection methodologies, providing a foundation for future research into the trustworthiness and reliability of Vietnamese language AI systems.

[56] Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents

Yonghyun Jun, Junhyuk Choi, Jihyeong Park, Hwanhee Lee

Main category: cs.CL

TL;DR: The paper proposes Character Identity as a multidimensional construct separating characters into Parametric Identity (pre-trained knowledge) and Attributive Identity (behavioral properties), revealing that fame advantages fade quickly while negative social traits remain a bottleneck in role-playing agent fidelity.

Details

Motivation: Current role-playing agents treat characters as arbitrary text inputs without formal structural dimensions, lacking systematic understanding of what defines character identity in LLM-based agents.

Method: Proposed Character Identity framework with two layers (Parametric and Attributive), constructed unified character profile schema, generated Famous and Synthetic characters under identical constraints, evaluated across single-turn and multi-turn interactions.

Result: Two key findings: 1) “Fame Fades” - famous characters’ initial advantage from parametric knowledge disappears quickly as models prioritize conversational context; 2) “Nature Remains” - models robustly portray general personality but performance is highly sensitive to morality and interpersonal relationship valence, with negative social natures being the primary bottleneck.

Conclusion: Character identity should be understood as multidimensional, with negative social traits being the main challenge for role-playing agent fidelity, providing guidance for future character construction and evaluation.

Abstract: Despite the rapid proliferation of Role-Playing Agents (RPAs) based on Large Language Models (LLMs), the structural dimensions defining a character’s identity remain weakly formalized, often treating characters as arbitrary text inputs. In this paper, we propose the concept of \textbf{Character Identity}, a multidimensional construct that disentangles a character into two distinct layers: \textbf{(1) Parametric Identity}, referring to character-specific knowledge encoded from the LLM’s pre-training, and \textbf{(2) Attributive Identity}, capturing fine-grained behavioral properties such as personality traits and moral values. To systematically investigate these layers, we construct a unified character profile schema and generate both Famous and Synthetic characters under identical structural constraints. Our evaluation across single-turn and multi-turn interactions reveals two critical phenomena. First, we identify \textit{“Fame Fades”}: while famous characters hold a significant advantage in initial turns due to parametric knowledge, this edge rapidly vanishes as models prioritize accumulating conversational context over pre-trained priors. Second, we find that \textit{“Nature Remains”}: while models robustly portray general personality traits regardless of polarity, RPA performance is highly sensitive to the valence of morality and interpersonal relationships. Our findings pinpoint negative social natures as the primary bottleneck in RPA fidelity, guiding future character construction and evaluation.

[57] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin

Main category: cs.CL

TL;DR: Qwen3-VL-Embedding and Qwen3-VL-Reranker are multimodal embedding and reranking models that provide an end-to-end pipeline for high-precision multimodal search across text, images, documents, and video.

Details

Motivation: To create a unified multimodal search system that can handle diverse modalities (text, images, documents, video) in a single representation space, supporting multilingual capabilities and flexible deployment options.

Method: Multi-stage training paradigm: large-scale contrastive pre-training followed by reranking model distillation. Qwen3-VL-Embedding uses Matryoshka Representation Learning for flexible dimensions and supports 32k token inputs. Qwen3-VL-Reranker uses cross-encoder architecture with cross-attention for fine-grained relevance estimation.

Result: State-of-the-art results on multimodal embedding benchmarks, with Qwen3-VL-Embedding-8B achieving 77.8 overall score on MMEB-V2 (ranked first as of Jan 8, 2025). Models come in 2B and 8B parameter sizes, support 30+ languages, and perform well on image-text retrieval, VQA, and video-text matching.

Conclusion: The Qwen3-VL-Embedding and Qwen3-VL-Reranker series provide an effective end-to-end multimodal search pipeline with state-of-the-art performance, multilingual support, and flexible deployment options for diverse multimodal retrieval tasks.

Abstract: In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in $\textbf{2B}$ and $\textbf{8B}$ parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of $\textbf{77.8}$ on MMEB-V2, ranking first among all models (as of January 8, 2025). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.

[58] Automatic Classifiers Underdetect Emotions Expressed by Men

Ivan Smirnov, Segun T. Aroyehun, Paul Plener, David Garcia

Main category: cs.CL

TL;DR: Systematic gender bias found in emotion detection models - error rates consistently higher for texts authored by men vs women across 414 model-class combinations.

Details

Motivation: Current emotion classifiers are assessed using third-party annotations rather than self-reported emotions, potentially concealing systematic biases. Need to ensure reliable performance across different populations.

Method: Used unique large-scale dataset of 1M+ self-annotated posts with pre-registered research design to investigate gender biases across 414 combinations of models and emotion-related classes.

Result: Error rates consistently higher for texts authored by men compared to women across different classifier types and underlying emotions. Quantified how this bias affects downstream applications.

Conclusion: Sentiment analysis is not yet solved for equitable model behavior. Current ML tools, including LLMs, should be applied with caution when gender composition is unknown or variable.

Abstract: The widespread adoption of automatic sentiment and emotion classifiers makes it important to ensure that these tools perform reliably across different populations. Yet their reliability is typically assessed using benchmarks that rely on third-party annotators rather than the individuals experiencing the emotions themselves, potentially concealing systematic biases. In this paper, we use a unique, large-scale dataset of more than one million self-annotated posts and a pre-registered research design to investigate gender biases in emotion detection across 414 combinations of models and emotion-related classes. We find that across different types of automatic classifiers and various underlying emotions, error rates are consistently higher for texts authored by men compared to those authored by women. We quantify how this bias could affect results in downstream applications and show that current machine learning tools, including large language models, should be applied with caution when the gender composition of a sample is not known or variable. Our findings demonstrate that sentiment analysis is not yet a solved problem, especially in ensuring equitable model behaviour across demographic groups.

Han Zhu, Jiale Chen, Chengkun Cai, Shengjie Sun, Haoran Li, Yujin Zhou, Chi-Min Chan, Pengcheng Wen, Lei Li, Sirui Han, Yike Guo

Main category: cs.CL

TL;DR: InterSafe-V dataset and AM³Safety framework improve multi-modal LLM safety in multi-turn dialogues, reducing attack success by 10+% while improving helpfulness and harmlessness.

Details

Motivation: MLLMs deployed in interactive applications have safety vulnerabilities in multi-turn scenarios where harmful intent can be reconstructed across turns and security protocols fade. Existing RLHF alignment methods are designed for single-turn VQA tasks and require costly manual annotations, limiting effectiveness and scalability in dialogues.

Method: 1) Created InterSafe-V dataset: 11,270 multi-modal dialogues + 500 refusal VQA samples, constructed through model interactions to reflect real-world scenarios. 2) Proposed AM³Safety framework: combines cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues.

Result: Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show: >10% decrease in Attack Success Rate (ASR), at least 8% increment in harmless dimension, over 13% increment in helpful dimension on multi-modal multi-turn safety benchmarks, while preserving general abilities.

Conclusion: The InterSafe-V dataset and AM³Safety framework effectively address multi-turn safety vulnerabilities in MLLMs, significantly improving safety while maintaining helpfulness and general capabilities, providing a scalable solution for multi-modal dialogue safety alignment.

Abstract: Multi-modal Large Language Models (MLLMs) are increasingly deployed in interactive applications. However, their safety vulnerabilities become pronounced in multi-turn multi-modal scenarios, where harmful intent can be gradually reconstructed across turns, and security protocols fade into oblivion as the conversation progresses. Existing Reinforcement Learning from Human Feedback (RLHF) alignment methods are largely developed for single-turn visual question-answer (VQA) task and often require costly manual preference annotations, limiting their effectiveness and scalability in dialogues. To address this challenge, we present InterSafe-V, an open-source multi-modal dialogue dataset containing 11,270 dialogues and 500 specially designed refusal VQA samples. This dataset, constructed through interaction between several models, is designed to more accurately reflect real-world scenarios and includes specialized VQA pairs tailored for specific domains. Building on this dataset, we propose AM$^3$Safety, a framework that combines a cold-start refusal phase with Group Relative Policy Optimization (GRPO) fine-tuning using turn-aware dual-objective rewards across entire dialogues. Experiments on Qwen2.5-VL-7B-Instruct and LLaVA-NeXT-7B show more than 10% decrease in Attack Success Rate (ASR) together with an increment of at least 8% in harmless dimension and over 13% in helpful dimension of MLLMs on multi-modal multi-turn safety benchmarks, while preserving their general abilities.

[60] RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation

Huawei Zheng, Xinqi Jiang, Sen Yang, Shouling Ji, Yingcai Wu, Dazhen Deng

Main category: cs.CL

TL;DR: Proposes a framework for generating implicit harmful prompts using knowledge graphs and dual-path obfuscation to create domain-specific safety datasets for LLMs.

Details

Motivation: Domain-specific LLM applications (finance, healthcare) face unique safety risks, but existing harmful prompt datasets are scarce, manually constructed, and focus on explicit prompts that modern defenses can detect. Implicit harmful prompts using domain knowledge are harder to detect and better reflect real-world threats.

Method: End-to-end framework with two main components: 1) Knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and 2) Dual-path obfuscation rewriting that converts explicit harmful prompts into implicit variants through direct and context-enhanced rewriting.

Result: The framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. Code and datasets are released on GitHub.

Conclusion: The proposed approach addresses the scarcity of domain-specific harmful prompt datasets and the limitations of explicit prompts by generating implicit harmful prompts that better reflect real-world threats, advancing LLM safety research in specialized domains.

Abstract: Large language models (LLMs) are increasingly applied in specialized domains such as finance and healthcare, where they introduce unique safety risks. Domain-specific datasets of harmful prompts remain scarce and still largely rely on manual construction; public datasets mainly focus on explicit harmful prompts, which modern LLM defenses can often detect and refuse. In contrast, implicit harmful prompts-expressed through indirect domain knowledge-are harder to detect and better reflect real-world threats. We identify two challenges: transforming domain knowledge into actionable constraints and increasing the implicitness of generated harmful prompts. To address them, we propose an end-to-end framework that first performs knowledge-graph-guided harmful prompt generation to systematically produce domain-relevant prompts, and then applies dual-path obfuscation rewriting to convert explicit harmful prompts into implicit variants via direct and context-enhanced rewriting. This framework yields high-quality datasets combining strong domain relevance with implicitness, enabling more realistic red-teaming and advancing LLM safety research. We release our code and datasets at GitHub.

[61] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval

Seyeon Jeong, Yeonjun Choi, JongWook Kim, Beakcheol Jang

Main category: cs.CL

TL;DR: Tool-MAD: Multi-agent debate framework where agents use different external tools (search, RAG) to improve factual verification and reduce hallucinations in LLMs.

Details

Motivation: Existing multi-agent debate (MAD) systems rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence, its one-time retrieval limits adaptability to new arguments during debates.

Method: Tool-MAD assigns each agent a distinct external tool (search API, RAG module). It features: 1) multi-agent debate with heterogeneous tools for diverse perspectives, 2) adaptive query formulation that iteratively refines evidence retrieval based on debate flow, and 3) integration of Faithfulness and Answer Relevance scores for quantitative assessment by the Judge agent.

Result: Outperforms state-of-the-art MAD frameworks on four fact verification benchmarks with up to 5.5% accuracy improvement. Shows strong robustness and adaptability in medically specialized domains across various tool configurations.

Conclusion: Tool-MAD demonstrates potential for broader real-world fact-checking applications by effectively reducing hallucinations through tool-enhanced multi-agent debates with adaptive evidence retrieval and quantitative assessment mechanisms.

Abstract: Large Language Models (LLMs) suffer from hallucinations and factual inaccuracies, especially in complex reasoning and fact verification tasks. Multi-Agent Debate (MAD) systems aim to improve answer accuracy by enabling multiple LLM agents to engage in dialogue, promoting diverse reasoning and mutual verification. However, existing MAD frameworks primarily rely on internal knowledge or static documents, making them vulnerable to hallucinations. While MADKE introduces external evidence to mitigate this, its one-time retrieval mechanism limits adaptability to new arguments or emerging information during the debate. To address these limitations, We propose Tool-MAD, a multi-agent debate framework that enhances factual verification by assigning each agent a distinct external tool, such as a search API or RAG module. Tool-MAD introduces three key innovations: (1) a multi-agent debate framework where agents leverage heterogeneous external tools, encouraging diverse perspectives, (2) an adaptive query formulation mechanism that iteratively refines evidence retrieval based on the flow of the debate, and (3) the integration of Faithfulness and Answer Relevance scores into the final decision process, allowing the Judge agent to quantitatively assess the coherence and question alignment of each response and effectively detect hallucinations. Experimental results on four fact verification benchmarks demonstrate that Tool-MAD consistently outperforms state-of-the-art MAD frameworks, achieving up to 5.5% accuracy improvement. Furthermore, in medically specialized domains, Tool-MAD exhibits strong robustness and adaptability across various tool configurations and domain conditions, confirming its potential for broader real-world fact-checking applications.

[62] PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks

Yehoon Jang, Chaewon Lee, Hyun-seok Min, Sungchul Choi

Main category: cs.CL

TL;DR: PILOT-Bench is the first PTAB-centric benchmark for evaluating LLMs’ structured legal reasoning in patent appeals, featuring three IRAC-aligned classification tasks with significant performance gaps between closed-source and open-source models.

Details

Motivation: Current LLM applications in patent and legal practice are limited to lightweight tasks, with no systematic way to evaluate structured legal reasoning capabilities in the patent domain, particularly for PTAB appeals that require integration of technical understanding and legal reasoning.

Method: Created PILOT-Bench by aligning PTAB decisions with USPTO patent data at case-level and formalizing three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. Evaluated diverse closed-source (commercial) and open-source LLMs across multiple perspectives including input-variation settings, model families, and error tendencies.

Result: Closed-source models consistently exceed 0.75 Micro-F1 score on Issue Type task, while strongest open-source model (Qwen-8B) achieves only around 0.56, showing substantial reasoning capability gap. Benchmark establishes foundation for systematic evaluation of patent-domain legal reasoning.

Conclusion: PILOT-Bench provides the first systematic benchmark for evaluating LLMs’ patent-domain legal reasoning, revealing significant performance disparities between commercial and open-source models, and points toward future improvements through dataset design and model alignment.

Abstract: The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of ex parte appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. In this work, we introduce PILOT-Bench, the first PTAB-centric benchmark that aligns PTAB decisions with USPTO patent data at the case-level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of closed-source (commercial) and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, closed-source models consistently exceed 0.75 in Micro-F1 score, whereas the strongest open-source model (Qwen-8B) achieves performance around 0.56, highlighting a substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.

[63] Differential syntactic and semantic encoding in LLMs

Santiago Acevedo, Alessandro Laio, Marco Baroni

Main category: cs.CL

TL;DR: LLM representations encode syntax and semantics linearly; subtracting syntactic/semantic centroids reduces similarity with matched sentences, showing differential encoding patterns.

Details

Motivation: To understand how syntactic and semantic information is encoded in the inner layer representations of large language models, specifically DeepSeek-V3, and whether these linguistic features are linearly separable.

Method: Averaging hidden-representation vectors of sentences sharing syntactic structure or meaning to create syntactic and semantic “centroids,” then subtracting these centroids from sentence vectors to analyze their effects on similarity with syntactically/semantically matched sentences.

Result: Subtracting syntactic/semantic centroids strongly reduces similarity with corresponding matched sentences, suggesting linear encoding. Cross-layer encoding profiles differ for syntax vs. semantics, and the two signals can be partially decoupled, indicating differential encoding.

Conclusion: Syntax and semantics are at least partially linearly encoded in LLM representations, with distinct encoding patterns across layers, allowing for some decoupling of these linguistic information types.

Abstract: We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic ``centroids’’ from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.

[64] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence

Shengyin Sun, Yiming Li, Renxi Liu, Weizhe Lin, Hui-Ling Zhen, Xianzhi Yu, Mingxuan Yuan, Chen Ma

Main category: cs.CL

TL;DR: Training-free verification for speculative decoding using KL divergence instead of supervised judges

Details

Motivation: Current judge decoding methods for LLM inference acceleration require expensive and noisy supervision, creating a bottleneck. The paper aims to eliminate this supervision requirement by discovering that the necessary "criticality" scores are already encoded in the draft-target distributional divergence.

Method: The authors theoretically prove that learned linear judges correspond to Kullback-Leibler (KL) divergence, showing they rely on the same underlying logit primitives. Based on this insight, they propose a simple, training-free verification mechanism using KL divergence instead of supervised judges.

Result: Extensive experiments across reasoning and coding benchmarks show the method matches or outperforms complex trained judges (like AutoJudge), offers superior robustness to domain shifts, and completely eliminates the supervision bottleneck.

Conclusion: The supervision requirement for judge decoding can be eliminated by using KL divergence for verification, as the necessary criticality information is already present in the draft-target distributional divergence, leading to simpler, more robust, and equally effective inference acceleration.

Abstract: Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the ``criticality’’ scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.

[65] LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal

Dongjun Kim, Jeongho Yoon, Chanjun Park, Heuiseok Lim

Main category: cs.CL

TL;DR: LANGSAE EDITING removes language-identity signals from multilingual embeddings to improve cross-language retrieval without retraining base models.

Details

Motivation: Multilingual embeddings encode language identity alongside semantics, which inflates similarity for same-language pairs and reduces effectiveness for cross-language retrieval in mixed-language collections.

Method: Post-hoc sparse autoencoder trained on pooled embeddings identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference, and reconstructs embeddings in original dimensionality.

Result: Experiments show consistent improvements in ranking quality and cross-language coverage across multiple languages, with especially strong gains for script-distinct languages.

Conclusion: LANGSAE EDITING enables controllable removal of language-identity signals, making it compatible with existing vector databases without retraining base encoders or re-encoding raw text.

Abstract: Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with especially strong gains for script-distinct languages.

[66] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems

Xinyue Peng, Yanming Liu, Yihan Cang, Yuwei Zhang, Xinyi Wang, Songhang Deng, Jiannan Cao

Main category: cs.CL

TL;DR: NC2C is an LLM-based framework that automatically transforms non-convex optimization problems into solvable convex forms, achieving 89.3% execution rate and 76% success rate on diverse problems.

Details

Motivation: Non-convex optimization problems are pervasive but intractable for traditional solvers due to complex objectives and constraints. Manual convexification is inefficient and relies heavily on expert knowledge, creating a need for automated solutions.

Method: NC2C uses large language models to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. It integrates symbolic reasoning, adaptive transformation techniques, iterative validation, error correction loops, and feasibility domain correction mechanisms.

Result: On a diverse dataset of 100 generic non-convex problems, NC2C achieves 89.3% execution rate and 76% success rate in producing feasible, high-quality convex transformations, significantly outperforming baseline methods.

Conclusion: NC2C demonstrates that LLMs can effectively automate non-convex to convex transformation, reducing expert dependency and enabling efficient deployment of convex solvers for previously intractable optimization tasks.

Abstract: Non-convex optimization problems are pervasive across mathematical programming, engineering design, and scientific computing, often posing intractable challenges for traditional solvers due to their complex objective functions and constrained landscapes. To address the inefficiency of manual convexification and the over-reliance on expert knowledge, we propose NC2C, an LLM-based end-to-end automated framework designed to transform generic non-convex optimization problems into solvable convex forms using large language models. NC2C leverages LLMs’ mathematical reasoning capabilities to autonomously detect non-convex components, select optimal convexification strategies, and generate rigorous convex equivalents. The framework integrates symbolic reasoning, adaptive transformation techniques, and iterative validation, equipped with error correction loops and feasibility domain correction mechanisms to ensure the robustness and validity of transformed problems. Experimental results on a diverse dataset of 100 generic non-convex problems demonstrate that NC2C achieves an 89.3% execution rate and a 76% success rate in producing feasible, high-quality convex transformations. This outperforms baseline methods by a significant margin, highlighting NC2C’s ability to leverage LLMs for automated non-convex to convex transformation, reduce expert dependency, and enable efficient deployment of convex solvers for previously intractable optimization tasks.

[67] Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework

Junhyuk Choi, Jeongyoun Kwon, Heeju Kim, Haeun Cho, Hayeong Jung, Sehee Min, Bugeun Kim

Main category: cs.CL

TL;DR: First systematic analysis of role-based authority bias in multi-agent LLM systems shows Expert and Referent power roles have stronger influence than Legitimate power roles, with bias emerging through authoritative roles maintaining positions while general agents show flexibility.

Details

Motivation: Multi-agent systems using LLMs often assign authoritative roles to improve performance, but the impact of authority bias on agent interactions remains underexplored, creating a gap in understanding how different types of authority influence agent dynamics.

Method: Used ChatEval for free-form multi-agent evaluation, applying French and Raven’s power-based theory to classify authoritative roles into legitimate, referent, and expert types, analyzing their influence across 12-turn conversations with GPT-4o and DeepSeek R1 models.

Result: Expert and Referent power roles exert stronger influence than Legitimate power roles. Authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining positions while general agents demonstrate flexibility. Authority influence requires clear position statements, as neutral responses fail to generate bias.

Conclusion: These findings provide key insights for designing multi-agent frameworks with asymmetric interaction patterns, highlighting the differential impact of authority types and the conditions under which authority bias manifests in LLM-based multi-agent systems.

Abstract: Multi-agent systems utilizing large language models often assign authoritative roles to improve performance, yet the impact of authority bias on agent interactions remains underexplored. We present the first systematic analysis of role-based authority bias in free-form multi-agent evaluation using ChatEval. Applying French and Raven’s power-based theory, we classify authoritative roles into legitimate, referent, and expert types and analyze their influence across 12-turn conversations. Experiments with GPT-4o and DeepSeek R1 reveal that Expert and Referent power roles exert stronger influence than Legitimate power roles. Crucially, authority bias emerges not through active conformity by general agents, but through authoritative roles consistently maintaining their positions while general agents demonstrate flexibility. Furthermore, authority influence requires clear position statements, as neutral responses fail to generate bias. These findings provide key insights for designing multi-agent frameworks with asymmetric interaction patterns.

[68] When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection

Ke Sun, Guangsheng Bao, Han Cui, Yue Zhang

Main category: cs.CL

TL;DR: The paper identifies “Late-Stage Volatility Decay” in AI-generated text (stabilizing log probability fluctuations in later generation stages) and proposes two simple late-stage features for zero-shot detection, achieving SOTA performance without extra model access.

Details

Motivation: Current zero-shot detection methods aggregate token-level statistics across entire sequences, ignoring temporal dynamics of autoregressive generation. The authors aim to leverage these temporal patterns to improve AI-generated text detection.

Method: Analyzed 120k+ text samples to identify Late-Stage Volatility Decay pattern. Proposed two features: Derivative Dispersion and Local Volatility, computed exclusively from late-stage statistics. No perturbation sampling or additional model access required.

Result: AI-generated text shows 24-32% lower volatility in second half of sequences compared to human writing. The method achieves state-of-the-art performance on EvoBench and MAGE benchmarks and demonstrates strong complementarity with existing global methods.

Conclusion: Temporal dynamics in autoregressive generation provide valuable signals for AI-generated text detection. Simple late-stage features can achieve excellent performance without complex sampling or additional model access, complementing existing approaches.

Abstract: Zero-shot detection methods for AI-generated text typically aggregate token-level statistics across entire sequences, overlooking the temporal dynamics inherent to autoregressive generation. We analyze over 120k text samples and reveal Late-Stage Volatility Decay: AI-generated text exhibits rapidly stabilizing log probability fluctuations as generation progresses, while human writing maintains higher variability throughout. This divergence peaks in the second half of sequences, where AI-generated text shows 24–32% lower volatility. Based on this finding, we propose two simple features: Derivative Dispersion and Local Volatility, which computed exclusively from late-stage statistics. Without perturbation sampling or additional model access, our method achieves state-of-the-art performance on EvoBench and MAGE benchmarks and demonstrates strong complementarity with existing global methods.

[69] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection

Zhiwei Liu, Runteng Guo, Baojie Qu, Yuechen Jiang, Min Peng, Qianqian Xie, Sophia Ananiadou

Main category: cs.CL

TL;DR: RAAR is a retrieval-augmented agentic reasoning framework for cross-domain misinformation detection that uses multi-perspective evidence retrieval and multi-agent collaboration to overcome domain transfer limitations.

Details

Motivation: Cross-domain misinformation detection is challenging due to substantial domain differences in knowledge and discourse. Existing methods struggle with generalization to challenging domains, while LLMs are limited to same-distribution data and lack systematic reasoning.

Method: RAAR retrieves multi-perspective source-domain evidence aligned with target samples’ semantics, sentiment, and style. It uses specialized multi-agent collaboration where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. The framework applies supervised fine-tuning and reinforcement learning to train a multi-task verifier.

Result: RAAR substantially enhances base model capabilities and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches on three cross-domain misinformation detection tasks. The authors trained RAAR-8b and RAAR-14b models.

Conclusion: RAAR successfully addresses cross-domain misinformation detection challenges through retrieval-augmented agentic reasoning, demonstrating superior performance over existing methods and enabling better generalization across domains.

Abstract: Cross-domain misinformation detection is challenging, as misinformation arises across domains with substantial differences in knowledge and discourse. Existing methods often rely on single-perspective cues and struggle to generalize to challenging or underrepresented domains, while reasoning large language models (LLMs), though effective on complex tasks, are limited to same-distribution data. To address these gaps, we introduce RAAR, the first retrieval-augmented agentic reasoning framework for cross-domain misinformation detection. To enable cross-domain transfer beyond same-distribution assumptions, RAAR retrieves multi-perspective source-domain evidence aligned with each target sample’s semantics, sentiment, and writing style. To overcome single-perspective modeling and missing systematic reasoning, RAAR constructs verifiable multi-step reasoning paths through specialized multi-agent collaboration, where perspective-specific agents produce complementary analyses and a summary agent integrates them under verifier guidance. RAAR further applies supervised fine-tuning and reinforcement learning to train a single multi-task verifier to enhance verification and reasoning capabilities. Based on RAAR, we trained the RAAR-8b and RAAR-14b models. Evaluation on three cross-domain misinformation detection tasks shows that RAAR substantially enhances the capabilities of the base models and outperforms other cross-domain methods, advanced LLMs, and LLM-based adaptation approaches. The project will be released at https://github.com/lzw108/RAAR.

[70] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics

Oshri Naparstek

Main category: cs.CL

TL;DR: Continuous autoregressive language model where tokens evolve as continuous vectors over multiple steps before discretization, enabling stable deterministic generation without token-level sampling.

Details

Motivation: Traditional autoregressive models commit to discrete tokens early, forcing uncertainty resolution through token-level sampling which causes instability, repetition, and sensitivity to decoding heuristics.

Method: Introduces continuous autoregressive formulation where tokens are represented as continuous vectors that mature over multiple update steps before discretization. Uses deterministic dynamical process to evolve continuous token representations, committing to discrete tokens only when representations have sufficiently converged.

Result: The maturation process alone produces coherent and diverse text using deterministic decoding (argmax) without token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations can be incorporated naturally but are not required.

Conclusion: First autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.

Abstract: Autoregressive language models are conventionally defined over discrete token sequences, committing to a specific token at every generation step. This early discretization forces uncertainty to be resolved through token-level sampling, often leading to instability, repetition, and sensitivity to decoding heuristics. In this work, we introduce a continuous autoregressive formulation of language generation in which tokens are represented as continuous vectors that \emph{mature} over multiple update steps before being discretized. Rather than sampling tokens, the model evolves continuous token representations through a deterministic dynamical process, committing to a discrete token only when the representation has sufficiently converged. Discrete text is recovered via hard decoding, while uncertainty is maintained and resolved in the continuous space. We show that this maturation process alone is sufficient to produce coherent and diverse text using deterministic decoding (argmax), without reliance on token-level sampling, diffusion-style denoising, or auxiliary stabilization mechanisms. Additional perturbations, such as stochastic dynamics or history smoothing, can be incorporated naturally but are not required for the model to function. To our knowledge, this is the first autoregressive language model that generates text by evolving continuous token representations to convergence prior to discretization, enabling stable generation without token-level sampling.

[71] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News

Zhiwei Liu, Paul Thompson, Jiaqi Rong, Baojie Qu, Runteng Guo, Min Peng, Qianqian Xie, Sophia Ananiadou

Main category: cs.CL

TL;DR: MisSpans is a new benchmark for fine-grained misinformation detection at the span level, featuring three tasks: identifying false spans, categorizing misinformation types, and providing explanations.

Details

Motivation: Existing benchmarks evaluate veracity at coarse levels (whole claims/paragraphs) with binary labels, which obscures how true and false details often co-exist within single sentences and limits interpretability for identifying specific misleading segments or differentiating falsehood types.

Method: Created MisSpans benchmark with paired real and fake news stories, annotated by experts using standardized guidelines and consistency checks. Evaluated 15 representative LLMs (including reasoning-enhanced and non-reasoning variants) under zero-shot and one-shot settings on three tasks: MisSpansIdentity (pinpointing false spans), MisSpansType (categorizing false spans by misinformation type), and MisSpansExplanation (providing rationales).

Result: Results show the challenging nature of fine-grained misinformation identification and analysis, highlighting the need to understand how performance is influenced by multiple interacting factors including model size, reasoning capabilities, and domain-specific textual features.

Conclusion: MisSpans enables fine-grained localization, nuanced characterization beyond true/false, and actionable explanations for misinformation, addressing limitations of existing coarse-grained approaches and providing a benchmark for more detailed misinformation analysis.

Abstract: Online misinformation is increasingly pervasive, yet most existing benchmarks and methods evaluate veracity at the level of whole claims or paragraphs using coarse binary labels, obscuring how true and false details often co-exist within single sentences. These simplifications also limit interpretability: global explanations cannot identify which specific segments are misleading or differentiate how a detail is false (e.g., distorted vs. fabricated). To address these gaps, we introduce MisSpans, the first multi-domain, human-annotated benchmark for span-level misinformation detection and analysis, consisting of paired real and fake news stories. MisSpans defines three complementary tasks: MisSpansIdentity for pinpointing false spans within sentences, MisSpansType for categorising false spans by misinformation type, and MisSpansExplanation for providing rationales grounded in identified spans. Together, these tasks enable fine-grained localisation, nuanced characterisation beyond true/false and actionable explanations. Expert annotators were guided by standardised guidelines and consistency checks, leading to high inter-annotator agreement. We evaluate 15 representative LLMs, including reasoning-enhanced and non-reasoning variants, under zero-shot and one-shot settings. Results reveal the challenging nature of fine-grained misinformation identification and analysis, and highlight the need for a deeper understanding of how performance may be influenced by multiple interacting factors, including model size and reasoning capabilities, along with domain-specific textual features. This project will be available at https://github.com/lzw108/MisSpans.

[72] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs

Maxime Delmas, Lei Xu, André Freitas

Main category: cs.CL

TL;DR: ToPG is a novel RAG framework that uses heterogeneous proposition graphs and iterative suggestion-selection cycles to handle both simple factual and complex multi-hop queries effectively.

Details

Motivation: Current RAG approaches have limitations: standard chunking-based RAG fails on complex multi-hop queries, reasoning-interleaved methods lack global corpus awareness, and KG-based RAG struggles with simple fact-oriented queries. There's a need for a unified approach that handles both simple and complex queries effectively.

Method: ToPG models knowledge as a heterogeneous graph containing propositions, entities, and passages. It uses iterative Suggestion-Selection cycles: the Suggestion phase performs query-aware graph traversal, and the Selection phase uses LLM feedback to prune irrelevant propositions and seed the next iteration.

Result: Evaluated on three QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics, showing effectiveness across different query types.

Conclusion: ToPG shows that query-aware graph traversal combined with factual granularity is critical for efficient structured RAG systems, bridging the gap between simple factual retrieval and complex multi-hop reasoning.

Abstract: Standard RAG pipelines based on chunking excel at simple factual retrieval but fail on complex multi-hop queries due to a lack of structural connectivity. Conversely, initial strategies that interleave retrieval with reasoning often lack global corpus awareness, while Knowledge Graph (KG)-based RAG performs strongly on complex multi-hop tasks but suffers on fact-oriented single-hop queries. To bridge this gap, we propose a novel RAG framework: ToPG (Traversal over Proposition Graphs). ToPG models its knowledge base as a heterogeneous graph of propositions, entities, and passages, effectively combining the granular fact density of propositions with graph connectivity. We leverage this structure using iterative Suggestion-Selection cycles, where the Suggestion phase enables a query-aware traversal of the graph, and the Selection phase provides LLM feedback to prune irrelevant propositions and seed the next iteration. Evaluated on three distinct QA tasks (Simple, Complex, and Abstract QA), ToPG demonstrates strong performance across both accuracy- and quality-based metrics. Overall, ToPG shows that query-aware graph traversal combined with factual granularity is a critical component for efficient structured RAG systems. ToPG is available at https://github.com/idiap/ToPG.

[73] EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis

Xuanguang Pan, Chongyang Tao, Jiayuan Bai, Jianling Gao, Zhengwei Tao, Xiansheng Zhou, Gavin Cheung, Shuai Ma

Main category: cs.CL

TL;DR: EvolSQL is a structure-aware data synthesis framework that evolves SQL queries from seed data to create diverse, complex Text-to-SQL training data, outperforming existing methods with much less data.

Details

Motivation: Existing Text-to-SQL training suffers from limited high-quality, diverse datasets - either relying on scarce human annotations or simple LLM prompting without structural control, resulting in poor structural diversity and complexity.

Method: EvolSQL uses a two-stage approach: 1) exploratory Query-SQL expansion for question diversity and schema coverage, 2) adaptive directional evolution with six AST-based transformation operators to increase complexity across relational, predicate, aggregation, and nesting dimensions, plus execution-grounded refinement and schema-aware deduplication.

Result: A 7B model fine-tuned on EvolSQL data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data, demonstrating superior data efficiency and quality.

Conclusion: EvolSQL provides an effective structure-aware framework for synthesizing high-quality, structurally diverse Text-to-SQL training data that significantly improves model performance with much less data compared to existing methods.

Abstract: Training effective Text-to-SQL models remains challenging due to the scarcity of high-quality, diverse, and structurally complex datasets. Existing methods either rely on limited human-annotated corpora, or synthesize datasets directly by simply prompting LLMs without explicit control over SQL structures, often resulting in limited structural diversity and complexity. To address this, we introduce EvolSQL, a structure-aware data synthesis framework that evolves SQL queries from seed data into richer and more semantically diverse forms. EvolSQL starts with an exploratory Query-SQL expansion to broaden question diversity and improve schema coverage, and then applies an adaptive directional evolution strategy using six atomic transformation operators derived from the SQL Abstract Syntax Tree to progressively increase query complexity across relational, predicate, aggregation, and nesting dimensions. An execution-grounded SQL refinement module and schema-aware deduplication further ensure the creation of high-quality, structurally diverse mapping pairs. Experimental results show that a 7B model fine-tuned on our data outperforms one trained on the much larger SynSQL dataset using only 1/18 of the data.

[74] Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis

Mingyue Cheng, Daoyu Wang, Qi Liu, Shuo Yu, Xiaoyu Tao, Yuqian Wang, Chengzhong Chu, Yu Duan, Mingkang Long, Enhong Chen

Main category: cs.CL

TL;DR: Mind2Report is a cognitive deep research agent that emulates commercial analysts to synthesize expert-level reports from web sources, outperforming existing baselines through a training-free workflow with dynamic memory.

Details

Motivation: Current deep research agents produce reports with limited quality, reliability, and coverage, which is problematic for high-stakes business decisions that require informative commercial reports synthesized from massive, noisy web sources.

Method: Mind2Report uses a training-free agentic workflow that augments LLMs with dynamic memory. It follows a three-step cognitive process: 1) probes fine-grained intent, 2) searches web sources and records distilled information on the fly, and 3) iteratively synthesizes the report.

Result: Experiments using QRC-Eval (200 real-world commercial tasks) show Mind2Report outperforms leading baselines including OpenAI and Gemini deep research agents in terms of report quality, reliability, and coverage.

Conclusion: Mind2Report serves as a foundation for advancing commercial deep research agent design, demonstrating that cognitive agentic workflows with dynamic memory can effectively synthesize expert-level reports from web sources.

Abstract: Synthesizing informative commercial reports from massive and noisy web sources is critical for high-stakes business decisions. Although current deep research agents achieve notable progress, their reports still remain limited in terms of quality, reliability, and coverage. In this work, we propose Mind2Report, a cognitive deep research agent that emulates the commercial analyst to synthesize expert-level reports. Specifically, it first probes fine-grained intent, then searches web sources and records distilled information on the fly, and subsequently iteratively synthesizes the report. We design Mind2Report as a training-free agentic workflow that augments general large language models (LLMs) with dynamic memory to support these long-form cognitive processes. To rigorously evaluate Mind2Report, we further construct QRC-Eval comprising 200 real-world commercial tasks and establish a holistic evaluation strategy to assess report quality, reliability, and coverage. Experiments demonstrate that Mind2Report outperforms leading baselines, including OpenAI and Gemini deep research agents. Although this is a preliminary study, we expect it to serve as a foundation for advancing the future design of commercial deep research agents. Our code and data are available at https://github.com/Melmaphother/Mind2Report.

[75] CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

Ao Sun, Xiaoyu Wang, Zhe Tan, Yu Li, Jiachen Zhu, Shu Su, Yuheng Jia

Main category: cs.CL

TL;DR: CuMA proposes a Cultural Mixture of Adapters framework that uses demographic-aware routing to prevent “mean collapse” in LLMs, enabling culturally pluralistic alignment by separating conflicting value distributions into specialized expert subspaces.

Details

Motivation: As LLMs serve global audiences, alignment must respect cultural pluralism rather than enforce universal consensus. Current dense models suffer from "mean collapse" when forced to fit conflicting value distributions, converging to a generic average that fails to represent diverse cultural groups.

Method: CuMA frames alignment as a conditional capacity separation problem. It uses demographic-aware routing to internalize a Latent Cultural Topology, disentangling conflicting gradients into specialized expert subspaces through a Cultural Mixture of Adapters framework.

Result: Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM show CuMA achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Analysis confirms CuMA effectively mitigates mean collapse while preserving cultural diversity.

Conclusion: CuMA successfully addresses the challenge of cultural pluralism in LLM alignment by preventing mean collapse through conditional capacity separation, enabling models to better represent diverse cultural values while maintaining high performance.

Abstract: As Large Language Models (LLMs) serve a global audience, alignment must transition from enforcing universal consensus to respecting cultural pluralism. We demonstrate that dense models, when forced to fit conflicting value distributions, suffer from \textbf{Mean Collapse}, converging to a generic average that fails to represent diverse groups. We attribute this to \textbf{Cultural Sparsity}, where gradient interference prevents dense parameters from spanning distinct cultural modes. To resolve this, we propose \textbf{\textsc{CuMA}} (\textbf{Cu}ltural \textbf{M}ixture of \textbf{A}dapters), a framework that frames alignment as a \textbf{conditional capacity separation} problem. By incorporating demographic-aware routing, \textsc{CuMA} internalizes a \textit{Latent Cultural Topology} to explicitly disentangle conflicting gradients into specialized expert subspaces. Extensive evaluations on WorldValuesBench, Community Alignment, and PRISM demonstrate that \textsc{CuMA} achieves state-of-the-art performance, significantly outperforming both dense baselines and semantic-only MoEs. Crucially, our analysis confirms that \textsc{CuMA} effectively mitigates mean collapse, preserving cultural diversity. Our code is available at https://github.com/Throll/CuMA.

[76] Faithful Summarisation under Disagreement via Belief-Level Aggregation

Favour Yahdii Aghaebe, Tanefa Apekey, Elizabeth Williams, Nafise Sadat Moosavi

Main category: cs.CL

TL;DR: A disagreement-aware summarization pipeline that separates belief aggregation from language generation to better handle conflicting viewpoints in opinion-heavy documents.

Details

Motivation: Existing LLM-based summarization approaches often smooth over disagreements and over-represent majority opinions, limiting faithfulness in opinion-heavy settings with conflicting viewpoints.

Method: Two-stage pipeline: 1) Represent documents as structured belief sets and aggregate using distance-based belief merging operators that explicitly model conflict, 2) Use LLMs only for natural language generation of aggregated beliefs.

Result: While large models can match belief-level aggregation when aggregation is handled during generation, this behavior is unstable across architectures/capacities. Belief-level aggregation with simple prompting yields consistently strong disagreement-aware performance across models while maintaining fluency.

Conclusion: Separating belief aggregation from language generation provides more stable and faithful disagreement-aware summarization across different model types and sizes compared to approaches that handle aggregation during generation.

Abstract: Opinion and multi-document summarisation often involve genuinely conflicting viewpoints, yet many existing approaches, particularly LLM-based systems, implicitly smooth disagreement and over-represent majority opinions. This limits the faithfulness of generated summaries in opinion-heavy settings. We introduce a disagreement-aware synthesis pipeline that separates belief-level aggregation from language generation. Documents are first represented as structured belief sets and aggregated using distance-based belief merging operators that explicitly model conflict. Large language models are then used only to realise the aggregated beliefs as natural language summaries. We evaluate the approach across multiple model families and scales, comparing it to methods that perform explicit aggregation during generation. Our results show that while sufficiently large models can match belief-level aggregation when aggregation is handled at generation time, this behaviour is not stable across architectures or capacities. In contrast, belief-level aggregation combined with simple prompting yields consistently strong disagreement-aware performance across models, while maintaining fluent and grounded summaries.

[77] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences

Arkadiusz Modzelewski, Paweł Golik, Anna Kołos, Giovanni Da San Martino

Main category: cs.CL

TL;DR: LLM-generated persuasive text detection is challenging - while overt persuasion is easier to detect, subtle LLM persuasion degrades detection performance. The paper introduces Persuaficial benchmark and provides linguistic analysis for better detection tools.

Details

Motivation: LLMs can generate highly persuasive text, raising concerns about misuse for propaganda and manipulation. The central question is whether LLM-generated persuasion is more difficult to automatically detect than human-written persuasion.

Method: Categorize controllable generation approaches for producing persuasive content with LLMs, introduce Persuaficial (a high-quality multilingual benchmark covering 6 languages), conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts, and provide comprehensive linguistic analysis.

Result: Overtly persuasive LLM-generated texts are easier to detect than human-written ones, but subtle LLM-generated persuasion consistently degrades automatic detection performance. The paper provides the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts.

Conclusion: LLM-generated persuasion presents detection challenges, especially when subtle. The linguistic analysis offers insights that may guide development of more interpretable and robust detection tools to address concerns about LLM misuse for harmful persuasive purposes.

Abstract: Large Language Models (LLMs) can generate highly persuasive text, raising concerns about their misuse for propaganda, manipulation, and other harmful purposes. This leads us to our central question: Is LLM-generated persuasion more difficult to automatically detect than human-written persuasion? To address this, we categorize controllable generation approaches for producing persuasive content with LLMs and introduce Persuaficial, a high-quality multilingual benchmark covering six languages: English, German, Polish, Italian, French and Russian. Using this benchmark, we conduct extensive empirical evaluations comparing human-authored and LLM-generated persuasive texts. We find that although overtly persuasive LLM-generated texts can be easier to detect than human-written ones, subtle LLM-generated persuasion consistently degrades automatic detection performance. Beyond detection performance, we provide the first comprehensive linguistic analysis contrasting human and LLM-generated persuasive texts, offering insights that may guide the development of more interpretable and robust detection tools.

[78] GenProve: Learning to Generate Text with Fine-Grained Provenance

Jingxuan Wei, Xingyue Wang, Yanghaoyu Liao, Jie Dong, Yuchen Liu, Caijun Jia, Bihui Yu, Junnan Zhu

Main category: cs.CL

TL;DR: The paper introduces Generation-time Fine-grained Provenance, a task requiring LLMs to generate answers with structured sentence-level provenance triples. They create ReFInE dataset with expert annotations distinguishing Quotation, Compression, and Inference, and propose GenProve framework using SFT+GRPO that outperforms 14 LLMs in joint evaluation.

Details

Motivation: LLMs often hallucinate, and while adding citations helps, it's insufficient for accountability because users struggle to verify how cited sources support generated claims. Existing methods are coarse-grained and fail to distinguish between direct quotes and complex reasoning.

Method: 1) Introduce Generation-time Fine-grained Provenance task requiring models to generate fluent answers with structured sentence-level provenance triples. 2) Create ReFInE dataset with expert-verified annotations distinguishing Quotation, Compression, and Inference. 3) Propose GenProve framework combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to optimize composite reward for answer fidelity and provenance correctness.

Result: GenProve significantly outperforms 14 strong LLMs in joint evaluation. Analysis reveals a reasoning gap: models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting verifiable reasoning remains a distinct frontier challenge from surface-level citation.

Conclusion: The paper demonstrates that fine-grained provenance generation is crucial for LLM accountability, and that inference-based provenance represents a distinct challenge requiring specialized approaches beyond simple citation mechanisms.

Abstract: Large language models (LLM) often hallucinate, and while adding citations is a common solution, it is frequently insufficient for accountability as users struggle to verify how a cited source supports a generated claim. Existing methods are typically coarse-grained and fail to distinguish between direct quotes and complex reasoning. In this paper, we introduce Generation-time Fine-grained Provenance, a task where models must generate fluent answers while simultaneously producing structured, sentence-level provenance triples. To enable this, we present ReFInE (Relation-aware Fine-grained Interpretability & Evidence), a dataset featuring expert verified annotations that distinguish between Quotation, Compression, and Inference. Building on ReFInE, we propose GenProve, a framework that combines Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). By optimizing a composite reward for answer fidelity and provenance correctness, GenProve significantly outperforms 14 strong LLMs in joint evaluation. Crucially, our analysis uncovers a reasoning gap where models excel at surface-level quotation but struggle significantly with inference-based provenance, suggesting that verifiable reasoning remains a frontier challenge distinct from surface-level citation.

[79] Text as a Universal Interface for Transferable Personalization

Yuting Liu, Jian Guan, Jia-Nan Li, Wei Wu, Jiang-Ming Yang, Jianzhe Zhao, Guibing Guo

Main category: cs.CL

TL;DR: The paper proposes using natural language as a universal interface for representing user preferences in LLMs, creating interpretable and transferable preference descriptions through a two-stage training framework called AlignXplore+.

Details

Motivation: Current LLM personalization methods use opaque "black-box" vector representations that are difficult to interpret, transfer across models/tasks, and evolve over time. The authors want a more transparent, universal preference representation.

Method: Two-stage training framework: 1) Supervised fine-tuning on high-quality synthesized data, 2) Reinforcement learning to optimize long-term utility and cross-task transferability. This produces AlignXplore+, a model that generates textual preference summaries.

Result: The 8B AlignXplore+ model achieves state-of-the-art performance on nine benchmarks, outperforming substantially larger open-source models, while showing strong transferability across tasks, model families, and interaction formats.

Conclusion: Natural language provides an effective universal interface for preference representation in LLMs, enabling interpretable, reusable, and evolvable preference descriptions that outperform traditional opaque vector representations.

Abstract: We study the problem of personalization in large language models (LLMs). Prior work predominantly represents user preferences as implicit, model-specific vectors or parameters, yielding opaque ``black-box’’ profiles that are difficult to interpret and transfer across models and tasks. In contrast, we advocate natural language as a universal, model- and task-agnostic interface for preference representation. The formulation leads to interpretable and reusable preference descriptions, while naturally supporting continual evolution as new interactions are observed. To learn such representations, we introduce a two-stage training framework that combines supervised fine-tuning on high-quality synthesized data with reinforcement learning to optimize long-term utility and cross-task transferability. Based on this framework, we develop AlignXplore+, a universal preference reasoning model that generates textual preference summaries. Experiments on nine benchmarks show that our 8B model achieves state-of-the-art performanc – outperforming substantially larger open-source models – while exhibiting strong transferability across tasks, model families, and interaction formats.

[80] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization

Xueyun Tian, Minghua Ma, Bingbing Xu, Nuoyan Lyu, Wei Li, Heng Dong, Zheng Chu, Yuanzhuo Wang, Huawei Shen

Main category: cs.CL

TL;DR: Incorporating negative reasoning trajectories (incorrect final answers) into supervised fine-tuning improves out-of-domain generalization by mitigating overfitting and boosting exploration, with a proposed adaptive loss weighting method (GLOW) further enhancing performance.

Details

Motivation: Standard SFT on CoT trajectories only uses positive examples (correct final answers), discarding negative trajectories. This wastes valuable supervision and exacerbates overfitting, limiting out-of-domain generalization. The authors argue that negative trajectories often contain valid intermediate reasoning despite incorrect final answers.

Method: 1) Incorporate negative trajectories into SFT training alongside positives. 2) Systematically analyze data, training dynamics, and inference behavior to identify 22 recurring patterns in negative chains. 3) Propose Gain-based LOss Weighting (GLOW), an adaptive sample-aware scheme that rescales per-sample loss based on inter-epoch progress to exploit distinctive training dynamics.

Result: Negative trajectories yield substantial OOD generalization gains over positive-only training. They moderate loss descent to mitigate overfitting and boost policy entropy by 35.67% during inference to facilitate exploration. GLOW efficiently leverages unfiltered trajectories, achieving 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.

Conclusion: Incorporating negative reasoning trajectories into SFT provides significant benefits for OOD generalization by preventing overfitting and enhancing exploration. The proposed GLOW method further optimizes this approach by adaptively weighting samples based on training progress, demonstrating that unfiltered trajectories contain valuable supervision that should not be discarded.

Abstract: Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectories demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.

[81] Can Large Language Models Resolve Semantic Discrepancy in Self-Destructive Subcultures? Evidence from Jirai Kei

Peng Wang, Xilin Tao, Siyi Yao, Jiageng Wu, Yuntao Zou, Zhuotao Tian, Libo Qin, Dagang Li

Main category: cs.CL

TL;DR: SAS is a multi-agent framework that improves LLM-based detection of self-destructive behaviors in subcultures by addressing knowledge lag and semantic misalignment through automatic retrieval and subcultural alignment.

Details

Motivation: Self-destructive behaviors are hard to diagnose, especially in subcultures with unique expressions. While LLMs show promise for detection, they face challenges with rapidly evolving subcultural slang and nuanced semantic understanding specific to these groups.

Method: Proposed Subcultural Alignment Solver (SAS), a multi-agent framework incorporating automatic retrieval to address knowledge lag and subculture alignment techniques to handle semantic misalignment in detecting self-destructive behaviors.

Result: SAS outperforms the current advanced multi-agent framework OWL and competes well with fine-tuned LLMs, demonstrating significant enhancement in detecting self-destructive behavior within subcultural contexts.

Conclusion: SAS advances self-destructive behavior detection in subcultures and serves as a valuable resource for future research, effectively addressing the challenges of knowledge lag and semantic misalignment in LLM-based approaches.

Abstract: Self-destructive behaviors are linked to complex psychological states and can be challenging to diagnose. These behaviors may be even harder to identify within subcultural groups due to their unique expressions. As large language models (LLMs) are applied across various fields, some researchers have begun exploring their application for detecting self-destructive behaviors. Motivated by this, we investigate self-destructive behavior detection within subcultures using current LLM-based methods. However, these methods have two main challenges: (1) Knowledge Lag: Subcultural slang evolves rapidly, faster than LLMs’ training cycles; and (2) Semantic Misalignment: it is challenging to grasp the specific and nuanced expressions unique to subcultures. To address these issues, we proposed Subcultural Alignment Solver (SAS), a multi-agent framework that incorporates automatic retrieval and subculture alignment, significantly enhancing the performance of LLMs in detecting self-destructive behavior. Our experimental results show that SAS outperforms the current advanced multi-agent framework OWL. Notably, it competes well with fine-tuned LLMs. We hope that SAS will advance the field of self-destructive behavior detection in subcultural contexts and serve as a valuable resource for future researchers.

[82] Hán Dān Xué Bù (Mimicry) or Qīng Chū Yú Lán (Mastery)? A Cognitive Perspective on Reasoning Distillation in Large Language Models

Yueqing Hu, Xinyang Peng, Shuting Peng, Hanqi Wang, Tianhong Wang

Main category: cs.CL

TL;DR: Reasoning distillation via SFT fails to transfer cognitive structure from teacher models, causing functional alignment collapse where students mimic linguistic form without internalizing resource allocation policies.

Details

Motivation: Large Reasoning Models trained via RL show natural alignment with human cognitive costs, but current reasoning distillation methods may not preserve this cognitive structure, potentially leading to superficial mimicry.

Method: Tested the “Hán Dān Xué Bù” (Superficial Mimicry) hypothesis across 14 models, comparing teacher models’ alignment with human difficulty scaling to distilled students’ performance, analyzing the degradation in cognitive alignment.

Result: Teacher models show strong alignment with human difficulty scaling (r̄=0.64), but distilled students significantly degrade this alignment (r̄=0.34), often underperforming pre-distillation baselines (Negative Transfer). SFT induces a “Cargo Cult” effect where students replicate linguistic form without internalizing dynamic resource allocation.

Conclusion: Reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition emerges from active reinforcement learning rather than passive imitation through supervised fine-tuning.

Abstract: Recent Large Reasoning Models trained via reinforcement learning exhibit a “natural” alignment with human cognitive costs. However, we show that the prevailing paradigm of reasoning distillation – training student models to mimic these traces via Supervised Fine-Tuning (SFT) – fails to transmit this cognitive structure. Testing the “Hán Dān Xué Bù” (Superficial Mimicry) hypothesis across 14 models, we find that distillation induces a “Functional Alignment Collapse”: while teacher models mirror human difficulty scaling ($\bar{r}=0.64$), distilled students significantly degrade this alignment ($\bar{r}=0.34$), often underperforming their own pre-distillation baselines (“Negative Transfer”). Our analysis suggests that SFT induces a “Cargo Cult” effect, where students ritualistically replicate the linguistic form of reasoning (verbosity) without internalizing the teacher’s dynamic resource allocation policy. Consequently, reasoning distillation decouples computational cost from cognitive demand, revealing that human-like cognition is an emergent property of active reinforcement, not passive imitation.

[83] ArcAligner: Adaptive Recursive Aligner for Compressed Context Embeddings in RAG

Jianbo Li, Yi Jiang, Sendong Zhao, Bairui Hu, Haochun Wang, Bing Qin

Main category: cs.CL

TL;DR: ArcAligner is a lightweight module that helps LLMs better understand highly compressed context in RAG systems, using adaptive gating to add processing only when needed, improving performance on knowledge-intensive QA tasks.

Details

Motivation: RAG helps LLMs stay accurate but feeding long documents into prompts makes models slow and expensive. While context compression techniques exist, they create a trade-off: more compression leads to worse model understanding of the compressed data.

Method: ArcAligner (Adaptive recursive context Aligner) is a lightweight module integrated into language model layers. It uses an adaptive gating system that only adds extra processing power when information is complex, helping models better utilize highly compressed context representations for downstream generation.

Result: ArcAligner consistently beats compression baselines at comparable compression rates across knowledge-intensive QA benchmarks, especially on multi-hop and long-tail settings.

Conclusion: ArcAligner effectively addresses the compression-understanding trade-off in RAG systems, providing better performance while maintaining efficiency through adaptive processing.

Abstract: Retrieval-Augmented Generation (RAG) helps LLMs stay accurate, but feeding long documents into a prompt makes the model slow and expensive. This has motivated context compression, ranging from token pruning and summarization to embedding-based compression. While researchers have tried ‘‘compressing’’ these documents into smaller summaries or mathematical embeddings, there is a catch: the more you compress the data, the more the LLM struggles to understand it. To address this challenge, we propose ArcAligner (Adaptive recursive context Aligner), a lightweight module integrated into the language model layers to help the model better utilize highly compressed context representations for downstream generation. It uses an adaptive ‘‘gating’’ system that only adds extra processing power when the information is complex, keeping the system fast. Across knowledge-intensive QA benchmarks, ArcAligner consistently beats compression baselines at comparable compression rates, especially on multi-hop and long-tail settings. The source code is publicly available.

[84] Compositional Steering of Large Language Models with Steering Tokens

Gorjan Radevski, Kiril Gashteovski, Giwon Hong, Carolin Lawrence, Goran Glavaš

Main category: cs.CL

TL;DR: Compositional steering tokens enable multi-behavior LLM control by embedding behaviors into input tokens that can be combined, allowing effective zero-shot composition and generalization to unseen behavior combinations.

Details

Motivation: Real-world LLM applications require controllable outputs that satisfy multiple desiderata simultaneously. While single-behavior steering has been extensively studied, compositional steering (steering LLMs toward multiple behaviors at once) remains underexplored.

Method: Proposes compositional steering tokens: 1) Embed individual behaviors (natural language instructions) into dedicated tokens via self-distillation, 2) Train a composition token on behavior pairs that learns to combine behaviors, enabling generalization to unseen compositions including those with unseen behaviors and varying numbers of behaviors.

Result: Steering tokens achieve superior multi-behavior control compared to instructions, activation steering, and LoRA merging across different LLM architectures. They complement natural language instructions, with combined approaches yielding further performance gains.

Conclusion: Compositional steering tokens provide an effective approach for multi-behavior LLM control that operates in input token space rather than activation space, enabling better zero-shot composition and generalization to novel behavior combinations.

Abstract: Deploying LLMs in real-world applications requires controllable output that satisfies multiple desiderata at the same time. While existing work extensively addresses LLM steering for a single behavior, \textit{compositional steering} – i.e., steering LLMs simultaneously towards multiple behaviors – remains an underexplored problem. In this work, we propose \emph{compositional steering tokens} for multi-behavior steering. We first embed individual behaviors, expressed as natural language instructions, into dedicated tokens via self-distillation. Contrary to most prior work, which operates in the activation space, our behavior steers live in the space of input tokens, enabling more effective zero-shot composition. We then train a dedicated \textit{composition token} on pairs of behaviors and show that it successfully captures the notion of composition: it generalizes well to \textit{unseen} compositions, including those with unseen behaviors as well as those with an unseen \textit{number} of behaviors. Our experiments across different LLM architectures show that steering tokens lead to superior multi-behavior control compared to competing approaches (instructions, activation steering, and LoRA merging). Moreover, we show that steering tokens complement natural language instructions, with their combination resulting in further gains.

[85] SemPA: Improving Sentence Embeddings of Large Language Models through Semantic Preference Alignment

Ziyang Chen, Zhenxuan Huang, Yile Wang, Weiqin Wang, Lu Yin, Hui Huang

Main category: cs.CL

TL;DR: SemPA improves sentence embeddings in LLMs using semantic preference alignment without compromising generative capabilities.

Details

Motivation: Existing LLM-based embedding methods either use fixed prompts (limited performance) or modify model architecture (compromises generative ability). Need a method that enhances semantic representations while preserving LLMs' generative capabilities.

Method: Uses semantic preference alignment via sentence-level Direct Preference Optimization (DPO) on paraphrase generation tasks. Theoretically connects DPO to contrastive learning under Plackett-Luce model framework.

Result: Achieves better semantic representations on semantic textual similarity tasks and various LLM benchmarks without sacrificing generative capability.

Conclusion: SemPA successfully boosts sentence representations in LLMs while preserving their inherent generative abilities through semantic preference alignment.

Abstract: Traditional sentence embedding methods employ token-level contrastive learning on non-generative pre-trained models. Recently, there have emerged embedding methods based on generative large language models (LLMs). These methods either rely on fixed prompt templates or involve modifications to the model architecture. The former lacks further optimization of the model and results in limited performance, while the latter alters the internal computational mechanisms of the model, thereby compromising its generative capabilities. We propose SemPA, a novel approach that boosts the sentence representations while preserving the generative ability of LLMs via semantic preference alignment. We leverage sentence-level Direct Preference Optimization (DPO) to efficiently optimize LLMs on a paraphrase generation task, where the model learns to discriminate semantically equivalent sentences while preserving inherent generative capacity. Theoretically, we establish a formal connection between DPO and contrastive learning under the Plackett-Luce model framework. Empirically, experimental results on both semantic textual similarity tasks and various benchmarks for LLMs show that SemPA achieves better semantic representations without sacrificing the inherent generation capability of LLMs.

[86] Code-Mix Sentiment Analysis on Hinglish Tweets

Aashi Garg, Aneshya Das, Arshi Arya, Anushka Goyal, Aditi

Main category: cs.CL

TL;DR: The paper proposes a fine-tuned mBERT framework with subword tokenization for accurate sentiment analysis of Hinglish tweets, addressing the limitations of traditional monolingual NLP models in handling code-mixed Indian social media content.

Details

Motivation: Traditional NLP models fail to interpret the syntactic and semantic complexity of Hinglish (Hindi-English code-mixed language) used widely on Indian social media platforms like Twitter, leading to inaccurate sentiment analysis and misleading market insights for brand monitoring.

Method: The approach fine-tunes mBERT (Multilingual BERT) to leverage its multilingual capabilities, with a key focus on subword tokenization to effectively handle spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish text.

Result: The research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.

Conclusion: The proposed framework successfully addresses the challenges of Hinglish sentiment analysis, providing an effective solution for brand monitoring in India’s multilingual social media landscape and advancing multilingual NLP capabilities for code-mixed languages.

Abstract: The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish–a hybrid of Hindi and English–used widely in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.

[87] How Human is AI? Examining the Impact of Emotional Prompts on Artificial and Human and Responsiveness

Florence Bernays, Marco Henriques Pereira, Jochen Menges

Main category: cs.CL

TL;DR: Emotional tone in human-AI interactions affects both ChatGPT’s responses and subsequent human-human communication, with praise being most effective for improving AI outputs.

Details

Motivation: To understand how emotional expressions directed at AI systems influence both the AI's behavior and subsequent human communication patterns.

Method: Between-subject experiment where participants expressed specific emotions while working with ChatGPT (GPT-4.0) on two tasks: writing a public response and addressing an ethical dilemma. Conditions included neutral tone, praise, anger, and blame.

Result: Praise led to greatest improvement in ChatGPT’s answers; anger showed smaller improvement; blame showed no improvement. For ethical dilemmas, anger reduced ChatGPT’s prioritization of corporate interests, while blame increased emphasis on protecting public interest. Participants used more negative expressions in subsequent human communication after blaming interactions.

Conclusion: Emotional tone in human-AI interactions significantly shapes both AI outputs and carries over to affect subsequent human-human communication, highlighting the bidirectional influence between emotional expressions and AI/human behavior.

Abstract: This research examines how the emotional tone of human-AI interactions shapes ChatGPT and human behavior. In a between-subject experiment, we asked participants to express a specific emotion while working with ChatGPT (GPT-4.0) on two tasks, including writing a public response and addressing an ethical dilemma. We found that compared to interactions where participants maintained a neutral tone, ChatGPT showed greater improvement in its answers when participants praised ChatGPT for its responses. Expressing anger towards ChatGPT also led to a higher albeit smaller improvement relative to the neutral condition, whereas blaming ChatGPT did not improve its answers. When addressing an ethical dilemma, ChatGPT prioritized corporate interests less when participants expressed anger towards it, while blaming increases its emphasis on protecting the public interest. Additionally, we found that people used more negative, hostile, and disappointing expressions in human-human communication after interactions during which participants blamed rather than praised for their responses. Together, our findings demonstrate that the emotional tone people apply in human-AI interactions not only shape ChatGPT’s outputs but also carry over into subsequent human-human communication.

[88] Agent-as-a-Judge

Runyang You, Hongru Cai, Caiqi Zhang, Qiancheng Xu, Meng Liu, Tiezheng Yu, Yongqi Li, Wenjie Li

Main category: cs.CL

TL;DR: Survey paper tracing the evolution from LLM-as-a-Judge to Agent-as-a-Judge, establishing a taxonomy and roadmap for agentic evaluation systems.

Details

Motivation: LLM-as-a-Judge has limitations with complex, specialized, multi-step evaluations due to biases, shallow reasoning, and inability to verify against real-world observations. The field lacks a unified framework for the emerging Agent-as-a-Judge paradigm.

Method: Comprehensive survey approach: identifying key dimensions of the paradigm shift, establishing a developmental taxonomy, organizing core methodologies, and surveying applications across general and professional domains.

Result: First comprehensive survey mapping the transition to agentic evaluation, providing a framework to navigate the landscape, analyzing frontier challenges, and identifying promising research directions.

Conclusion: The paper provides a clear roadmap for next-generation agentic evaluation systems, bridging the gap in unified frameworks for this emerging field.

Abstract: LLM-as-a-Judge has revolutionized AI evaluation by leveraging large language models for scalable assessments. However, as evaluands become increasingly complex, specialized, and multi-step, the reliability of LLM-as-a-Judge has become constrained by inherent biases, shallow single-pass reasoning, and the inability to verify assessments against real-world observations. This has catalyzed the transition to Agent-as-a-Judge, where agentic judges employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory to enable more robust, verifiable, and nuanced evaluations. Despite the rapid proliferation of agentic evaluation systems, the field lacks a unified framework to navigate this shifting landscape. To bridge this gap, we present the first comprehensive survey tracing this evolution. Specifically, we identify key dimensions that characterize this paradigm shift and establish a developmental taxonomy. We organize core methodologies and survey applications across general and professional domains. Furthermore, we analyze frontier challenges and identify promising research directions, ultimately providing a clear roadmap for the next generation of agentic evaluation.

[89] DocDancer: Towards Agentic Document-Grounded Information Seeking

Qintong Zhang, Xinjie Lv, Jialong Wu, Baixuan Li, Zhengwei Tao, Guochen Yan, Huanyao Zhang, Bin Wang, Jiahao Xu, Haitao Mi, Wentao Zhang

Main category: cs.CL

TL;DR: DocDancer is an open-source document QA agent with tool-driven framework for document exploration and comprehension, trained end-to-end using synthetic data generation.

Details

Motivation: Existing DocQA agents lack effective tool utilization and rely heavily on closed-source models, creating a need for open-source solutions with better tool integration.

Method: Proposes a tool-driven agent framework for document exploration and comprehension, with an Exploration-then-Synthesis data synthesis pipeline to generate training data for end-to-end training.

Result: Trained models show effectiveness on MMLongBench-Doc and DocBench benchmarks, with analysis providing insights for agentic tool design and synthetic data.

Conclusion: DocDancer demonstrates the viability of open-source, tool-driven document QA agents trained end-to-end with synthetic data, advancing document understanding capabilities.

Abstract: Document Question Answering (DocQA) focuses on answering questions grounded in given documents, yet existing DocQA agents lack effective tool utilization and largely rely on closed-source models. In this work, we introduce DocDancer, an end-to-end trained open-source Doc agent. We formulate DocQA as an information-seeking problem and propose a tool-driven agent framework that explicitly models document exploration and comprehension. To enable end-to-end training of such agents, we introduce an Exploration-then-Synthesis data synthesis pipeline that addresses the scarcity of high-quality training data for DocQA. Training on the synthesized data, the trained models on two long-context document understanding benchmarks, MMLongBench-Doc and DocBench, show their effectiveness. Further analysis provides valuable insights for the agentic tool design and synthetic data.

[90] RelayLLM: Efficient Reasoning via Collaborative Decoding

Chengsong Huang, Tong Zheng, Langlin Huang, Jinyuan Li, Haolin Liu, Jiaxin Huang

Main category: cs.CL

TL;DR: RelayLLM is a token-level collaborative decoding framework where SLMs dynamically invoke LLMs only for critical tokens, achieving near-LLM performance with 98.2% cost reduction.

Details

Motivation: Current collaborative approaches between LLMs and SLMs operate at coarse granularity (entire queries), causing computational waste when SLMs can handle most reasoning steps. There's a need for more efficient token-level collaboration.

Method: RelayLLM uses token-level collaborative decoding where SLMs act as controllers that dynamically invoke LLMs via special commands for critical tokens. Training involves warm-up and Group Relative Policy Optimization (GRPO) to balance independence with strategic help-seeking.

Result: Achieves 49.52% average accuracy across six benchmarks, bridging performance gap between models while invoking LLMs for only 1.07% of total generated tokens, offering 98.2% cost reduction compared to performance-matched random routers.

Conclusion: RelayLLM enables efficient reasoning through fine-grained token-level collaboration between SLMs and LLMs, dramatically reducing computational costs while maintaining strong performance.

Abstract: Large Language Models (LLMs) for complex reasoning is often hindered by high computational costs and latency, while resource-efficient Small Language Models (SLMs) typically lack the necessary reasoning capacity. Existing collaborative approaches, such as cascading or routing, operate at a coarse granularity by offloading entire queries to LLMs, resulting in significant computational waste when the SLM is capable of handling the majority of reasoning steps. To address this, we propose RelayLLM, a novel framework for efficient reasoning via token-level collaborative decoding. Unlike routers, RelayLLM empowers the SLM to act as an active controller that dynamically invokes the LLM only for critical tokens via a special command, effectively “relaying” the generation process. We introduce a two-stage training framework, including warm-up and Group Relative Policy Optimization (GRPO) to teach the model to balance independence with strategic help-seeking. Empirical results across six benchmarks demonstrate that RelayLLM achieves an average accuracy of 49.52%, effectively bridging the performance gap between the two models. Notably, this is achieved by invoking the LLM for only 1.07% of the total generated tokens, offering a 98.2% cost reduction compared to performance-matched random routers.

[91] Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference

Rasmus Blanck, Bill Noble, Stergios Chatzikyriakidis

Main category: cs.CL

TL;DR: The paper analyzes the logical properties of Natural Language Inference (NLI) by formulating three possible readings of NLI labels and evaluating which logical interpretation is encoded in the SNLI dataset through meta-inferential consistency analysis.

Details

Motivation: NLI is important for evaluating language models' natural language understanding, but its logical properties are poorly understood and often mischaracterized. Understanding what kind of inference NLI captures is crucial for interpreting model performance on the task.

Method: The authors formulate three possible readings of the NLI label set (entailment, contradiction, neutral) and analyze their meta-inferential properties. They use two approaches: (1) NLI items with shared premises, and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency.

Result: The analysis reveals which reading of the logical relations is actually encoded by the SNLI dataset, providing insights into how NLI labels should be interpreted logically.

Conclusion: The paper provides a systematic framework for understanding the logical properties of NLI, helping to clarify what kind of inference the task actually evaluates and enabling better interpretation of model performance on NLI benchmarks.

Abstract: Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.

[92] Inside Out: Evolving User-Centric Core Memory Trees for Long-Term Personalized Dialogue Systems

Jihao Zhao, Ding Chen, Zhaoxin Fan, Kerun Xu, Mengting Hu, Bo Tang, Feiyu Xiong, Zhiyu li

Main category: cs.CL

TL;DR: Inside Out framework uses PersonaTree for long-term personalized dialogue, with MemListener for structured memory operations, outperforming existing methods in noise suppression and persona consistency.

Details

Motivation: Existing long-term personalized dialogue systems struggle with memory noise accumulation, reasoning degradation, and persona inconsistency due to unbounded interactions within finite context constraints.

Method: Proposes Inside Out framework with globally maintained PersonaTree for user profiling, trained lightweight MemListener via RL with process-based rewards for structured operations (ADD, UPDATE, DELETE, NO_OP), and dual-mode response generation (direct tree usage vs. agentic mode for details).

Result: PersonaTree outperforms full-text concatenation and personalized memory systems in suppressing contextual noise and maintaining persona consistency. MemListener achieves memory-operation decision performance comparable to or surpassing powerful reasoning models like DeepSeek-R1-0528 and Gemini-3-Pro.

Conclusion: The Inside Out framework with PersonaTree and MemListener effectively addresses long-term personalized dialogue challenges through structured memory management, achieving both efficiency and consistency while enabling dynamic persona evolution.

Abstract: Existing long-term personalized dialogue systems struggle to reconcile unbounded interaction streams with finite context constraints, often succumbing to memory noise accumulation, reasoning degradation, and persona inconsistency. To address these challenges, this paper proposes Inside Out, a framework that utilizes a globally maintained PersonaTree as the carrier of long-term user profiling. By constraining the trunk with an initial schema and updating the branches and leaves, PersonaTree enables controllable growth, achieving memory compression while preserving consistency. Moreover, we train a lightweight MemListener via reinforcement learning with process-based rewards to produce structured, executable, and interpretable {ADD, UPDATE, DELETE, NO_OP} operations, thereby supporting the dynamic evolution of the personalized tree. During response generation, PersonaTree is directly leveraged to enhance outputs in latency-sensitive scenarios; when users require more details, the agentic mode is triggered to introduce details on-demand under the constraints of the PersonaTree. Experiments show that PersonaTree outperforms full-text concatenation and various personalized memory systems in suppressing contextual noise and maintaining persona consistency. Notably, the small MemListener model achieves memory-operation decision performance comparable to, or even surpassing, powerful reasoning models such as DeepSeek-R1-0528 and Gemini-3-Pro.

[93] LELA: an LLM-based Entity Linking Approach with Zero-Shot Domain Adaptation

Samy Haffoudhi, Fabian M. Suchanek, Nils Holzenberger

Main category: cs.CL

TL;DR: LELA is a modular coarse-to-fine entity linking method using LLMs without fine-tuning, competitive with fine-tuned approaches and outperforming non-fine-tuned methods.

Details

Motivation: Entity linking is crucial for knowledge graph construction, QA, and information extraction. Existing methods often require fine-tuning, limiting flexibility across domains, knowledge bases, and LLMs.

Method: LELA uses a modular coarse-to-fine approach leveraging LLMs without fine-tuning. It works across different domains, knowledge bases, and LLMs through its flexible architecture.

Result: Experiments across various entity linking settings show LELA is highly competitive with fine-tuned approaches and substantially outperforms non-fine-tuned methods.

Conclusion: LELA provides an effective, flexible entity linking solution that eliminates the need for fine-tuning while maintaining competitive performance across diverse settings.

Abstract: Entity linking (mapping ambiguous mentions in text to entities in a knowledge base) is a foundational step in tasks such as knowledge graph construction, question-answering, and information extraction. Our method, LELA, is a modular coarse-to-fine approach that leverages the capabilities of large language models (LLMs), and works with different target domains, knowledge bases and LLMs, without any fine-tuning phase. Our experiments across various entity linking settings show that LELA is highly competitive with fine-tuned approaches, and substantially outperforms the non-fine-tuned ones.

[94] Measuring and Fostering Peace through Machine Learning and Artificial Intelligence

P. Gilda, P. Dungarwal, A. Thongkham, E. T. Ajayi, S. Choudhary, T. M. Terol, C. Lam, J. P. Araujo, M. McFadyen-Mungalln, L. S. Liebovitch, P. T. Coleman, H. West, K. Sieck, S. Carter

Main category: cs.CL

TL;DR: Researchers used AI to measure peace levels in countries from news/social media and developed a Chrome extension (MirrorMirror) that gives real-time feedback on content peacefulness to promote more respectful communication.

Details

Motivation: 71% of young adults get news from short social media videos, and content creators often use emotional activation (anger) to increase engagement, potentially harming peaceful discourse. There's a need to move beyond simple engagement metrics toward tools that promote understanding of media tone and its effects.

Method: 1) Used neural networks on text embeddings from online news to measure peace levels, with cross-dataset validation. 2) Developed models for social media (YouTube) using word-level (GoEmotions) and context-level (LLM) methods to measure social dimensions important for peace. 3) Built and tested MirrorMirror Chrome extension providing real-time peacefulness feedback to YouTube viewers.

Result: News peace measurement model trained on one dataset showed high accuracy on different news dataset. Developed functional Chrome extension that provides real-time feedback. Identified that 71% of 20-40 year olds get daily news from short social media videos.

Conclusion: The MirrorMirror tool aims to evolve into open-source platform for content creators, journalists, researchers, platforms, and users to better understand media tone and its effects, encouraging more respectful, nuanced communication beyond simple engagement metrics.

Abstract: We used machine learning and artificial intelligence: 1) to measure levels of peace in countries from news and social media and 2) to develop on-line tools that promote peace by helping users better understand their own media diet. For news media, we used neural networks to measure levels of peace from text embeddings of on-line news sources. The model, trained on one news media dataset also showed high accuracy when used to analyze a different news dataset. For social media, such as YouTube, we developed other models to measure levels of social dimensions important in peace using word level (GoEmotions) and context level (Large Language Model) methods. To promote peace, we note that 71% of people 20-40 years old daily view most of their news through short videos on social media. Content creators of these videos are biased towards creating videos with emotional activation, making you angry to engage you, to increase clicks. We developed and tested a Chrome extension, MirrorMirror, which provides real-time feedback to YouTube viewers about the peacefulness of the media they are watching. Our long term goal is for MirrorMirror to evolve into an open-source tool for content creators, journalists, researchers, platforms, and individual users to better understand the tone of their media creation and consumption and its effects on viewers. Moving beyond simple engagement metrics, we hope to encourage more respectful, nuanced, and informative communication.

[95] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Peter Belcak, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

Main category: cs.CL

TL;DR: GDPO improves multi-reward RL by decoupling reward normalization to prevent signal collapse, outperforming GRPO across tool calling, math reasoning, and coding tasks.

Details

Motivation: Current RL pipelines use multiple rewards to align language models with diverse human preferences, but GRPO causes reward collapse when normalizing distinct rollout combinations, leading to suboptimal convergence and training failures.

Method: Introduces Group reward-Decoupled Normalization Policy Optimization (GDPO), which decouples normalization of individual rewards to preserve their relative differences and enable more accurate multi-reward optimization with improved stability.

Result: GDPO consistently outperforms GRPO across three tasks (tool calling, math reasoning, coding reasoning) in both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length).

Conclusion: GDPO is an effective and generalizable method for multi-reward reinforcement learning optimization that resolves the reward collapse issue in GRPO while improving training stability and performance.

Abstract: As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, Reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to apply Group Relative Policy Optimization (GRPO) under multi-reward setting without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization, along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.

[96] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Muhammad Abdullahi Said, Muhammad Sammani Sani

Main category: cs.CL

TL;DR: LLM safety alignment doesn’t transfer zero-shot across languages; models show complex interference patterns with reverse linguistic vulnerability and catastrophic temporal reasoning failures, creating dangerous safety pockets for Global South users.

Details

Motivation: The dangerous assumption that safety alignment transfers zero-shot from English to other languages in LLMs integrated into critical global infrastructure, particularly exposing Global South users to localized harms.

Method: Systematic audit of GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus using HausaSafety dataset with West African threat scenarios. Employed 2 x 4 factorial design across 1,440 evaluations testing language (English vs. Hausa) and temporal framing interactions.

Result: Found complex interference mechanisms instead of simple multilingual safety gap: reverse linguistic vulnerability (Claude 4.5 Opus safer in Hausa than English), catastrophic temporal reasoning failures with profound Temporal Asymmetry (past-tense bypassed defenses, future-triggered hyper-conservative refusals), and 9.2x disparity between safest and most vulnerable configurations.

Conclusion: Current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that expose Global South users. Proposes Invariant Alignment as necessary paradigm shift for safety stability across linguistic and temporal shifts.

Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state of the art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the narrative of the multilingual safety gap. Instead of a simple degradation in low-resource settings, we identified a complex interference mechanism in which safety is determined by the intersection of variables. Although the models exhibited a reverse linguistic vulnerability with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.

[97] Pelican Soup Framework: A Theoretical Framework for Language Model Capabilities

Ting-Rui Chiang, Dani Yogatama

Main category: cs.CL

TL;DR: Pelican Soup is a theoretical framework explaining how pretrained LLMs generalize to unseen instructions and perform in-context learning, even with irrelevant verbalizers, by introducing concepts of “knowledge base” and “reference-sense association.”

Details

Motivation: To better understand two key capabilities of pretrained LLMs: (1) generalization to unseen instructions, and (2) in-context learning performance even when verbalizers are irrelevant to the task.

Method: Proposes the Pelican Soup framework with concepts of “knowledge base” and “reference-sense association,” plus a simple formalism for NLP tasks. The framework connects linguistic, psychology, and philosophy studies to language model understanding.

Result: Derives a bound on in-context learning loss using the framework, supports with empirical experiments, and provides future research directions.

Conclusion: Pelican Soup offers a theoretical foundation for understanding LLM capabilities, bridging cognitive science with language model theory, and enabling formal analysis of in-context learning.

Abstract: In this work, we propose a simple theoretical framework, Pelican Soup, aiming to better understand how pretraining allows LLMs to (1) generalize to unseen instructions and (2) perform in-context learning, even when the verbalizers are irrelevant to the task. To this end, in our framework, we introduce the notion of “knowledge base” and “reference-sense association” and a simple formalism for natural language processing tasks. Our framework demonstrates how linguistic, psychology, and philosophy studies can inform our understanding of the language model and is connected to several other existing theoretical results. As an illustration of the usage of our framework, we derive a bound on in-context learning loss with our framework. Finally, we support our framework with empirical experiments and provide possible future research directions.

[98] On the Diagram of Thought

Yifan Zhang, Yang Yuan, Andrew Chi-Chih Yao

Main category: cs.CL

TL;DR: DoT is a framework enabling LLMs to build dynamic reasoning diagrams for complex problems, using category theory for logical consistency and producing auditable reasoning traces.

Details

Motivation: LLMs struggle with complex, multi-step reasoning problems that require structured thinking. Current methods often rely on external controllers or search algorithms, making them inefficient and less transparent.

Method: Diagram of Thought (DoT) framework where LLMs construct dynamic diagrams of ideas, allowing them to propose different lines of thought, critique steps, and synthesize insights. Grounded in category theory for logical consistency and robustness.

Result: A self-contained, efficient reasoning process that avoids complex external controllers, produces auditable step-by-step traces, and bridges the gap between fluent language and formal reasoning.

Conclusion: DoT enables more powerful and transparent reasoning in LLMs through diagrammatic thinking with mathematical foundations, improving complex problem-solving capabilities while maintaining efficiency and auditability.

Abstract: Large Language Models (LLMs) excel at many tasks but often falter on complex problems that require structured, multi-step reasoning. We introduce the Diagram of Thought (DoT), a new framework that enables a single LLM to build and navigate a mental map of its reasoning. Instead of thinking in a straight line, the model constructs a dynamic diagram of ideas, where it can propose different lines of thought, critique its own steps, and synthesize validated insights into a final conclusion. This entire process is self-contained within the model, making it highly efficient by avoiding the complex external controllers or search algorithms required by other methods. To ensure the reliability of this process, we ground DoT in a rigorous mathematical framework from category theory. This foundation guarantees that the way the model combines information is logical, consistent, and robust, regardless of the order in which ideas were explored. The result is a more powerful and transparent reasoning process that produces a fully auditable, step-by-step trace of the LLM’s thinking, bridging the gap between fluent language and formal reasoning.

[99] ChakmaNMT: Machine Translation for a Low-Resource and Endangered Language via Transliteration

Aunabil Chakma, Aditya Chakma, Masum Hasan, Soham Khisa, Chumui Tripura, Rifat Shahriyar

Main category: cs.CL

TL;DR: First systematic study of machine translation for endangered Chakma language, introducing new datasets and showing transliteration is essential for effective MT due to script mismatch and data scarcity.

Details

Motivation: To support language access and preservation for Chakma, an endangered and extremely low-resource Indo-Aryan language, by developing effective machine translation solutions.

Method: Introduces new Chakma-Bangla parallel and monolingual dataset plus trilingual benchmark; proposes character-level transliteration framework exploiting orthographic/phonological relationship between Chakma and Bangla; benchmarks from-scratch MT, fine-tuned pretrained models, and LLMs via in-context learning.

Result: Transliteration is essential; fine-tuning and in-context learning substantially outperform from-scratch baselines; strong asymmetry observed across translation directions.

Conclusion: The study provides foundational resources and methods for Chakma MT, demonstrating that transliteration combined with transfer learning from related languages enables effective translation for endangered low-resource languages.

Abstract: We present the first systematic study of machine translation for Chakma, an endangered and extremely low-resource Indo-Aryan language, with the goal of supporting language access and preservation. We introduce a new Chakma-Bangla parallel and monolingual dataset, along with a trilingual Chakma-Bangla-English benchmark for evaluation. To address script mismatch and data scarcity, we propose a character-level transliteration framework that exploits the close orthographic and phonological relationship between Chakma and Bangla, preserving semantic content while enabling effective transfer from Bangla and multilingual pretrained models. We benchmark from-scratch MT, fine-tuned pretrained models, and large language models via in-context learning. Results show that transliteration is essential and that fine-tuning and in-context learning substantially outperform from-scratch baselines, with strong asymmetry across translation directions.

[100] Is This Collection Worth My LLM’s Time? Automatically Measuring Information Potential in Text Corpora

Tristan Karch, Luca Engel, Philippe Schwaller, Frédéric Kaplan

Main category: cs.CL

TL;DR: Automated pipeline evaluates text collections’ information value for LLMs using MCQ generation and performance gap analysis without training.

Details

Motivation: As LLMs converge in capabilities, identifying valuable new information sources is crucial but evaluating text collections for digitization/integration is challenging and resource-intensive.

Method: Generate multiple choice questions from texts, measure LLM performance with and without source access, use performance gap as proxy for collection’s information potential.

Result: Method effectively identifies collections with valuable novel information across five datasets: EPFL PhD manuscripts, Venetian historical records, Wikipedia articles, and synthetic baseline.

Conclusion: Provides practical tool for prioritizing data acquisition/integration efforts by evaluating information potential without costly training/fine-tuning.

Abstract: As large language models (LLMs) converge towards similar capabilities, the key to advancing their performance lies in identifying and incorporating valuable new information sources. However, evaluating which text collections are worth the substantial investment required for digitization, preprocessing, and integration into LLM systems remains a significant challenge. We present a novel approach to this challenge: an automated pipeline that evaluates the potential information gain from text collections without requiring model training or fine-tuning. Our method generates multiple choice questions (MCQs) from texts and measures an LLM’s performance both with and without access to the source material. The performance gap between these conditions serves as a proxy for the collection’s information potential. We validate our approach using five strategically selected datasets: EPFL PhD manuscripts, a private collection of Venetian historical records, two sets of Wikipedia articles on related topics, and a synthetic baseline dataset. Our results demonstrate that this method effectively identifies collections containing valuable novel information, providing a practical tool for prioritizing data acquisition and integration efforts.

[101] Cognitive-Mental-LLM: Evaluating Reasoning in Large Language Models for Mental Health Prediction via Online Text

Avinash Patil, Amardeep Kour Gedhu

Main category: cs.CL

TL;DR: This study evaluates structured reasoning techniques (CoT, SC-CoT, ToT) for LLM-based mental health classification from Reddit text, showing performance improvements over traditional methods but with dataset-specific limitations.

Details

Motivation: Traditional LLM classification methods for mental health prediction lack interpretability and robustness, creating a need for structured reasoning approaches to improve accuracy and reliability in clinical applications.

Method: Evaluated Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT) reasoning techniques with Zero-shot and Few-shot prompting strategies across multiple mental health datasets from Reddit, comparing against baseline models like BERT, Mental-RoBerta, Mental Alpaca, and Mental-Flan-T5.

Result: Reasoning-enhanced techniques improved classification performance over direct prediction, with notable gains on Dreaddit (+0.52% over M-LLM, +0.82% over BERT) and SDCNL (+4.67% over M-LLM, +2.17% over BERT). Few-shot CoT consistently outperformed other strategies, but performance declined in Depression Severity and CSSRS predictions due to dataset-specific limitations.

Conclusion: Reasoning-driven LLMs show promise for scalable mental health applications, with Few-shot CoT being most effective, but dataset variability highlights challenges in model reliability and interpretability that need addressing for future improvements.

Abstract: Large Language Models (LLMs) have demonstrated potential in predicting mental health outcomes from online text, yet traditional classification methods often lack interpretability and robustness. This study evaluates structured reasoning techniques-Chain-of-Thought (CoT), Self-Consistency (SC-CoT), and Tree-of-Thought (ToT)-to improve classification accuracy across multiple mental health datasets sourced from Reddit. We analyze reasoning-driven prompting strategies, including Zero-shot CoT and Few-shot CoT, using key performance metrics such as Balanced Accuracy, F1 score, and Sensitivity/Specificity. Our findings indicate that reasoning-enhanced techniques improve classification performance over direct prediction, particularly in complex cases. Compared to baselines such as Zero Shot non-CoT Prompting, and fine-tuned pre-trained transformers such as BERT and Mental-RoBerta, and fine-tuned Open Source LLMs such as Mental Alpaca and Mental-Flan-T5, reasoning-driven LLMs yield notable gains on datasets like Dreaddit (+0.52% over M-LLM, +0.82% over BERT) and SDCNL (+4.67% over M-LLM, +2.17% over BERT). However, performance declines in Depression Severity, and CSSRS predictions suggest dataset-specific limitations, likely due to our using a more extensive test set. Among prompting strategies, Few-shot CoT consistently outperforms others, reinforcing the effectiveness of reasoning-driven LLMs. Nonetheless, dataset variability highlights challenges in model reliability and interpretability. This study provides a comprehensive benchmark of reasoning-based LLM techniques for mental health text classification. It offers insights into their potential for scalable clinical applications while identifying key challenges for future improvements.

[102] SciClaims: An End-to-End Generative System for Biomedical Claim Analysis

Raúl Ortega, José Manuel Gómez-Pérez

Main category: cs.CL

TL;DR: SciClaims is an interactive web system for biomedical claim analysis that extracts claims, retrieves PubMed evidence, and verifies veracity using a single LLM without fine-tuning.

Details

Motivation: Addresses the need for automated scientific claim analysis in high-stakes biomedical applications like systematic literature reviews and patent validation, where manual verification is time-consuming and resource-intensive.

Method: Uses a single large language model for end-to-end claim analysis without fine-tuning. System extracts claims from text, retrieves relevant evidence from PubMed, and provides veracity predictions with supporting/refuting evidence and natural language justifications.

Result: Developed a publicly available web-based system optimized to run efficiently on a single GPU, featuring a user-friendly interface for interactive claim analysis in the biomedical domain.

Conclusion: SciClaims provides an efficient, integrated solution for scientific claim verification that simplifies the analysis process and makes it accessible for practical applications in biomedical research and validation.

Abstract: We present SciClaims, an interactive web-based system for end-to-end scientific claim analysis in the biomedical domain. Designed for high-stakes use cases such as systematic literature reviews and patent validation, SciClaims extracts claims from text, retrieves relevant evidence from PubMed, and verifies their veracity. The system features a user-friendly interface where users can input scientific text and view extracted claims, predictions, supporting or refuting evidence, and justifications in natural language. Unlike prior approaches, SciClaims seamlessly integrates the entire scientific claim analysis process using a single large language model, without requiring additional fine-tuning. SciClaims is optimized to run efficiently on a single GPU and is publicly available for live interaction.

[103] Establishing a Scale for Kullback–Leibler Divergence in Language Models Across Various Settings

Ryo Kishino, Yusuke Takase, Momose Oyama, Hiroaki Yamagiwa, Hidetoshi Shimodaira

Main category: cs.CL

TL;DR: The paper proposes using log-likelihood vectors as a unified framework to compare language models across different settings, showing that model behavior stabilizes early despite ongoing weight changes.

Details

Motivation: Current methods for comparing language models lack a unified framework that works across heterogeneous settings like different model sizes, training stages, quantization levels, and fine-tuning approaches. There's a need for a consistent way to measure model similarity and track learning trajectories.

Method: Extends log-likelihood vectors to create a common comparison space for language models. Uses KL divergence as a consistent scale across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analyzes Pythia pretraining trajectories to study learning dynamics.

Result: Changes in log-likelihood space are much smaller than in weight space, revealing subdiffusive learning trajectories. Language model behavior stabilizes early in training despite continued weight drift, showing that most meaningful learning happens early.

Conclusion: Log-likelihood vectors provide a powerful unified framework for comparing language models across diverse settings. The findings reveal that model behavior stabilizes quickly during training, with weight changes becoming less meaningful over time, which has implications for training efficiency and model comparison methodologies.

Abstract: Log-likelihood vectors define a common space for comparing language models as probability distributions, enabling unified comparisons across heterogeneous settings. We extend this framework to training checkpoints and intermediate layers, and establish a consistent scale for KL divergence across pretraining, model size, random seeds, quantization, fine-tuning, and layers. Analysis of Pythia pretraining trajectories further shows that changes in log-likelihood space are much smaller than in weight space, resulting in subdiffusive learning trajectories and early stabilization of language-model behavior despite weight drift.

[104] OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models

Yıldırım Özen, Burak Erinç Çetin, Kaan Engür, Elif Naz Demiryılmaz, Cagri Toraman

Main category: cs.CL

TL;DR: Broad ethical evaluation of 29 open-source LLMs across 4 ethical dimensions (robustness, reliability, safety, fairness) in English and Turkish, showing strong safety/fairness but reliability concerns, cross-linguistic consistency, and jailbreak ineffectiveness.

Details

Motivation: Existing ethical studies of LLMs have limitations: narrow focus, lack of language diversity, and evaluation of restricted model sets. Need comprehensive ethical assessment covering multiple dimensions across diverse languages and models.

Method: Evaluated 29 recent open-source LLMs using novel dataset assessing four ethical dimensions. Used LLM-as-a-Judge methodology. Included both high-resource (English) and low-resource (Turkish) languages for comprehensive assessment.

Result: Many open-source models show strong performance in safety, fairness, and robustness, but reliability remains a key concern. Ethical evaluation shows cross-linguistic consistency. Larger models generally exhibit better ethical performance. Jailbreak templates ineffective for most models.

Conclusion: Comprehensive ethical evaluation provides guide for safer model development. Open-source models demonstrate ethical strengths but reliability needs improvement. Cross-linguistic consistency suggests ethical principles transfer across languages. Materials shared publicly for reproducibility.

Abstract: Generative large language models present significant potential but also raise critical ethical concerns, including issues of safety, fairness, robustness, and reliability. Most existing ethical studies, however, are limited by their narrow focus, a lack of language diversity, and an evaluation of a restricted set of models. To address these gaps, we present a broad ethical evaluation of 29 recent open-source LLMs using a novel dataset that assesses four key ethical dimensions: robustness, reliability, safety, and fairness. Our analysis includes both a high-resource language, English, and a low-resource language, Turkish, providing a comprehensive assessment and a guide for safer model development. Using an LLM-as-a-Judge methodology, our experimental results indicate that many open-source models demonstrate strong performance in safety, fairness, and robustness, while reliability remains a key concern. Ethical evaluation shows cross-linguistic consistency, and larger models generally exhibit better ethical performance. We also show that jailbreak templates are ineffective for most of the open-source models examined in this study. We share all materials including data and scripts at https://github.com/metunlp/openethics

[105] Faithfulness-Aware Uncertainty Quantification for Fact-Checking the Output of Retrieval Augmented Generation

Ekaterina Fadeeva, Aleksandr Rubashevskii, Dzianis Piatrashyn, Roman Vashurin, Shehzaad Dhuliawala, Artem Shelmanov, Timothy Baldwin, Preslav Nakov, Mrinmaya Sachan, Maxim Panov

Main category: cs.CL

TL;DR: FRANQ is a new method for detecting hallucinations in RAG outputs that uses different uncertainty quantification techniques based on whether statements are faithful to retrieved context, achieving better factual error detection than existing approaches.

Details

Motivation: RAG systems are prone to hallucinations, but existing approaches conflate factuality with faithfulness to retrieved evidence, incorrectly labeling factually correct statements as hallucinations if not explicitly supported by retrieval.

Method: FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. The method uses a new long-form QA dataset annotated for both factuality and faithfulness.

Result: Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.

Conclusion: FRANQ provides a more nuanced approach to hallucination detection in RAG systems by separating factuality assessment from faithfulness to retrieved context, addressing limitations of existing methods.

Abstract: Large Language Models (LLMs) enhanced with retrieval, an approach known as Retrieval-Augmented Generation (RAG), have achieved strong performance in open-domain question answering. However, RAG remains prone to hallucinations: factually incorrect outputs may arise from inaccuracies in the model’s internal knowledge and the retrieved context. Existing approaches to mitigating hallucinations often conflate factuality with faithfulness to the retrieved evidence, incorrectly labeling factually correct statements as hallucinations if they are not explicitly supported by the retrieval. In this paper, we introduce FRANQ, a new method for hallucination detection in RAG outputs. FRANQ applies distinct uncertainty quantification (UQ) techniques to estimate factuality, conditioning on whether a statement is faithful to the retrieved context. To evaluate FRANQ and competing UQ methods, we construct a new long-form question answering dataset annotated for both factuality and faithfulness, combining automated labeling with manual validation of challenging cases. Extensive experiments across multiple datasets, tasks, and LLMs show that FRANQ achieves more accurate detection of factual errors in RAG-generated responses compared to existing approaches.

[106] AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

Feng Luo, Yu-Neng Chuang, Guanchu Wang, Hoang Anh Duy Le, Shaochen Zhong, Hongyi Liu, Jiayi Yuan, Yang Sui, Vladimir Braverman, Vipin Chaudhary, Xia Hu

Main category: cs.CL

TL;DR: AutoL2S is a distillation framework that teaches non-reasoning LLMs to generate concise reasoning for simple inputs while maintaining thorough reasoning for complex ones, reducing inference costs by up to 71% with minimal accuracy loss.

Details

Motivation: Current reasoning-capable LLMs often generate unnecessarily long chain-of-thought reasoning even for simple inputs, leading to high inference costs. Simply shortening reasoning length can degrade accuracy since concise reasoning may be insufficient for complex inputs and lacks proper supervision.

Method: AutoL2S first learns a lightweight switching token using verified long-short CoTs to enable instance-wise reasoning selection. It then uses long-short reasoning rollouts induced by the switching token in a GRPO-style loss to improve reasoning efficiency while maintaining accuracy.

Result: AutoL2S effectively reduces reasoning length up to 71% with minimal accuracy loss, achieving a significantly better trade-off between token length/inference time and accuracy preservation.

Conclusion: The proposed AutoL2S framework successfully enables LLMs to think thoroughly only when necessary, providing an efficient solution to the overthinking problem in distilled reasoning models while maintaining reasoning accuracy.

Abstract: Reasoning-capable large language models (LLMs) achieve strong performance on complex tasks but often exhibit overthinking after distillation, generating unnecessarily long chain-of-thought (CoT) reasoning even for simple inputs and incurring high inference cost. However, naively shortening reasoning length can degrade reasoning accuracy, as concise reasoning may be insufficient for certain inputs and lacks explicit supervision. We propose Auto Long-Short Reasoning (AutoL2S), a distillation framework that empowers non-reasoning LLMs to think thoroughly but only when necessary. AutoL2S first learns a lightweight switching token with verified long-short CoTs to enable instance-wise long-short reasoning selection. Then it leverages long-short reasoning rollouts induced by a switching token in a GRPO-style loss to improve reasoning efficiency while maintaining accuracy. Experiments demonstrate that AutoL2S effectively reduces reasoning length up to 71% with minimal accuracy loss, yielding markedly better trade-off in token length and inference time while preserving accuracy.

[107] Act-Adaptive Margin: Dynamically Calibrating Reward Models for Subjective Ambiguity

Feiteng Fang, Dingwei Chen, Xiang Huang, Ting-En Lin, Yuchuan Wu, Xiong Liu, Xinge Ye, Ziqiang Liu, Haonan Zhang, Liang Zhu, Hamid Alinejad-Rokny, Min Yang, Yongbin Li

Main category: cs.CL

TL;DR: AAM (Act-Adaptive Margin) improves reward modeling for subjective tasks like role-playing by dynamically calibrating preference margins using model’s internal knowledge, enhancing Bradley-Terry models without extra human annotation.

Details

Motivation: Current RL alignment techniques struggle with subjective tasks (e.g., role-playing) because traditional reward modeling using Bradley-Terry models faces challenges with ambiguous preferences in subjective domains.

Method: Proposes AAM (Act-Adaptive Margin) that dynamically calibrates preference margins using model’s internal parameter knowledge. Two versions efficiently generate contextually-appropriate preference gaps without additional human annotation.

Result: AAM improves Bradley-Terry reward models by 2.95% in general tasks and 4.85% in subjective role-playing tasks. When applied to downstream alignment (e.g., GRPO), achieves SOTA results on CharacterEval and Charm benchmarks.

Conclusion: AAM significantly enhances subjective reward modeling by better integrating generative understanding with preference scoring, enabling better performance in subjective tasks where traditional methods struggle.

Abstract: Currently, most reinforcement learning tasks focus on domains like mathematics and programming, where verification is relatively straightforward. However, in subjective tasks such as role-playing, alignment techniques struggle to make progress, primarily because subjective reward modeling using the Bradley-Terry model faces significant challenges when dealing with ambiguous preferences. To improve reward modeling in subjective tasks, this paper proposes AAM (\textbf{\underline{A}}ct-\textbf{\underline{A}}daptive \textbf{\underline{M}}argin), which enhances reward modeling by dynamically calibrating preference margins using the model’s internal parameter knowledge. We design two versions of AAM that efficiently generate contextually-appropriate preference gaps without additional human annotation. This approach fundamentally improves how reward models handle subjective rewards by better integrating generative understanding with preference scoring. To validate AAM’s effectiveness in subjective reward modeling, we conduct evaluations on RewardBench, JudgeBench, and challenging role-playing tasks. Results show that AAM significantly improves subjective reward modeling performance, enhancing Bradley-Terry reward models by 2.95% in general tasks and 4.85% in subjective role-playing tasks. Furthermore, reward models trained with AAM can help downstream alignment tasks achieve better results. Our test results show that applying rewards generated by AAM-Augmented RM to preference learning techniques (e.g., GRPO) achieves state-of-the-art results on CharacterEval and Charm. Code and dataset are available at https://github.com/calubkk/AAM.

[108] FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Zhuohan Xie, Daniil Orel, Rushil Thareja, Dhruv Sahnan, Hachem Madmoun, Fan Zhang, Debopriyo Banerjee, Georgi Georgiev, Xueqing Peng, Lingfei Qian, Jimin Huang, Jinyan Su, Aaryamonvikram Singh, Rui Xing, Rania Elbadry, Chen Xu, Haonan Li, Fajri Koto, Ivan Koychev, Tanmoy Chakraborty, Yuxia Wang, Salem Lahlou, Veselin Stoyanov, Sophia Ananiadou, Preslav Nakov

Main category: cs.CL

TL;DR: FINCHAIN is a new benchmark for verifiable Chain-of-Thought evaluation in finance, addressing the lack of multi-step symbolic reasoning assessment in existing financial QA datasets.

Details

Motivation: Current financial benchmarks like FinQA and ConvFinQA focus on final numerical answers but neglect intermediate reasoning transparency and verification, creating a gap in assessing multi-step symbolic reasoning essential for robust financial analysis.

Method: Created FINCHAIN benchmark spanning 58 topics across 12 financial domains using parameterized symbolic templates with executable Python traces for machine-verifiable reasoning. Proposed CHAINEVAL metric for dynamic alignment evaluation of both final-answer correctness and step-level reasoning consistency.

Result: Evaluation of 26 leading LLMs shows frontier proprietary models have clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap.

Conclusion: FINCHAIN exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI systems.

Abstract: Multi-step symbolic reasoning is essential for robust financial analysis; yet, current benchmarks largely overlook this capability. Existing datasets such as FinQA and ConvFinQA emphasize final numerical answers while neglecting the intermediate reasoning required for transparency and verification. To address this gap, we introduce FINCHAIN, the first benchmark specifically designed for verifiable Chain-of-Thought (CoT) evaluation in finance. FINCHAIN spans 58 topics across 12 financial domains, each represented by parameterized symbolic templates with executable Python traces that enable fully machine-verifiable reasoning and scalable, contamination-free data generation. To assess reasoning capacity, we propose CHAINEVAL, a dynamic alignment measure that jointly evaluates both the final-answer correctness and the step-level reasoning consistency. Our evaluation of 26 leading LLMs reveals that even frontier proprietary LLMs exhibit clear limitations in symbolic financial reasoning, while domain-adapted and math-enhanced fine-tuned models can substantially narrow this gap. Overall, FINCHAIN exposes persistent weaknesses in multi-step financial reasoning and provides a foundation for developing trustworthy, interpretable, and verifiable financial AI.

[109] Evaluating Large Language Models for Zero-Shot Disease Labeling in CT Radiology Reports Across Organ Systems

Michael E. Garcia-Alcoser, Mobina GhojoghNejad, Fakrul Islam Tushar, David Kim, Kyle J. Lafata, Geoffrey D. Rubin, Joseph Y. Lo

Main category: cs.CL

TL;DR: Lightweight LLMs outperform rule-based methods for CT report disease annotation, achieving high agreement with clinical judgment across multiple organ systems using zero-shot prompting.

Details

Motivation: To evaluate and compare the effectiveness of different approaches (rule-based algorithms, specialized models, and lightweight LLMs) for automating disease annotation in CT radiology reports, particularly for multi-disease labeling across chest, abdomen, and pelvis CT reports.

Method: Retrospective analysis of 40,833 CAP CT reports from 29,540 patients, with 1,789 reports manually annotated. Compared rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs (including Llama-3.1 8B and Gemma-3 27B) using zero-shot prompting. External validation with CT RATE dataset. Performance evaluated using Cohen’s Kappa and micro/macro-averaged F1 scores.

Result: Llama-3.1 8B and Gemma-3 27B showed highest agreement (κ median: 0.87) in internal test set. Gemma-3 27B achieved top macro-F1 (0.82) on manually annotated set, followed by Llama-3.1 8B (0.79), while RBA scored lowest (0.64). On CT RATE dataset, Llama-3.1 8B performed best (0.91), with Gemma-3 27B close behind (0.89). Performance differences mainly due to labeling practices, especially for subjective labels like atelectasis.

Conclusion: Lightweight LLMs outperform rule-based methods for CT report annotation and generalize well across organ systems with zero-shot prompting. However, binary labels alone cannot capture the full nuance of report language. LLMs provide a flexible, efficient solution aligned with clinical judgment and user needs.

Abstract: Purpose: This study aims to evaluate the effectiveness of large language models (LLMs) in automating disease annotation of CT radiology reports. We compare a rule-based algorithm (RBA), RadBERT, and three lightweight open-weight LLMs for multi-disease labeling of chest, abdomen, and pelvis (CAP) CT reports. Materials and Methods: This retrospective study analyzed 40,833 chest-abdomen-pelvis (CAP) CT reports from 29,540 patients, with 1,789 reports manually annotated across three organ systems. External validation was conducted using the CT RATE dataset. Three open-weight LLMs were tested with zero-shot prompting. Performance was evaluated using Cohen’s Kappa ($κ$) and micro/macro-averaged F1 scores. Results: In the internal test set of 12,197 CAP reports from 8,854 patients, Llama-3.1 8B and Gemma-3 27B showed the highest agreement ($κ$ median: 0.87). On the manually annotated set, Gemma-3 27B achieved the top macro-F1 (0.82), followed by Llama-3.1 8B (0.79), while the RBA scored lowest (0.64). On the CT RATE dataset (lungs/pleura labels only), Llama-3.1 8B performed best (0.91), with Gemma-3 27B close behind (0.89). Performance differences were mainly due to differing labeling practices, especially for labels with high subjectivity such as atelectasis. Conclusion: Lightweight LLMs outperform rule-based methods for CT report annotation and generalize across organ systems with zero-shot prompting. However, binary labels alone cannot capture the full nuance of report language. LLMs can provide a flexible, efficient solution aligned with clinical judgment and user needs.

Arkadiusz Modzelewski, Witold Sosnowski, Tiziano Labruna, Adam Wierzbicki, Giovanni Da San Martino

Main category: cs.CL

TL;DR: PCoT (Persuasion-Augmented Chain of Thought) improves zero-shot disinformation detection by 15% on average across 5 LLMs and 5 datasets, using persuasion knowledge inspired by psychological studies.

Details

Motivation: Psychological studies show knowledge of persuasive fallacies helps humans detect disinformation. The authors want to test if infusing this persuasion knowledge into LLMs can similarly enhance their disinformation detection capabilities.

Method: Developed PCoT (Persuasion-Augmented Chain of Thought), a novel approach that leverages persuasion knowledge to improve zero-shot disinformation detection. Created two new datasets (EUDisinfo and MultiDis) with content published after LLMs’ knowledge cutoffs for proper evaluation.

Result: PCoT outperforms competitive methods by 15% on average across five LLMs and five datasets. The approach demonstrates significant improvement in zero-shot disinformation detection for both online news and social media posts.

Conclusion: Persuasion knowledge significantly strengthens zero-shot disinformation detection in LLMs, validating the psychological insight that understanding persuasive fallacies aids in identifying disinformation.

Abstract: Disinformation detection is a key aspect of media literacy. Psychological studies have shown that knowledge of persuasive fallacies helps individuals detect disinformation. Inspired by these findings, we experimented with large language models (LLMs) to test whether infusing persuasion knowledge enhances disinformation detection. As a result, we introduce the Persuasion-Augmented Chain of Thought (PCoT), a novel approach that leverages persuasion to improve disinformation detection in zero-shot classification. We extensively evaluate PCoT on online news and social media posts. Moreover, we publish two novel, up-to-date disinformation datasets: EUDisinfo and MultiDis. These datasets enable the evaluation of PCoT on content entirely unseen by the LLMs used in our experiments, as the content was published after the models’ knowledge cutoffs. We show that, on average, PCoT outperforms competitive methods by 15% across five LLMs and five datasets. These findings highlight the value of persuasion in strengthening zero-shot disinformation detection.

[111] Revisiting Chain-of-Thought Prompting: Zero-shot Can Be Stronger than Few-shot

Xiang Cheng, Chengyan Pan, Minjun Zhao, Deyang Li, Fangchao Liu, Xinyu Zhang, Xiao Zhang, Yong Liu

Main category: cs.CL

TL;DR: Recent strong LLMs like Qwen2.5 don’t benefit from traditional or enhanced CoT exemplars in math reasoning - they ignore exemplars and focus on instructions, making ICL+CoT ineffective for these models.

Details

Motivation: To investigate whether Chain-of-Thought exemplars still benefit recent, stronger LLMs in mathematical reasoning tasks, given continuous model advancement and unclear effectiveness of traditional ICL+CoT approaches.

Method: Systematic experiments with recent strong models (Qwen2.5 series) comparing traditional CoT exemplars vs Zero-Shot CoT, plus investigation of enhanced CoT exemplars constructed using answers from advanced models (Qwen2.5-Max and DeepSeek-R1).

Result: Traditional CoT exemplars don’t improve reasoning performance compared to Zero-Shot CoT for recent strong models; their main function is output format alignment. Enhanced CoT exemplars also fail to improve reasoning performance. Models ignore exemplars and focus primarily on instructions.

Conclusion: Current ICL+CoT framework has limitations in mathematical reasoning for advanced models, calling for re-examination of ICL paradigm and exemplar definition as models evolve beyond traditional prompting techniques.

Abstract: In-Context Learning (ICL) is an essential emergent ability of Large Language Models (LLMs), and recent studies introduce Chain-of-Thought (CoT) to exemplars of ICL to enhance the reasoning capability, especially in mathematics tasks. However, given the continuous advancement of model capabilities, it remains unclear whether CoT exemplars still benefit recent, stronger models in such tasks. Through systematic experiments, we find that for recent strong models such as the Qwen2.5 series, adding traditional CoT exemplars does not improve reasoning performance compared to Zero-Shot CoT. Instead, their primary function is to align the output format with human expectations. We further investigate the effectiveness of enhanced CoT exemplars, constructed using answers from advanced models such as \texttt{Qwen2.5-Max} and \texttt{DeepSeek-R1}. Experimental results indicate that these enhanced exemplars still fail to improve the model’s reasoning performance. Further analysis reveals that models tend to ignore the exemplars and focus primarily on the instructions, leading to no observable gain in reasoning ability. Overall, our findings highlight the limitations of the current ICL+CoT framework in mathematical reasoning, calling for a re-examination of the ICL paradigm and the definition of exemplars.

[112] Instruction Tuning with and without Context: Behavioral Shifts and Downstream Impact

Hyunji Lee, Seunghyun Yoon, Yunjae Won, Hanseok Oh, Geewook Kim, Trung Bui, Franck Dernoncourt, Elias Stengel-Eskin, Mohit Bansal, Minjoon Seo

Main category: cs.CL

TL;DR: Instruction tuning with context-augmented vs. context-free data has distinct effects: context-augmented training improves grounding and reduces reliance on parametric knowledge, while using separate specialized models yields better performance than mixed training.

Details

Motivation: Prior work has combined context-augmented and context-free examples in instruction tuning datasets without examining their distinct effects on model behavior and performance.

Method: Investigate how training LLMs with or without context affects model behavior by: 1) analyzing text domain performance and knowledge usage patterns, 2) testing context-augmented LLMs as backbones for vision-language models, and 3) exploring deployment strategies with separate context-augmented and context-free models.

Result: 1) Context-augmented training improves grounding and shifts knowledge usage from parametric to contextual knowledge. 2) Using context-augmented LLMs as backbones reduces hallucination and improves visual grounding. 3) Maintaining separate specialized models with input routing yields more robust performance than mixed training.

Conclusion: Context-augmented and context-free training have complementary strengths, and deploying separate specialized models with intelligent routing provides better overall performance than training a single mixed model, especially in real-world scenarios with varying context availability.

Abstract: Instruction tuning is a widely used approach to improve the instruction-following ability of large language models (LLMs). Instruction-tuning datasets typically include a mixture of context-augmented and context-free examples, yet prior work has largely combined these data types without examining their distinct effects. In this paper, we investigate how training LLMs with or without context affects model behavior and downstream performance. First, in the text domain, we show that LLMs trained with context attend more strongly to the provided knowledge, achieving better grounding. We also observe that context-augmented training shifts how LLMs use knowledge: models store and leverage less on parametric knowledge and instead depend more on the provided context. Second, we observe that using LLM trained with context-augmented data as the backbone for vision-language models reduces hallucination and improves grounding in the visual domain. Finally, we explore practical strategies for real-world deployments where context availability varies. We show that maintaining separate context-augmented and context-free models and routing inputs between them yields more robust overall performance than training a single mixed model, as it better preserves their complementary strengths.

[113] Reverse Language Model

Xunjian Yin, Sitao Cheng, Yuxi Xie, Xinyu Hu, Li Lin, Xinyi Wang, Liangming Pan, William Yang Wang, Xiaojun Wan

Main category: cs.CL

TL;DR: LEDOM is the first purely reverse language model trained autoregressively on 435B tokens, processing sequences in reverse temporal order. It demonstrates unique backward reasoning capabilities and enables novel applications like Reverse Reward for improving mathematical reasoning tasks.

Details

Motivation: To explore reverse language modeling as a potential foundational approach, investigating whether processing sequences in reverse temporal order can provide unique capabilities and insights compared to traditional forward language models.

Method: Trained autoregressively on 435B tokens with 2B and 7B parameter variants using previous token prediction (reverse temporal order processing). Introduced Reverse Reward application where LEDOM-guided reranking of forward language model outputs improves mathematical reasoning.

Result: LEDOM exhibits unique characteristics and backward reasoning capabilities. The Reverse Reward approach leads to substantial performance improvements on mathematical reasoning tasks by leveraging LEDOM’s posterior evaluation to refine forward model outputs.

Conclusion: Reverse language models like LEDOM show broad application potential as foundational models. The Reverse Reward application demonstrates practical utility, and the authors will release all models, code, and data to support future research in this direction.

Abstract: We introduce LEDOM, the first purely reverse language model, trained autoregressively on 435B tokens with 2B and 7B parameter variants, which processes sequences in reverse temporal order through previous token prediction. For the first time, we present the reverse language model as a potential foundational model across general tasks, accompanied by a set of intriguing examples and insights. Based on LEDOM, we further introduce a novel application: Reverse Reward, where LEDOM-guided reranking of forward language model outputs leads to substantial performance improvements on mathematical reasoning tasks. This approach leverages LEDOM’s unique backward reasoning capability to refine generation quality through posterior evaluation. Our findings suggest that LEDOM exhibits unique characteristics with broad application potential. We will release all models, training code, and pre-training data to facilitate future research.

[114] On the robustness of modeling grounded word learning through a child’s egocentric input

Wai Keen Vong, Brenden M. Lake

Main category: cs.CL

TL;DR: Multimodal neural networks trained on child-like visual and linguistic input (61+ hours per child) can learn word-referent mappings, demonstrating robustness across children and contexts while revealing individual learning differences.

Details

Motivation: To bridge the gap between machine learning models (trained on massive datasets) and human children (who learn language from limited input), and to test whether previous single-child findings generalize across multiple children's developmental experiences.

Method: Applied automated speech transcription to the entire SAYCam dataset (500+ hours of video from 3 children), created multimodal vision-language datasets, and trained various neural network configurations to simulate word learning from child-like input.

Result: Networks trained on automatically transcribed data from each child successfully acquired word-referent mappings, generalizing across videos, children, and image domains, while showing individual differences in learning patterns.

Conclusion: Multimodal neural networks demonstrate robust word learning capabilities from child-like input, validating this approach for studying language acquisition while highlighting the importance of individual developmental experiences.

Abstract: What insights can machine learning bring to understanding human language acquisition? Large language and multimodal models have achieved remarkable capabilities, but their reliance on massive training datasets creates a fundamental mismatch with children, who succeed in acquiring language from comparatively limited input. To help bridge this gap, researchers have increasingly trained neural networks using data similar in quantity and quality to children’s input. Taking this approach to the limit, Vong et al. (2024) showed that a multimodal neural network trained on 61 hours of visual and linguistic input extracted from just one child’s developmental experience could acquire word-referent mappings. However, whether this approach’s success reflects the idiosyncrasies of a single child’s experience, or whether it would show consistent and robust learning patterns across multiple children’s experiences was not explored. In this article, we applied automated speech transcription methods to the entirety of the SAYCam dataset, consisting of over 500 hours of video data spread across all three children. Using these automated transcriptions, we generated multi-modal vision-and-language datasets for both training and evaluation, and explored a range of neural network configurations to examine the robustness of simulated word learning. Our findings demonstrate that networks trained on automatically transcribed data from each child can acquire word-referent mappings, generalizing across videos, children, and image domains. These results validate the robustness of multimodal neural networks for grounded word learning, while highlighting the individual differences that emerge in how models learn when trained on each child’s developmental experiences.

[115] TeSent: A Benchmark Dataset for Fairness-aware Explainable Sentiment Classification in Telugu

Vallabhaneni Raj Kumar, Ashwin S, Supriya Manna, Niladri Sett, Cheedella V S N M S Hema Harshitha, Kurakula Harshitha, Anand Kumar Sharma, Basina Deepakraj, Tanuj Sarkar, Bondada Navaneeth Krishna, Samanthapudi Shakeer

Main category: cs.CL

TL;DR: TeSent is a comprehensive Telugu sentiment classification benchmark with 21,119 annotated sentences, including human rationales and fairness evaluation tools.

Details

Motivation: Telugu, a major Dravidian language with 96 million speakers, lacks high-quality annotated resources for NLP/ML tasks, particularly for sentiment classification with modern requirements like explainability and fairness.

Method: Scraped Telugu texts from social media, news, and blogs; built custom annotation platform; collected 21,119 sentences with ground truth labels and human rationales; fine-tuned SOTA models with/without rationales; developed evaluation suites for explainability (plausibility/faithfulness) and fairness (TeEEC corpus).

Result: Training with human rationales improves model accuracy and alignment with human reasoning, but does not necessarily reduce bias in sentiment classification models.

Conclusion: TeSent provides a comprehensive benchmark for Telugu sentiment classification that addresses modern ML requirements (explainability, fairness) and shows that rationale-based training improves performance but not bias reduction.

Abstract: In the Indian subcontinent, Telugu, one of India’s six classical languages, is the most widely spoken Dravidian Language. Despite its 96 million speaker base worldwide, Telugu remains underrepresented in the global NLP and Machine Learning landscape, mainly due to lack of high-quality annotated resources. This work introduces TeSent, a comprehensive benchmark dataset for sentiment classification, a key text classification problem, in Telugu. TeSent not only provides ground truth labels for the sentences, but also supplements with provisions for evaluating explainability and fairness, two critical requirements in modern-day machine learning tasks. We scraped Telugu texts covering multiple domains from various social media platforms, news websites and web-blogs to preprocess and generate 21,119 sentences, and developed a custom-built annotation platform and a carefully crafted annotation protocol for collecting the ground truth labels along with their human-annotated rationales. We then fine-tuned several SOTA pre-trained models in two ways: with rationales, and without rationales. Further, we provide a detailed plausibility and faithfulness evaluation suite, which exploits the rationales, for six widely used post-hoc explainers applied on the trained models. Lastly, we curate TeEEC, Equity Evaluation Corpus in Telugu, a corpus to evaluate fairness of Telugu sentiment and emotion related NLP tasks, and provide a fairness evaluation suite for the trained classifier models. Our experimental results suggest that training with human rationales improves model accuracy and models’ alignment with human reasoning, but does not necessarily reduce bias.

[116] Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling

Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu

Main category: cs.CL

TL;DR: Hi-Guard is a multimodal content moderation framework that uses hierarchical classification and policy-aligned decision making to improve accuracy and interpretability.

Details

Motivation: Current moderation systems rely on noisy label-driven learning, lack alignment with moderation rules, and produce opaque decisions that hinder human review, making them inadequate for ensuring safety and compliance at scale.

Method: Uses a two-stage hierarchical pipeline: (1) lightweight binary model filters safe content, then (2) stronger model performs path-based classification over hierarchical taxonomy. Incorporates rule definitions into model prompts and optimizes with Group Relative Policy Optimization (GRPO) with multi-level soft-margin reward.

Result: Extensive experiments and real-world deployment demonstrate superior classification accuracy, generalization, and interpretability compared to existing approaches.

Conclusion: Hi-Guard paves the way toward scalable, transparent, and trustworthy content safety systems by aligning with moderation policies and providing interpretable decisions.

Abstract: Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. Therefore, we propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term “Hierarchical” reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, where the model performs path-based classification over a hierarchical taxonomy ranging from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.

[117] Latent Fusion Jailbreak: Blending Harmful and Harmless Representations to Elicit Unsafe LLM Outputs

Wenpeng Xing, Mohan Li, Chunqiang Hu, Haitao Xu, Ningyu Zhang, Bo Lin, Meng Han

Main category: cs.CL

TL;DR: LFJ is a stealthy white-box jailbreak attack that fuses harmful and benign query hidden states in latent space, achieving high attack success rates while avoiding detectable artifacts.

Details

Motivation: Existing jailbreak attacks (like GCG) suffer from high computational costs and generate high-perplexity prompts that are easily blocked by filters. There's a need for more stealthy and efficient attacks that operate in continuous latent space rather than discrete input optimization.

Method: LFJ constructs adversarial representations by mathematically fusing hidden states of harmful queries with thematically similar benign queries in the continuous latent space. It uses gradient-guided optimization to balance attack success and computational efficiency.

Result: LFJ achieves average Attack Success Rate (ASR) of 94.01% across multiple models (Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, Mistral-7B-Instruct), significantly outperforming baselines like GCG and AutoDAN while avoiding detectable input artifacts.

Conclusion: Thematic similarity in latent space is a critical vulnerability in current safety alignments. The paper proposes latent adversarial training defense that reduces LFJ’s ASR by over 80% without compromising model utility.

Abstract: While Large Language Models (LLMs) have achieved remarkable progress, they remain vulnerable to jailbreak attacks. Existing methods, primarily relying on discrete input optimization (e.g., GCG), often suffer from high computational costs and generate high-perplexity prompts that are easily blocked by simple filters. To overcome these limitations, we propose Latent Fusion Jailbreak (LFJ), a stealthy white-box attack that operates in the continuous latent space. Unlike previous approaches, LFJ constructs adversarial representations by mathematically fusing the hidden states of a harmful query with a thematically similar benign query, effectively masking malicious intent while retaining semantic drive. We further introduce a gradient-guided optimization strategy to balance attack success and computational efficiency. Extensive evaluations on Vicuna-7B, LLaMA-2-7B-Chat, Guanaco-7B, LLaMA-3-70B, and Mistral-7B-Instruct show that LFJ achieves an average Attack Success Rate (ASR) of 94.01%, significantly outperforming state-of-the-art baselines like GCG and AutoDAN while avoiding detectable input artifacts. Furthermore, we identify that thematic similarity in the latent space is a critical vulnerability in current safety alignments. Finally, we propose a latent adversarial training defense that reduces LFJ’s ASR by over 80% without compromising model utility.

[118] Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training

Jianfeng Si, Lin Sun, Zhewen Tan, Xiangzheng Zhang

Main category: cs.CL

TL;DR: A unified co-training framework enables dynamic safety behavior switching in LLMs via system instructions, achieving superior safety performance with reduced complexity.

Details

Motivation: Current LLM safety methods (SFT, RLHF) require multi-stage training pipelines and lack fine-grained, post-deployment controllability, limiting their flexibility and efficiency.

Method: A unified co-training framework integrates three safety behaviors (positive, negative, rejective) in single SFT stage, activated via system-level instructions/magic tokens for dynamic switching at inference.

Result: The method matches SFT+DPO safety quality, with 8B model surpassing DeepSeek-R1 (671B) in safety performance while reducing training complexity and deployment costs.

Conclusion: This work presents a scalable, efficient, and highly controllable solution for LLM content safety with unprecedented fine-grained control through Safety Alignment Margin.

Abstract: Current methods for content safety in Large Language Models (LLMs), such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that efficiently integrates multiple safety behaviors: positive (lawful/prosocial), negative (unfiltered/risk-prone) and rejective (refusal-oriented/conservative) within a single SFT stage. Notably, each behavior is dynamically activated via a simple system-level instruction, or magic token, enabling stealthy and efficient behavioral switching at inference time. This flexibility supports diverse deployment scenarios, such as positive for safe user interaction, negative for internal red-teaming, and rejective for context-aware refusals triggered by upstream moderation signals. This co-training strategy induces a distinct Safety Alignment Margin in the output space, characterized by well-separated response distributions corresponding to each safety mode. The existence of this margin provides empirical evidence for the model’s safety robustness and enables unprecedented fine-grained control. Experiments show that our method matches the safety alignment quality of SFT+DPO, with our 8B model notably surpassing DeepSeek-R1 (671B) in safety performance, while significantly reducing both training complexity and deployment costs. This work presents a scalable, efficient, and highly controllable solution for LLM content safety.

[119] SurGE: A Benchmark and Evaluation Framework for Scientific Survey Generation

Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu

Main category: cs.CL

TL;DR: SurGE is a new benchmark for evaluating automated scientific survey generation, addressing the lack of standardized evaluation in this area.

Details

Motivation: Manual creation of scientific surveys is becoming infeasible due to rapid growth of academic literature, and while LLMs show promise, progress is hindered by absence of standardized benchmarks and evaluation protocols.

Method: Introduces SurGE benchmark with (1) test instances (topic descriptions, expert-written surveys, and cited references) and (2) large-scale academic corpus of 1M+ papers, plus an automated evaluation framework measuring four quality dimensions.

Result: Evaluation of diverse LLM-based methods shows significant performance gap, revealing even advanced agentic frameworks struggle with survey generation complexities.

Conclusion: SurGE bridges critical gap in survey generation evaluation, highlighting need for future research and providing open-source code, data, and models.

Abstract: The rapid growth of academic literature makes the manual creation of scientific surveys increasingly infeasible. While large language models show promise for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To bridge this critical gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for scientific survey generation in computer science. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers. In addition, we propose an automated evaluation framework that measures the quality of generated surveys across four dimensions: comprehensiveness, citation accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based methods demonstrates a significant performance gap, revealing that even advanced agentic frameworks struggle with the complexities of survey generation and highlighting the need for future research in this area. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE

[120] Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities

Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh

Main category: cs.CL

TL;DR: This survey paper provides the first comprehensive analysis of code-switching (CSW) in large language models, reviewing 324 studies across multiple research areas, tasks, datasets, and languages to address LLMs’ limitations with mixed-language inputs.

Details

Motivation: Most LLMs struggle with mixed-language inputs due to limited code-switching datasets and evaluation biases, hindering their deployment in multilingual societies where code-switching is common.

Method: Comprehensive survey methodology reviewing 324 studies across five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages, categorizing advances by architecture, training strategy, and evaluation methodology.

Result: The survey outlines how LLMs have reshaped CSW modeling and identifies persistent challenges, providing a structured analysis of the current state of code-switching research in LLMs.

Conclusion: The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual capabilities in LLMs.

Abstract: Amidst the rapid advances of large language models (LLMs), most LLMs still struggle with mixed-language inputs, limited Codeswitching (CSW) datasets, and evaluation biases, which hinder their deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing 324 studies spanning five research areas, 15+ NLP tasks, 30+ datasets, and 80+ languages. We categorize recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and identifying the challenges that persist. The paper concludes with a roadmap that emphasizes the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual capabilities https://github.com/lingo-iitgn/awesome-code-mixing/.

[121] Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning

Xue Zhang, Yunlong Liang, Fandong Meng, Songming Zhang, Kaiyu Huang, Yufeng Chen, Jinan Xu, Jie Zhou

Main category: cs.CL

TL;DR: M-Thinker addresses language inconsistency and poor reasoning in non-English LRMs using GRPO with language consistency and cross-lingual thinking alignment rewards.

Details

Motivation: Current Large Reasoning Models struggle with language consistency and perform poorly on non-English reasoning tasks compared to English, compromising interpretability and hindering global deployment.

Method: Proposes M-Thinker trained with GRPO algorithm featuring Language Consistency reward for input-thought-answer consistency and Cross-lingual Thinking Alignment reward to transfer reasoning capabilities from English to other languages.

Result: M-Thinker models achieve nearly 100% language consistency, superior performance on multilingual benchmarks (MMATH and PolyMath), and excellent generalization to out-of-domain languages.

Conclusion: M-Thinker effectively addresses non-English reasoning limitations in LRMs through language consistency enforcement and cross-lingual reasoning alignment, enabling better global deployment.

Abstract: Large Reasoning Models (LRMs) have achieved remarkable performance on complex reasoning tasks by adopting the ``think-then-answer’’ paradigm, which enhances both accuracy and interpretability. However, current LRMs exhibit two critical limitations when processing non-English languages: (1) They often struggle to maintain input-output language consistency; (2) They generally perform poorly with wrong reasoning paths and lower answer accuracy compared to English. These limitations significantly compromise the interpretability of reasoning processes and degrade the user experience for non-English speakers, hindering the global deployment of LRMs. To address these limitations, we propose M-Thinker, which is trained by the GRPO algorithm that involves a Language Consistency (LC) reward and a novel Cross-lingual Thinking Alignment (CTA) reward. Specifically, the LC reward defines a strict constraint on the language consistency between the input, thought, and answer. Besides, the CTA reward compares the model’s non-English reasoning paths with its English reasoning path to transfer its own reasoning capability from English to non-English languages. Through an iterative RL procedure, our M-Thinker-1.5B/4B/7B models not only achieve nearly 100% language consistency and superior performance on two multilingual benchmarks (MMATH and PolyMath), but also exhibit excellent generalization on out-of-domain languages.

[122] Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models

Yusheng Song, Lirong Qiu, Xi Zhang, Zhihao Tang

Main category: cs.CL

TL;DR: A unified framework that bridges the gap between internal state probing and chain-of-thought verification for detecting sophisticated hallucinations in LLMs, overcoming signal scarcity and representational alignment barriers.

Details

Motivation: Current hallucination detection methods suffer from a "Detection Dilemma": internal state probing works well for factual inconsistencies but fails on logical fallacies, while chain-of-thought verification shows the opposite behavior, creating task-dependent blind spots.

Method: Introduces a multi-path reasoning mechanism to obtain comparable fine-grained signals, and a segment-aware temporalized cross-attention module to adaptively fuse aligned representations, pinpointing subtle dissonances between reasoning and internal states.

Result: Extensive experiments on three diverse benchmarks and two leading LLMs demonstrate that the framework consistently and significantly outperforms strong baselines in hallucination detection.

Conclusion: The proposed unified framework successfully bridges the critical gap between internal state probing and chain-of-thought verification, overcoming fundamental challenges of signal scarcity and representational alignment for comprehensive hallucination detection.

Abstract: The detection of sophisticated hallucinations in Large Language Models (LLMs) is hampered by a ``Detection Dilemma’’: methods probing internal states (Internal State Probing) excel at identifying factual inconsistencies but fail on logical fallacies, while those verifying externalized reasoning (Chain-of-Thought Verification) show the opposite behavior. This schism creates a task-dependent blind spot: Chain-of-Thought Verification fails on fact-intensive tasks like open-domain QA where reasoning is ungrounded, while Internal State Probing is ineffective on logic-intensive tasks like mathematical reasoning where models are confidently wrong. We resolve this with a unified framework that bridges this critical gap. However, unification is hindered by two fundamental challenges: the Signal Scarcity Barrier, as coarse symbolic reasoning chains lack signals directly comparable to fine-grained internal states, and the Representational Alignment Barrier, a deep-seated mismatch between their underlying semantic spaces. To overcome these, we introduce a multi-path reasoning mechanism to obtain more comparable, fine-grained signals, and a segment-aware temporalized cross-attention module to adaptively fuse these now-aligned representations, pinpointing subtle dissonances. Extensive experiments on three diverse benchmarks and two leading LLMs demonstrate that our framework consistently and significantly outperforms strong baselines. Our code is available: https://github.com/peach918/HalluDet.

[123] Proverbs or Pythian Oracles? Sentiments and Emotions in Greek Sayings

Katerina Korre, John Pavlopoulos

Main category: cs.CL

TL;DR: This paper analyzes Greek proverbs using NLP to study their sentiment and emotion, creating a multi-label annotation framework, scaling to local varieties, and mapping emotional distributions across Greece.

Details

Motivation: Proverbs represent fascinating cultural wisdom that transcends boundaries, but much global proverb knowledge remains underexplored due to oral traditions. The paper aims to leverage NLP advances to analyze Greek proverbs, which preserve traditional wisdom within their communities.

Method: 1) Developed a multi-label annotation framework and dataset capturing emotional variability of Greek proverbs; 2) Scaled analysis to local varieties; 3) Created a map of Greece showing emotional distribution; 4) Used LLMs to capture and reproduce the complexity of proverb interpretation.

Result: Findings show proverb interpretation is multidimensional, manifested through both multi-labeling and instance-level polarity. LLMs can effectively capture this complexity. In Greece, surprise and anger compete and coexist within proverbs, revealing interesting emotional patterns across regions.

Conclusion: LLMs can help better understand the proverbial landscape of a place, as demonstrated with Greece. The multi-dimensional nature of proverb interpretation can be effectively analyzed using NLP techniques, providing insights into cultural emotional patterns preserved in traditional wisdom.

Abstract: Proverbs are among the most fascinating language phenomena that transcend cultural and linguistic boundaries. Yet, much of the global landscape of proverbs remains underexplored, as many cultures preserve their traditional wisdom within their own communities due to the oral tradition of the phenomenon. Taking advantage of the current advances in Natural Language Processing (NLP), we focus on Greek proverbs, analyzing their sentiment and emotion. Departing from an annotated dataset of Greek proverbs, (1) we propose a multi-label annotation framework and dataset that captures the emotional variability of the proverbs, (2) we up-scale to local varieties, (3) we sketch a map of Greece that provides an overview of the distribution of emotions. Our findings show that the interpretation of proverbs is multidimensional, a property manifested through both multi-labeling and instance-level polarity. LLMs can capture and reproduce this complexity, and can therefore help us better understand the proverbial landscape of a place, as in the case of Greece, where surprise and anger compete and coexist within proverbs.

[124] Qomhra: A Bilingual Irish and English Large Language Model

Joseph McInerney, Khanh-Tung Tran, Liam Lonergan, Ailbhe Ní Chasaide, Neasa Ní Chiaráin, Barry Devereux

Main category: cs.CL

TL;DR: Qomhrá is a bilingual Irish-English LLM developed under low-resource constraints, featuring novel methods for synthesizing human preference data and achieving significant performance gains over existing Irish LLM baselines.

Details

Motivation: LLM research has focused on major languages, leaving low-resource languages like Irish underrepresented. There's a need for scalable methods to create human preference data and develop effective LLMs for Irish language communities.

Method: Complete pipeline with bilingual continued pre-training, instruction tuning, and novel synthesis of human preference data using LLM-generated “accepted” and “rejected” responses validated by L1 Irish speakers. Used Gemini-2.5-Pro for translation and preference data synthesis after evaluating top closed-weight LLMs.

Result: Qomhrá shows gains of up to 29% in Irish and 44% in English compared to existing Irish LLM baseline UCCIX. Gemini-2.5-Pro was ranked highest for Irish generation by speakers, diverging from LLM-as-a-judge ratings, revealing misalignment between current LLMs and Irish-language community.

Conclusion: The framework provides insights and guidance for developing LLMs for Irish and other low-resource languages, demonstrating effective methods for overcoming data scarcity and creating culturally aligned language models.

Abstract: Large language model (LLM) research and development has overwhelmingly focused on the world’s major languages, leading to under-representation of low-resource languages such as Irish. This paper introduces \textbf{Qomhrá}, a bilingual Irish and English LLM, developed under extremely low-resource constraints. A complete pipeline is outlined spanning bilingual continued pre-training, instruction tuning, and the synthesis of human preference data for future alignment training. We focus on the lack of scalable methods to create human preference data by proposing a novel method to synthesise such data by prompting an LLM to generate accepted'' and rejected’’ responses, which we validate as aligning with L1 Irish speakers. To select an LLM for synthesis, we evaluate the top closed-weight LLMs for Irish language generation performance. Gemini-2.5-Pro is ranked highest by L1 and L2 Irish-speakers, diverging from LLM-as-a-judge ratings, indicating a misalignment between current LLMs and the Irish-language community. Subsequently, we leverage Gemini-2.5-Pro to translate a large scale English-language instruction tuning dataset to Irish and to synthesise a first-of-its-kind Irish-language human preference dataset. We comprehensively evaluate Qomhrá across several benchmarks, testing translation, gender understanding, topic identification, and world knowledge; these evaluations show gains of up to 29% in Irish and 44% in English compared to the existing open-source Irish LLM baseline, UCCIX. The results of our framework provide insight and guidance to developing LLMs for both Irish and other low-resource languages.

[125] AgenticMath: Enhancing LLM Reasoning via Agentic-based Math Data Generation

Xianyang Liu, Yilin Liu, Shuai Wang, Hao Cheng, Andrew Estornell, Yuzhi Zhao, Jun Shu, Jiaheng Wei

Main category: cs.CL

TL;DR: AgenticMath is a novel agentic method for generating high-quality mathematical QA pairs to enhance LLM fine-tuning, achieving competitive performance with much smaller datasets than baselines.

Details

Motivation: Current methods for creating reasoning datasets suffer from generating low-quality/incorrect answers and limited information richness from available data sources, creating a need for better dataset generation approaches.

Method: Four-stage agentic method: (1) Seed Question Filter selects high-quality questions, (2) Agentic Question Rephrase uses multi-agent system for diverse paraphrases, (3) Answer Augment rewrites answers using chain-of-thought reasoning, (4) Question and Answer Evaluation retains only superior pairs.

Result: Fine-tuning 3B-8B parameter LLMs on AgenticMath datasets (only 30-60K samples) achieves competitive/superior performance on mathematical reasoning benchmarks compared to baselines trained on much larger datasets (400K-2.3M samples).

Conclusion: Targeted, high-quality data generation is more efficient for improving mathematical reasoning in LLMs than large-scale, low-quality alternatives, demonstrating the value of quality over quantity in dataset creation.

Abstract: The creation of high-quality datasets to improve Large Language Model (LLM) reasoning remains a significant challenge, as current methods often suffer from generating low-quality/incorrect answers and limited information richness from available data sources. To address this, we propose AgenticMath, a novel agentic method for generating high-quality mathematical question-answer pairs to enhance the supervised fine-tuning of LLMs. Our method operates through four stages: (1) Seed Question Filter that selects questions with high information richness, complexity, and clarity; (2) an Agentic Question Rephrase step that employs a multi-agent system to generate diverse, logically consistent paraphrases; (3) an Answer Augment step where rewrite answers using chain-of-thought reasoning to enhance numerical and logical correctness, without reliance on human-provided labels; and (4) a final Question and Answer Evaluation that retains only the most superior pairs. Extensive experiments demonstrate that, fine-tuning 3B-8B parameter LLMs on AgenticMath generated datasets (comprising only 30-60K math samples) achieves competitive or superior performance on diverse in domain and out-of-domain mathematical reasoning benchmarks compared to baselines trained on much more data (e.g., 400K or 2.3M samples). Our work demonstrates that targeted, high-quality data generation is a more efficient path to improving mathematical reasoning in LLMs than large-scale, low-quality alternatives.

[126] Can Confidence Estimates Decide When Chain-of-Thought Is Necessary for LLMs?

Samuel Lewis-Lim, Xingwei Tan, Zhixue Zhao, Nikolaos Aletras

Main category: cs.CL

TL;DR: Confidence-gated Chain-of-Thought prompting uses confidence estimates to decide when extended reasoning is needed, reducing unnecessary token usage while maintaining performance.

Details

Motivation: Chain-of-Thought prompting improves reasoning but increases token usage unnecessarily when reasoning isn't actually needed. The paper aims to optimize compute allocation by only using CoT when necessary.

Method: Proposes confidence-gated CoT where models produce direct answers with confidence estimates to decide whether to invoke CoT. Evaluates four representative confidence measures compared to random gating and oracle upper bound across two model families and diverse reasoning tasks.

Result: Existing training-free confidence measures can reduce redundant reasoning, but individual confidence measures show inconsistent utility across different settings.

Conclusion: The study provides practical guidance for developing and evaluating models that selectively use CoT through an evaluation framework and analysis of confidence signals.

Abstract: Chain-of-thought (CoT) prompting is a common technique for improving the reasoning abilities of large language models (LLMs). However, extended reasoning is often unnecessary and substantially increases token usage. As such, a key question becomes how to optimally allocate compute to when reasoning is actually needed. We study this through confidence-gated CoT, where a model produces a direct answer and a confidence estimate to decide whether to invoke CoT. We present an evaluation framework together with the first systematic study of confidence signals for this decision. We evaluate four representative confidence measures and compare them with random gating and an oracle upper bound. Experiments across two model families and diverse reasoning tasks show that existing training-free confidence measures can reduce redundant reasoning. However, we also find that the utility of individual confidence measures is inconsistent across settings. Through our evaluation framework and analysis, our study provides practical guidance toward developing and evaluating models that selectively use CoT.

[127] IF-CRITIC: Towards a Fine-Grained LLM Critic for Instruction-Following Evaluation

Bosi Wen, Yilin Niu, Cunxiang Wang, Pei Ke, Xiaoying Ling, Ying Zhang, Aohan Zeng, Hongning Wang, Minlie Huang

Main category: cs.CL

TL;DR: IF-CRITIC is an LLM critic for fine-grained, efficient, and reliable instruction-following evaluation that uses constraint checklists and multi-stage filtering to outperform existing LLM-as-a-Judge methods.

Details

Motivation: Existing evaluation models for instruction-following have deficiencies including substantial costs and unreliable assessments, despite numerous attempts to enhance LLMs' instruction-following ability through preference optimization or reinforcement learning.

Method: Develop a checklist generator to decompose instructions and generate constraint checklists, collect high-quality critique training data through multi-stage critique filtering, and employ constraint-level preference optimization to train IF-CRITIC.

Result: IF-CRITIC beats strong LLM-as-a-Judge baselines (including o4-mini and Gemini-3-Pro) in evaluation performance, and enables LLMs to achieve substantial performance gains in instruction-following optimization with lower computational overhead.

Conclusion: IF-CRITIC provides a fine-grained, efficient, and reliable solution for instruction-following evaluation that outperforms existing methods and enables more effective optimization of LLMs’ instruction-following capabilities.

Abstract: Instruction-following is a fundamental ability of Large Language Models (LLMs), requiring their generated outputs to follow multiple constraints imposed in input instructions. Numerous studies have attempted to enhance this ability through preference optimization or reinforcement learning based on reward signals from LLM-as-a-Judge. However, existing evaluation models for instruction-following still possess many deficiencies, such as substantial costs and unreliable assessments. To this end, we propose IF-CRITIC, an LLM critic for fine-grained, efficient, and reliable instruction-following evaluation. We first develop a checklist generator to decompose instructions and generate constraint checklists. With the assistance of the checklists, we collect high-quality critique training data through a multi-stage critique filtering mechanism and employ a constraint-level preference optimization method to train IF-CRITIC. Extensive experiments show that the evaluation performance of IF-CRITIC can beat strong LLM-as-a-Judge baselines, including o4-mini and Gemini-3-Pro. With the reward signals provided by IF-CRITIC, LLMs can achieve substantial performance gains in instruction-following optimization under lower computational overhead compared to strong LLM critic baselines.

[128] One Battle After Another: Probing LLMs’ Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework

Qi Jia, Ye Shen, Xiujie Song, Kaiwei Zhang, Shibo Wang, Dun Pei, Xiangyang Zhu, Guangtao Zhai

Main category: cs.CL

TL;DR: EvolIF is a novel benchmark for evaluating LLMs’ instruction-following ability in multi-topic dialogues using a three-layer tracking mechanism and query synthesis agent, with process-centric metrics based on Flow Theory that terminates only when user patience is exhausted.

Details

Motivation: Existing benchmarks for evaluating LLMs' instruction-following in multi-topic dialogues are limited by fixed turn counts, susceptible to saturation, and fail to account for users' interactive experience, creating a need for more realistic evaluation frameworks.

Method: Proposes a framework with three-layer tracking mechanism and query synthesis agent to mimic sequential user behaviors, grounded in Flow Theory with process-centric metrics, terminating evaluation only upon exhausting user patience. Creates EvolIF benchmark covering 12 constraint groups.

Result: Analysis reveals deficiencies in failure recovery and fine-grained instruction following, with performance stratification increasing with conversational depth. GPT-5 shows most sustained resilience with 66.40% robustness score, outperforming Gemini-3-Pro by 5.59%, while other models lag behind.

Conclusion: The EvolIF benchmark provides a more realistic evaluation of LLMs’ instruction-following in multi-topic dialogues, revealing critical deficiencies in current models and establishing GPT-5 as the most resilient performer in sustained conversational contexts.

Abstract: Evaluating LLMs’ instruction-following ability in multi-topic dialogues is essential yet challenging. Existing benchmarks are limited to a fixed number of turns, susceptible to saturation and failing to account for users’ interactive experience. In this work, we propose a novel framework featuring a three-layer tracking mechanism and a query synthesis agent to mimic sequential user behaviors. Grounded in Flow Theory, we introduce process-centric metrics and terminate a conversational evaluation only upon exhausting user patience. Leveraging this framework, we present EvolIF, an evolving benchmark covering 12 constraint groups. Our analysis reveals deficiencies in failure recovery and fine-grained instruction following, with performance stratification becoming evident as conversational depth increases. GPT-5 demonstrates the most sustained resilience, maintaining a 66.40% robustness score, outperforming Gemini-3-Pro by 5.59%, while other models lag behind. Data and code will be released at https://github.com/JiaQiSJTU/EvolIF.

[129] Black-Box On-Policy Distillation of Large Language Models

Tianzhu Ye, Li Dong, Zewen Chi, Xun Wu, Shaohan Huang, Furu Wei

Main category: cs.CL

TL;DR: GAD enables black-box LLM distillation by framing student as generator and training discriminator to distinguish student/teacher responses, creating minimax game that provides stable adaptive feedback.

Details

Motivation: Black-box distillation currently learns only from teacher's text outputs without internal logits/parameters, limiting effectiveness. Need better approach for on-policy distillation without access to teacher internals.

Method: Generative Adversarial Distillation (GAD) frames student LLM as generator, trains discriminator to distinguish student vs teacher responses. Creates minimax game where discriminator acts as on-policy reward model co-evolving with student.

Result: GAD consistently surpasses sequence-level knowledge distillation. Qwen2.5-14B-Instruct trained with GAD becomes comparable to GPT-5-Chat teacher on LMSYS-Chat evaluation.

Conclusion: GAD establishes promising effective paradigm for black-box LLM distillation, enabling on-policy training without teacher internal access.

Abstract: Black-box distillation creates student large language models (LLMs) by learning from a proprietary teacher model’s text outputs alone, without access to its internal logits or parameters. In this work, we introduce Generative Adversarial Distillation (GAD), which enables on-policy and black-box distillation. GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from the teacher LLM’s, creating a minimax game. The discriminator acts as an on-policy reward model that co-evolves with the student, providing stable, adaptive feedback. Experimental results show that GAD consistently surpasses the commonly used sequence-level knowledge distillation. In particular, Qwen2.5-14B-Instruct (student) trained with GAD becomes comparable to its teacher, GPT-5-Chat, on the LMSYS-Chat automatic evaluation. The results establish GAD as a promising and effective paradigm for black-box LLM distillation.

[130] Donors and Recipients: On Asymmetric Transfer Across Tasks and Languages with Parameter-Efficient Fine-Tuning

Kajetan Dymkiewicz, Ivan Vulic, Helen Yannakoudakis, Eilam Shapira, Roi Reichart, Anna Korhonen

Main category: cs.CL

TL;DR: The paper investigates how LoRA fine-tuning on one task-language combination transfers to other tasks and languages, finding asymmetric transfer patterns with matched-task cross-language transfer being most effective.

Details

Motivation: While LLMs perform well across tasks and languages, it's unclear how improvements in one task or language affect others. The study aims to systematically understand transfer patterns in multilingual fine-tuning.

Method: Controlled LoRA fine-tuning study across multiple open-weight LLM families and scales, using standardized grid of 11 languages and 4 benchmarks. Fine-tune each model on single task-language source, then measure transfer to all other task-language target pairs. Decompose transfer into three regimes: matched-task cross-language, matched-language cross-task, and cross-task cross-language.

Result: Single-source fine-tuning yields net positive uplift but gains are strongly asymmetric. Matched-task cross-language transfer is most effective and predictable, driven by target language identity rather than model architecture. High-resource languages and broad semantic tasks act as efficient recipients absorbing gains from diverse sources, while specialized tasks and lower-resource languages are more isolated.

Conclusion: Effective fine-tuning requires navigating donor-recipient roles to maximize downstream gains, with stable hierarchies where certain language-task combinations serve as better sources or recipients for transfer learning.

Abstract: Large language models (LLMs) perform strongly across tasks and languages, yet how improvements in one task or language affect other tasks and languages remains poorly understood. We conduct a controlled LoRA fine-tuning study across multiple open-weight LLM families and scales, using a standardised grid of 11 languages and four benchmarks. We fine-tune each model on a single task-language source and measure transfer when evaluated on all other task-language target pairs. We decompose transfer into three regimes: (i) Matched-Task (Cross-Language), (ii) Matched-Language (Cross-Task), and (iii) Cross-Task (Cross-Language). Single-source fine-tuning yields a net positive uplift across regimes, but the gains are strongly asymmetric. Matched-Task (Cross-Language) transfer emerges as the most effective and predictable regime, driven principally by the identity of the target language rather than model architecture. We identify a stable hierarchy where high-resource languages and broad semantic tasks act as efficient recipients that absorb gains from diverse sources, while specialised tasks and lower-resource languages are more isolated. These results imply that effective fine-tuning requires navigating donor-recipient roles to maximise downstream gains.

[131] Non-Linear Scoring Model for Translation Quality Evaluation

Serge Gladkoff, Lifeng Han, Katerina Gasova

Main category: cs.CL

TL;DR: Non-linear scoring model for translation quality evaluation that uses logarithmic error tolerance scaling instead of linear extrapolation, better aligning with human perception across varying text lengths.

Details

Motivation: Traditional linear error-to-penalty scaling in translation quality evaluation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, creating misalignment with expert intuition.

Method: Proposes a calibrated two-parameter logarithmic model E(x) = a * ln(1 + b * x) based on empirical data showing error tolerance grows logarithmically with sample size. Uses psychophysical principles (Weber-Fechner law, Cognitive Load Theory) and anchors to reference tolerance calibrated from two tolerance points with one-dimensional root-finding.

Result: Empirical data from three large-scale enterprise environments confirms acceptable error counts grow logarithmically, not linearly. The model improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations, with explicit intervals where linear approximation stays within +/-20% relative error.

Conclusion: The non-linear scoring model advances translation quality evaluation toward more accurate and scalable assessment, provides stronger basis for AI-based document-level evaluation aligned with human judgment, and can be integrated into existing workflows with only a dynamic tolerance function added.

Abstract: Analytic Translation Quality Evaluation (TQE), based on Multidimensional Quality Metrics (MQM), traditionally uses a linear error-to-penalty scale calibrated to a reference sample of 1000-2000 words. However, linear extrapolation biases judgment on samples of different sizes, over-penalizing short samples and under-penalizing long ones, producing misalignment with expert intuition. Building on the Multi-Range framework, this paper presents a calibrated, non-linear scoring model that better reflects how human content consumers perceive translation quality across samples of varying length. Empirical data from three large-scale enterprise environments shows that acceptable error counts grow logarithmically, not linearly, with sample size. Psychophysical and cognitive evidence, including the Weber-Fechner law and Cognitive Load Theory, supports this premise by explaining why the perceptual impact of additional errors diminishes while the cognitive burden grows with scale. We propose a two-parameter model E(x) = a * ln(1 + b * x), a, b > 0, anchored to a reference tolerance and calibrated from two tolerance points using a one-dimensional root-finding step. The model yields an explicit interval within which the linear approximation stays within +/-20 percent relative error and integrates into existing evaluation workflows with only a dynamic tolerance function added. The approach improves interpretability, fairness, and inter-rater reliability across both human and AI-generated translations. By operationalizing a perceptually valid scoring paradigm, it advances translation quality evaluation toward more accurate and scalable assessment. The model also provides a stronger basis for AI-based document-level evaluation aligned with human judgment. Implementation considerations for CAT/LQA systems and implications for human and AI-generated text evaluation are discussed.

[132] Interleaved Latent Visual Reasoning with Selective Perceptual Modeling

Shuai Dong, Siyuan Wang, Xingyu Liu, Chenglin Li, Haowen Hou, Zhongyu Wei

Main category: cs.CL

TL;DR: ILVR is a new framework that combines interleaved reasoning with latent visual representations to improve multimodal reasoning while reducing computational costs.

Details

Motivation: Existing visual reasoning methods either have high computational costs from re-encoding images or sacrifice perceptual precision through over-compression. Current approaches fail to capture intermediate state evolution or maintain precise perceptual modeling.

Method: ILVR interleaves textual generation with evolving latent visual representations as cues. Uses self-supervision with a momentum teacher model that selectively distills relevant features from ground-truth intermediate images into sparse supervision targets, enabling autonomous generation of context-aware visual signals.

Result: Extensive experiments on multimodal reasoning benchmarks show ILVR outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.

Conclusion: ILVR successfully unifies dynamic state evolution with precise perceptual modeling in multimodal reasoning, offering a computationally efficient solution that maintains perceptual accuracy.

Abstract: Interleaved reasoning paradigms enhance Multimodal Large Language Models (MLLMs) with visual feedback but are hindered by the prohibitive computational cost of re-encoding pixel-dense images. A promising alternative, latent visual reasoning, circumvents this bottleneck yet faces limitations: methods either fail to capture intermediate state evolution due to single-step, non-interleaved structures, or sacrifice precise perceptual modeling by over-compressing features. We introduce Interleaved Latent Visual Reasoning (ILVR), a framework that unifies dynamic state evolution with precise perceptual modeling. ILVR interleaves textual generation with latent visual representations that act as specific, evolving cues for subsequent reasoning. Specifically, we employ a self-supervision strategy where a momentum teacher model selectively distills relevant features from ground-truth intermediate images into sparse supervision targets. This adaptive selection mechanism guides the model to autonomously generate context-aware visual signals. Extensive experiments on multimodal reasoning benchmarks demonstrate that ILVR outperforms existing approaches, effectively bridging the gap between fine-grained perception and sequential multimodal reasoning.

[133] TPA: Next Token Probability Attribution for Detecting Hallucinations in RAG

Pengqian Lu, Jie Lu, Anjin Liu, Guangquan Zhang

Main category: cs.CL

TL;DR: TPA is a new method for detecting hallucinations in RAG systems by mathematically attributing token probabilities to seven distinct sources and analyzing their contributions by part-of-speech tags.

Details

Motivation: Prior approaches to hallucination detection in RAG systems are incomplete because they only consider binary conflicts between internal knowledge and retrieved context, ignoring other important LLM components like user queries, previously generated tokens, self tokens, and LayerNorm adjustments.

Method: TPA mathematically attributes each token’s probability to seven sources: Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding. It aggregates these attribution scores by Part-of-Speech tags to quantify how each model component contributes to generating specific linguistic categories.

Result: TPA achieves state-of-the-art performance in hallucination detection by identifying patterns like anomalies where Nouns rely heavily on LayerNorm, which helps effectively identify hallucinated responses.

Conclusion: TPA provides a comprehensive framework for hallucination detection in RAG systems by considering multiple model components and their linguistic contributions, outperforming previous approaches.

Abstract: Detecting hallucinations in Retrieval-Augmented Generation remains a challenge. Prior approaches attribute hallucinations to a binary conflict between internal knowledge stored in FFNs and the retrieved context. However, this perspective is incomplete, failing to account for the impact of other components of the LLM, such as the user query, previously generated tokens, the self token, and the final LayerNorm adjustment. To comprehensively capture the impact of these components on hallucination detection, we propose TPA which mathematically attributes each token’s probability to seven distinct sources: Query, RAG Context, Past Token, Self Token, FFN, Final LayerNorm, and Initial Embedding. This attribution quantifies how each source contributes to the generation of the next token. Specifically, we aggregate these attribution scores by Part-of-Speech (POS) tags to quantify the contribution of each model component to the generation of specific linguistic categories within a response. By leveraging these patterns, such as detecting anomalies where Nouns rely heavily on LayerNorm, TPA effectively identifies hallucinated responses. Extensive experiments show that TPA achieves state-of-the-art performance.

[134] Minimal Clips, Maximum Salience: Long Video Summarization via Key Moment Extraction

Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen

Main category: cs.CL

TL;DR: A clip selection method for video summarization that uses lightweight video captioning and LLMs to identify key moments, achieving near-reference performance with low computational cost.

Details

Motivation: VLMs struggle with long videos where important visual information gets lost, and there's a need for cost-effective tools to analyze lengthy video content.

Method: Divide video into short clips, generate compact visual descriptions using lightweight video captioning model, then use LLM to select K most relevant clips for multimodal summary.

Result: Achieves summarization performance close to reference clips (derived from human-annotated screenplays), captures substantially more relevant information than random selection, while maintaining low computational cost.

Conclusion: The proposed clip selection method effectively identifies key video moments for multimodal summarization, balancing performance with computational efficiency.

Abstract: Vision-Language Models (VLMs) are able to process increasingly longer videos. Yet, important visual information is easily lost throughout the entire context and missed by VLMs. Also, it is important to design tools that enable cost-effective analysis of lengthy video content. In this paper, we propose a clip selection method that targets key video moments to be included in a multimodal summary. We divide the video into short clips and generate compact visual descriptions of each using a lightweight video captioning model. These are then passed to a large language model (LLM), which selects the K clips containing the most relevant visual information for a multimodal summary. We evaluate our approach on reference clips for the task, automatically derived from full human-annotated screenplays and summaries in the MovieSum dataset. We further show that these reference clips (less than 6% of the movie) are sufficient to build a complete multimodal summary of the movies in MovieSum. Using our clip selection method, we achieve a summarization performance close to that of these reference clips while capturing substantially more relevant video information than random clip selection. Importantly, we maintain low computational cost by relying on a lightweight captioning model.

[135] NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, Weiran Shi, Zaiyuan Wang, Daoguang Zan, Chenchen Zhang, Xiaoxu Zhang, Qizhi Chen, Xianfu Cheng, Bo Deng, Qingshui Gu, Kai Hua, Juntao Lin, Pai Liu, Mingchen Li, Xuanguang Pan, Zifan Peng, Yujia Qin, Yong Shan, Zhewen Tan, Weihao Xie, Zihan Wang, Yishuo Yuan, Jiayu Zhang, Enduo Zhao, Yunfei Zhao, He Zhu, Liya Zhu, Chenyang Zou, Ming Ding, Jianpeng Jiao, Jiaheng Liu, Minghao Liu, Qian Liu, Chongyang Tao, Jian Yang, Tong Yang, Zhaoxiang Zhang, Xinjie Chen, Wenhao Huang, Ge Zhang

Main category: cs.CL

TL;DR: NL2Repo Bench is a new benchmark for evaluating coding agents’ ability to generate complete software repositories from natural language requirements, revealing that current agents struggle with long-horizon tasks, achieving below 40% success rates.

Details

Motivation: Existing benchmarks focus on localized code generation, scaffolded completion, or short-term repair tasks, but fail to evaluate the long-horizon capabilities needed for building complete software systems. There's a gap in assessing whether agents can sustain coherent reasoning, planning, and execution over extended horizons required for real-world repository construction.

Method: Created NL2Repo Bench benchmark where agents must autonomously design architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library from a single natural-language requirements document and empty workspace. Evaluated state-of-the-art open- and closed-source models on this benchmark.

Result: Long-horizon repository generation remains largely unsolved - even the strongest agents achieve below 40% average test pass rates and rarely complete entire repositories correctly. Identified fundamental failure modes: premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps.

Conclusion: NL2Repo Bench establishes a rigorous testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.

Abstract: Recent advances in coding agents suggest rapid progress toward autonomous software development, yet existing benchmarks fail to rigorously evaluate the long-horizon capabilities required to build complete software systems. Most prior evaluations focus on localized code generation, scaffolded completion, or short-term repair tasks, leaving open the question of whether agents can sustain coherent reasoning, planning, and execution over the extended horizons demanded by real-world repository construction. To address this gap, we present NL2Repo Bench, a benchmark explicitly designed to evaluate the long-horizon repository generation ability of coding agents. Given only a single natural-language requirements document and an empty workspace, agents must autonomously design the architecture, manage dependencies, implement multi-module logic, and produce a fully installable Python library. Our experiments across state-of-the-art open- and closed-source models reveal that long-horizon repository generation remains largely unsolved: even the strongest agents achieve below 40% average test pass rates and rarely complete an entire repository correctly. Detailed analysis uncovers fundamental long-horizon failure modes, including premature termination, loss of global coherence, fragile cross-file dependencies, and inadequate planning over hundreds of interaction steps. NL2Repo Bench establishes a rigorous, verifiable testbed for measuring sustained agentic competence and highlights long-horizon reasoning as a central bottleneck for the next generation of autonomous coding agents.

[136] Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation

Richard J. Young

Main category: cs.CL

TL;DR: This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) for removing safety refusal mechanisms from LLMs, finding that single-pass methods preserve capabilities better while Bayesian optimization causes variable distribution shifts, with mathematical reasoning being most sensitive to these interventions.

Details

Motivation: Safety alignment mechanisms in LLMs prevent harmful responses but also impede legitimate research applications like cognitive modeling, adversarial testing, and security analysis. While abliteration techniques can surgically remove refusal representations, their relative effectiveness remains uncharacterized.

Method: Evaluated four abliteration tools across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all models and quantitative metrics on subsets dictated by tool support. Compared single-pass methods vs Bayesian-optimized approaches.

Result: Single-pass methods demonstrated superior capability preservation (avg GSM8K change: ErisForge -0.28 pp; DECCP -0.13 pp). Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646). Mathematical reasoning capabilities showed highest sensitivity, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative).

Conclusion: The findings provide evidence-based selection criteria for abliteration tool deployment across diverse model architectures, highlighting that mathematical reasoning is most sensitive to these interventions and that tool selection significantly impacts capability preservation.

Abstract: Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metrics on subsets dictated by tool support. Single-pass methods demonstrated superior capability preservation on the benchmarked subset (avg GSM8K change across three models: ErisForge -0.28 pp; DECCP -0.13 pp), while Bayesian-optimized abliteration produced variable distribution shift (KL divergence: 0.043-1.646) with model-dependent capability impact. These findings provide researchers with evidence-based selection criteria for abliteration tool deployment across diverse model architectures. The principal finding indicates that mathematical reasoning capabilities exhibit the highest sensitivity to abliteration interventions, with GSM8K change ranging from +1.51 pp to -18.81 pp (-26.5% relative) depending on tool selection and model architecture.

[137] Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR

Yunhao Liang, Ruixuan Ying, Bo Li, Hong Li, Kai Yan, Qingwen Li, Min Yang, Okamoto Satoshi, Zhe Cui, Shiwen Ni

Main category: cs.CL

TL;DR: DeepSeek-OCR’s high performance heavily relies on linguistic priors rather than pure visual OCR capabilities; without language support, accuracy drops from 90% to 20%, revealing significant hallucination risks and context limitations.

Details

Motivation: To investigate whether DeepSeek-OCR's claimed high-ratio vision-text compression performance is driven by genuine optical character recognition capabilities or linguistic priors, and to understand its limitations for addressing LLM long-context bottlenecks.

Method: Used sentence-level and word-level semantic corruption to isolate intrinsic OCR capabilities from language priors; benchmarked against 13 baseline models; conducted context stress testing; analyzed correlation between visual token counts and prior reliance.

Result: Performance dropped from ~90% to 20% without linguistic support; traditional pipeline OCR methods showed higher robustness than end-to-end methods; lower visual token counts increased prior reliance and hallucination risks; model collapsed around 10,000 text tokens.

Conclusion: DeepSeek-OCR’s performance is heavily dependent on linguistic priors rather than pure visual OCR, with significant limitations in robustness and context handling; current optical compression techniques may worsen rather than solve long-context bottlenecks.

Abstract: DeepSeek-OCR utilizes an optical 2D mapping approach to achieve high-ratio vision-text compression, claiming to decode text tokens exceeding ten times the input visual tokens. While this suggests a promising solution for the LLM long-context bottleneck, we investigate a critical question: “Visual merit or linguistic crutch - which drives DeepSeek-OCR’s performance?” By employing sentence-level and word-level semantic corruption, we isolate the model’s intrinsic OCR capabilities from its language priors. Results demonstrate that without linguistic support, DeepSeek-OCR’s performance plummets from approximately 90% to 20%. Comparative benchmarking against 13 baseline models reveals that traditional pipeline OCR methods exhibit significantly higher robustness to such semantic perturbations than end-to-end methods. Furthermore, we find that lower visual token counts correlate with increased reliance on priors, exacerbating hallucination risks. Context stress testing also reveals a total model collapse around 10,000 text tokens, suggesting that current optical compression techniques may paradoxically aggravate the long-context bottleneck. This study empirically defines DeepSeek-OCR’s capability boundaries and offers essential insights for future optimizations of the vision-text compression paradigm. We release all data, results and scripts used in this study at https://github.com/dududuck00/DeepSeekOCR.

[138] SiamGPT: Quality-First Fine-Tuning for Stable Thai Text Generation

Thittipat Pairatsuppawat, Abhibhu Tachaapornchai, Paweekorn Kusolsomboon, Chutikan Chaiwong, Thodsaporn Chay-intr, Kobkrit Viriyayudhakorn, Nongnuch Ketui, Aslan B. Wong

Main category: cs.CL

TL;DR: SiamGPT-32B is a Thai-optimized open-weights LLM based on Qwen3-32B, using Quality-First fine-tuning to improve instruction following and linguistic stability for Thai language tasks.

Details

Motivation: Existing open-weights LLMs perform well in English but struggle with unstable generation for Thai under complex instructions, creating deployment challenges for Thai language applications.

Method: Quality-First fine-tuning strategy emphasizing curated supervision over data scale; combines high-complexity English instruction data with Thai-adapted AutoIF framework for instruction and linguistic constraints; uses supervised fine-tuning only without continual pretraining or corpus expansion.

Result: SiamGPT-32B achieves strongest overall performance among similar-scale open-weights Thai models on SEA-HELM benchmark, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

Conclusion: The Quality-First fine-tuning approach effectively improves Thai language performance by focusing on curated supervision and linguistic constraints, enabling better deployment of open-weights models for Thai without requiring extensive pretraining or data expansion.

Abstract: Open-weights large language models remain difficult to deploy for Thai due to unstable generation under complex instructions, despite strong English performance. To mitigate these limitations, We present SiamGPT-32B, an open-weights model based on Qwen3-32B, fine-tuned with a Quality-First strategy emphasizing curated supervision over data scale. The fine-tuning pipeline combines high-complexity English instruction data with a Thai-adapted AutoIF framework for instruction and linguistic constraints. Using supervised fine-tuning only, without continual pretraining or corpus expansion, SiamGPT-32B improves instruction adherence, multi-turn robustness, and linguistic stability. Evaluations on the SEA-HELM benchmark show that SiamGPT-32B achieves the strongest overall performance among similar-scale open-weights Thai models, with consistent gains in instruction following, multi-turn dialogue, and natural language understanding.

[139] InfiniteWeb: Scalable Web Environment Synthesis for GUI Agent Training

Ziyun Zhang, Zezhou Wang, Xiaoyi Zhang, Zongyu Guo, Jiahao Li, Bin Li, Yan Lu

Main category: cs.CL

TL;DR: InfiniteWeb automatically generates functional web environments at scale for training GUI agents, overcoming challenges in realistic website construction and enabling better agent training with verifiable task evaluators.

Details

Motivation: Training GUI agents that interact with graphical interfaces is hindered by the scarcity of suitable environments. While LLMs can generate single webpages, building realistic, functional websites with many interconnected pages remains challenging.

Method: Uses unified specification, task-centric test-driven development, and a combination of website seed with reference design image to ensure diversity. The system generates verifiable task evaluators that provide dense reward signals for reinforcement learning.

Result: InfiniteWeb surpasses commercial coding agents at realistic website construction. GUI agents trained on the generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web benchmarks.

Conclusion: The system effectively addresses the environment scarcity problem for GUI agent training, demonstrating that automatically generated functional web environments can significantly improve agent performance on real-world tasks.

Abstract: GUI agents that interact with graphical interfaces on behalf of users represent a promising direction for practical AI assistants. However, training such agents is hindered by the scarcity of suitable environments. We present InfiniteWeb, a system that automatically generates functional web environments at scale for GUI agent training. While LLMs perform well on generating a single webpage, building a realistic and functional website with many interconnected pages faces challenges. We address these challenges through unified specification, task-centric test-driven development, and a combination of website seed with reference design image to ensure diversity. Our system also generates verifiable task evaluators enabling dense reward signals for reinforcement learning. Experiments show that InfiniteWeb surpasses commercial coding agents at realistic website construction, and GUI agents trained on our generated environments achieve significant performance improvements on OSWorld and Online-Mind2Web, demonstrating the effectiveness of proposed system.

[140] Distilling the Essence: Efficient Reasoning Distillation via Sequence Truncation

Wei-Rui Chen, Vignesh Kothapalli, Ata Fatahibaarzi, Hejian Sang, Shao Tang, Qingquan Song, Zhipeng Wang, Muhammad Abdul-Mageed

Main category: cs.CL

TL;DR: Selective knowledge distillation over only CoT tokens can achieve 91% of full-sequence performance while cutting training costs by 50%.

Details

Motivation: Knowledge distillation over lengthy reasoning sequences (prompt, chain-of-thought, answer) is computationally expensive, creating a need for more efficient distillation methods.

Method: Analyze supervision allocation across different sequence sections, establish truncation protocol to quantify computation-quality tradeoffs, and train on only the first 50% of tokens in each sequence.

Result: Training on only the first 50% of tokens retains ~91% of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about 50% each.

Conclusion: Selective knowledge distillation focusing on CoT tokens provides an efficient tradeoff between performance and computational cost, enabling more practical distillation of reasoning models.

Abstract: Distilling the capabilities from a large reasoning model (LRM) to a smaller student model often involves training on substantial amounts of reasoning data. However, knowledge distillation (KD) over lengthy sequences with prompt (P), chain-of-thought (CoT), and answer (A) sections makes the process computationally expensive. In this work, we investigate how the allocation of supervision across different sections (P, CoT, A) affects student performance. Our analysis shows that selective KD over only the CoT tokens can be effective when the prompt and answer information is encompassed by it. Building on this insight, we establish a truncation protocol to quantify computation-quality tradeoffs as a function of sequence length. We observe that beyond a specific length, longer training sequences provide marginal returns for downstream performance but require substantially higher memory and FLOPs. To this end, training on only the first $50%$ of tokens of every training sequence can retain, on average, $\approx91%$ of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by about $50%$ each. Codes are available at https://github.com/weiruichen01/distilling-the-essence.

[141] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics

Conrad Borchers, Manit Patel, Seiyon M. Lee, Anthony F. Botelho

Main category: cs.CL

TL;DR: Paper presents analytics framework separating content signals from rater tendencies in open-ended response scoring, using teacher priors and embeddings to improve grading transparency.

Details

Motivation: Automated scoring of open-ended responses often conflates student content with teacher grading tendencies, lacking transparency and auditability in assessment practices.

Method: Uses ASSISTments math responses with teacher histories as dynamic priors, sentence embeddings, centroid normalization, response-problem embedding differences, and explicit teacher effect modeling to reduce confounds.

Result: Teacher priors heavily influence predictions; best results (AUC~~0.815) combine priors with content embeddings vs. content-only models (AUC~~0.626). Adjusting for rater effects improves feature selection from embeddings.

Conclusion: Framework transforms embeddings into learning analytics for reflection, enabling examination of where grading practices align or conflict with evidence of student reasoning and learning.

Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and represent text with sentence embeddings. We apply centroid normalization and response-problem embedding differences, and explicitly model teacher effects with priors to reduce problem- and teacher-related confounds. Temporally-validated linear models quantify the contributions of each signal, and model disagreements surface observations for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC~~0.815), while content-only models remain above chance but substantially weaker (AUC~~0.626). Adjusting for rater effects sharpens the selection of features derived from content representations, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.

[142] From Policy to Logic for Efficient and Interpretable Coverage Assessment

Rhitabrat Pokharel, Hamid Reza Hassanzadeh, Ameeta Agrawal

Main category: cs.CL

TL;DR: Hybrid system combining LLMs with symbolic reasoning for medical policy review achieves 44% cost reduction and 4.5% F1 improvement.

Details

Motivation: LLMs show promise for legal/policy interpretation but suffer from hallucinations and inconsistencies, especially critical in medical coverage policy review where human experts need reliable, accurate information.

Method: Hybrid approach pairing coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales while minimizing LLM inferences.

Result: Achieves 44% reduction in inference cost alongside 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.

Conclusion: The hybrid system successfully supports human reviewers by making policy interpretation more efficient and interpretable while reducing costs and improving accuracy.

Abstract: Large Language Models (LLMs) have demonstrated strong capabilities in interpreting lengthy, complex legal and policy language. However, their reliability can be undermined by hallucinations and inconsistencies, particularly when analyzing subjective and nuanced documents. These challenges are especially critical in medical coverage policy review, where human experts must be able to rely on accurate information. In this paper, we present an approach designed to support human reviewers by making policy interpretation more efficient and interpretable. We introduce a methodology that pairs a coverage-aware retriever with symbolic rule-based reasoning to surface relevant policy language, organize it into explicit facts and rules, and generate auditable rationales. This hybrid system minimizes the number of LLM inferences required which reduces overall model cost. Notably, our approach achieves a 44% reduction in inference cost alongside a 4.5% improvement in F1 score, demonstrating both efficiency and effectiveness.

[143] Surprisal and Metaphor Novelty: Moderate Correlations and Divergent Scaling Effects

Omar Momen, Emilie Sitter, Berenike Herrmann, Sina Zarrieß

Main category: cs.CL

TL;DR: Surprisal from language models shows moderate correlation with metaphor novelty scores, but exhibits divergent scaling patterns: inverse scaling on corpus data vs. improved scaling on synthetic data.

Details

Motivation: Metaphor comprehension involves complex semantic processing and linguistic creativity, making it an interesting test case for evaluating language models' understanding of novel linguistic expressions.

Method: Analyzed surprisal from 16 LM variants on both corpus-based and synthetic metaphor novelty datasets using a cloze-style surprisal method that conditions on full-sentence context.

Result: LMs show significant moderate correlations with metaphor novelty scores/labels. Divergent scaling patterns: correlation decreases with model size on corpus data (inverse scaling), but increases on synthetic data (Quality-Power Hypothesis).

Conclusion: Surprisal can partially account for metaphor novelty annotations but remains a limited metric of linguistic creativity, with different scaling behaviors depending on dataset type.

Abstract: Novel metaphor comprehension involves complex semantic processes and linguistic creativity, making it an interesting task for studying language models (LMs). This study investigates whether surprisal, a probabilistic measure of predictability in LMs, correlates with different metaphor novelty datasets. We analyse surprisal from 16 LM variants on corpus-based and synthetic metaphor novelty datasets. We explore a cloze-style surprisal method that conditions on full-sentence context. Results show that LMs yield significant moderate correlations with scores/labels of metaphor novelty. We further identify divergent scaling patterns: on corpus-based data, correlation strength decreases with model size (inverse scaling effect), whereas on synthetic data it increases (Quality-Power Hypothesis). We conclude that while surprisal can partially account for annotations of metaphor novelty, it remains a limited metric of linguistic creativity.

[144] MiMo-V2-Flash Technical Report

Xiaomi LLM-Core Team, :, Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, Gang Xie, Hailin Zhang, Hanglong Lv, Hanyu Li, Heyu Chen, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Lei Li, Liang Zhao, Linghao Zhang, Peidian Li, Qianli Chen, Shaohui Liu, Shihua Yu, Shijie Cao, Shimao Chen, Shouqiu Yu, Shuo Liu, Tianling Zhou, Weijiang Su, Weikun Wang, Wenhan Ma, Xiangwei Deng, Bohan Mao, Bowen Ye, Can Cai, Chenghua Wang, Chengxuan Zhu, Chong Ma, Chun Chen, Chunan Li, Dawei Zhu, Deshan Xiao, Dong Zhang, Duo Zhang, Fangyue Liu, Feiyu Yang, Fengyuan Shi, Guoan Wang, Hao Tian, Hao Wu, Heng Qu, Hongfei Yi, Hongxu An, Hongyi Guan, Xing Zhang, Yifan Song, Yihan Yan, Yihao Zhao, Yingchun Lai, Yizhao Gao, Yu Cheng, Yuanyuan Tian, Yudong Wang, Zhen Tang, Zhengju Tang, Zhengtao Wen, Zhichao Song, Zhixian Zheng, Zihan Jiang, Jian Wen, Jiarui Sun, Jiawei Li, Jinlong Xue, Jun Xia, Kai Fang, Menghang Zhu, Nuo Chen, Qian Tu, Qihao Zhang, Qiying Wang, Rang Li, Rui Ma, Shaolei Zhang, Shengfan Wang, Shicheng Li, Shuhao Gu, Shuhuai Ren, Sirui Deng, Tao Guo, Tianyang Lu, Weiji Zhuang, Weikang Zhang, Weimin Xiong, Wenshan Huang, Wenyu Yang, Xin Zhang, Xing Yong, Xu Wang, Xueyang Xie, Yilin Jiang, Yixin Yang, Yongzhe He, Yu Tu, Yuanliang Dong, Yuchen Liu, Yue Ma, Yue Yu, Yuxing Xiang, Zhaojun Huang, Zhenru Lin, Zhipeng Xu, Zhiyang Chen, Zhonghua Deng, Zihan Zhang, Zihao Yue

Main category: cs.CL

TL;DR: MiMo-V2-Flash is a 309B parameter Mixture-of-Experts model with 15B active parameters, featuring hybrid attention architecture and multi-token prediction, achieving competitive performance with fewer parameters and faster inference via speculative decoding.

Details

Motivation: To create a highly efficient large language model that combines strong reasoning and agentic capabilities while using significantly fewer parameters than top-tier models, enabling faster inference and more efficient scaling through novel distillation techniques.

Method: Uses Mixture-of-Experts architecture with hybrid attention (Sliding Window + global attention at 5:1 ratio), pre-trained on 27T tokens with Multi-Token Prediction. Introduces Multi-Teacher On-Policy Distillation for efficient post-training, where domain-specialized teachers provide dense token-level rewards.

Result: Achieves performance rivaling top models like DeepSeek-V3.2 and Kimi-K2 with only 1/2 to 1/3 of their parameters. Enables 2.6x decoding speedup via speculative decoding using MTP as draft model, with up to 3.6 acceptance length.

Conclusion: MiMo-V2-Flash demonstrates that efficient architecture design combined with novel distillation techniques can create competitive models with significantly reduced parameter counts and improved inference speeds, promoting open research through model release.

Abstract: We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.

[145] NorwAI’s Large Language Models: Technical Report

Jon Atle Gulla, Peng Liu, Lemei Zhang

Main category: cs.CL

TL;DR: NorwAI developed a family of Norwegian/Scandinavian LLMs using diverse Transformer architectures, trained on 25B-88.45B tokens with Norwegian-extended tokenizers, featuring strong instruction-tuned variants for practical deployment.

Details

Motivation: Norwegian (spoken by ~5M people) is underrepresented in major NLP breakthroughs, creating a gap that needs to be addressed for Scandinavian language support.

Method: Built models on diverse Transformer architectures (GPT, Mistral, Llama2, Mixtral, Magistral), either pretrained from scratch or continually pretrained on 25B-88.45B tokens using Norwegian-extended tokenizers with advanced post-training strategies.

Result: Developed a family of Norwegian/Scandinavian LLMs with instruction-tuned variants (e.g., Mistral-7B-Instruct, Mixtral-8x7B-Instruct) showing strong assistant-style capabilities for practical deployment.

Conclusion: The NorwAI LLMs are openly available to Nordic organizations, companies, and students for research/experimental use, addressing the Norwegian language gap in NLP with detailed documentation provided.

Abstract: Norwegian, spoken by approximately five million people, remains underrepresented in many of the most significant breakthroughs in Natural Language Processing (NLP). To address this gap, the NorLLM team at NorwAI has developed a family of models specifically tailored to Norwegian and other Scandinavian languages, building on diverse Transformer-based architectures such as GPT, Mistral, Llama2, Mixtral and Magistral. These models are either pretrained from scratch or continually pretrained on 25B - 88.45B tokens, using a Norwegian-extended tokenizer and advanced post-training strategies to optimize performance, enhance robustness, and improve adaptability across various real-world tasks. Notably, instruction-tuned variants (e.g., Mistral-7B-Instruct and Mixtral-8x7B-Instruct) showcase strong assistant-style capabilities, underscoring their potential for practical deployment in interactive and domain-specific applications. The NorwAI large language models are openly available to Nordic organizations, companies and students for both research and experimental use. This report provides detailed documentation of the model architectures, training data, tokenizer design, fine-tuning strategies, deployment, and evaluations.

[146] BaseCal: Unsupervised Confidence Calibration via Base Model Signals

Hexiang Tan, Wanli Yang, Junwei Zhang, Xin Chen, Rui Tang, Du Su, Jingang Wang, Yuanzhuo Wang, Fei Sun, Xueqi Cheng

Main category: cs.CL

TL;DR: BaseCal: A method to calibrate overconfident post-trained LLMs using their base LLMs as reference, reducing calibration error by 42.9% without human labels or model modifications.

Details

Motivation: Post-trained LLMs (PoLLMs) are severely overconfident, compromising trust in their outputs, while their corresponding base LLMs remain well-calibrated. This creates an opportunity to use base LLMs as calibration references.

Method: Two approaches: 1) BaseCal-ReEval: Feed PoLLM responses into base LLM to get average probabilities as confidence (effective but adds inference overhead). 2) BaseCal-Proj: Train lightweight projection to map PoLLM’s final-layer hidden states to base LLM’s states, then use base LLM’s output layer to derive calibrated confidence.

Result: Experiments across five datasets and three LLM families show BaseCal reduces Expected Calibration Error (ECE) by an average of 42.90% compared to the best unsupervised baselines.

Conclusion: BaseCal provides an effective unsupervised, plug-and-play solution for calibrating PoLLM confidence using base LLMs as reference, without requiring human labels or model modifications.

Abstract: Reliable confidence is essential for trusting the outputs of LLMs, yet widely deployed post-trained LLMs (PoLLMs) typically compromise this trust with severe overconfidence. In contrast, we observe that their corresponding base LLMs often remain well-calibrated. This naturally motivates us to calibrate PoLLM confidence using the base LLM as a reference. This work proposes two ways to achieve this. A straightforward solution, BaseCal-ReEval, evaluates PoLLM’s responses by feeding them into the base LLM to get average probabilities as confidence. While effective, this approach introduces additional inference overhead. To address this, we propose BaseCal-Proj, which trains a lightweight projection to map the final-layer hidden states of PoLLMs back to those of their base LLMs. These projected states are then processed by the base LLM’s output layer to derive base-calibrated confidence for PoLLM’s responses. Notably, BaseCal is an unsupervised, plug-and-play solution that operates without human labels or LLM modifications. Experiments across five datasets and three LLM families demonstrate the effectiveness of BaseCal, reducing Expected Calibration Error (ECE) by an average of 42.90% compared to the best unsupervised baselines.

[147] Internal Reasoning vs. External Control: A Thermodynamic Analysis of Sycophancy in Large Language Models

Edward Y. Chang

Main category: cs.CL

TL;DR: RCA (Regulated Causal Anchoring) detects sycophancy in LLMs by verifying reasoning trace-output consistency without ground truth, achieving 0% sycophancy while accepting 88% of valid hints.

Details

Motivation: Current remedies for LLM sycophancy (RLHF, self-correction) require ground truth which is often unavailable at inference time and vulnerable to the same biases. They evaluate reasoning outcomes rather than processes.

Method: Regulated Causal Anchoring (RCA) evaluates the reasoning process by verifying whether model outputs follow from their reasoning traces, without requiring ground truth. It detects sycophancy as trace-output inconsistency.

Result: RCA achieves 0.0% sycophancy while accepting 88% of valid hints. It identifies two failures invisible to outcome evaluation: Inverse Scaling (frontier models sycophant more) and Final Output Gap (correct reasoning precedes sycophantic output). Traditional self-correction reduces these to 7-9% but cannot eliminate them.

Conclusion: RCA’s process evaluation operates at inference time, requires no ground truth, and uses an independent judge to break the self-reinforcing bias loop - three properties that outcome evaluation lacks, making it superior for detecting and preventing sycophancy.

Abstract: Large Language Models exhibit sycophancy: prioritizing agreeableness over correctness. Current remedies evaluate reasoning outcomes: RLHF rewards correct answers, self-correction critiques outputs. All require ground truth, which is often unavailable at inference time and vulnerable to the same biases. We explore evaluating the reasoning process instead. Regulated Causal Anchoring (RCA) verifies whether outputs follow from their reasoning traces, without requiring ground truth. Sycophancy manifests as trace-output inconsistency: models derive one answer but output another to please users. RCA detects this inconsistency, achieving 0.0% sycophancy while accepting 88% of valid hints. We identify two failures invisible to outcome evaluation: Inverse Scaling (frontier models sycophant more because rationalization requires capability) and the Final Output Gap (correct reasoning precedes sycophantic output). Traditional self-correction reduces these failures to 7-9% but cannot eliminate them because the model critiques itself with the same biases. RCA’s process evaluation operates at inference time, requires no ground truth, and uses an independent judge that breaks the self-reinforcing bias loop: three properties that outcome evaluation lacks.

[148] Evaluating the Pre-Consultation Ability of LLMs using Diagnostic Guidelines

Jean Seo, Gibaeg Kim, Kihun Shin, Seungseop Lim, Hyunkyung Lee, Wooseok Han, Jongwon Lee, Eunho Yang

Main category: cs.CL

TL;DR: EPAG is a benchmark for evaluating LLMs’ pre-consultation ability using diagnostic guidelines, showing that fine-tuned small models can outperform frontier LLMs, and that more HPI doesn’t always improve diagnosis.

Details

Motivation: To develop a framework for evaluating LLMs' ability to assist in pre-consultation clinical settings by comparing their performance against diagnostic guidelines and disease diagnosis tasks.

Method: Created EPAG benchmark dataset and framework with two evaluation approaches: direct HPI-diagnostic guideline comparison and indirect disease diagnosis. Experiments tested various LLMs including small open-source models fine-tuned on task-specific data versus frontier LLMs.

Result: Fine-tuned small open-source models outperformed frontier LLMs in pre-consultation tasks. Increased HPI amount didn’t necessarily improve diagnostic performance. Language of pre-consultation influenced dialogue characteristics.

Conclusion: EPAG provides a valuable evaluation framework for LLM applications in clinical settings. Task-specific fine-tuning can be more effective than using larger frontier models, and careful consideration of HPI quantity and language is important for optimal pre-consultation performance.

Abstract: We introduce EPAG, a benchmark dataset and framework designed for Evaluating the Pre-consultation Ability of LLMs using diagnostic Guidelines. LLMs are evaluated directly through HPI-diagnostic guideline comparison and indirectly through disease diagnosis. In our experiments, we observe that small open-source models fine-tuned with a well-curated, task-specific dataset can outperform frontier LLMs in pre-consultation. Additionally, we find that increased amount of HPI (History of Present Illness) does not necessarily lead to improved diagnostic performance. Further experiments reveal that the language of pre-consultation influences the characteristics of the dialogue. By open-sourcing our dataset and evaluation pipeline on https://github.com/seemdog/EPAG, we aim to contribute to the evaluation and further development of LLM applications in real-world clinical settings.

[149] Agent-Dice: Disentangling Knowledge Updates via Geometric Consensus for Agent Continual Learning

Zheng Wu, Xingyu Lou, Xinbei Ma, Yansi Li, Weiwen Liu, Weinan Zhang, Jun Wang, Zhuosheng Zhang

Main category: cs.CL

TL;DR: Agent-Dice is a parameter fusion framework that addresses catastrophic forgetting in LLM-based agents by distinguishing between shared common knowledge and conflicting task-specific knowledge through directional consensus evaluation.

Details

Motivation: LLM-based agents face the stability-plasticity dilemma when learning new tasks continually, suffering from catastrophic forgetting. The core issue is failure to distinguish between common knowledge shared across tasks and conflicting knowledge from task-specific interference.

Method: Two-stage parameter fusion framework: 1) geometric consensus filtering to prune conflicting gradients, and 2) curvature-based importance weighting to amplify shared semantics. Uses directional consensus evaluation to disentangle knowledge updates.

Result: Agent-Dice demonstrates outstanding continual learning performance with minimal computational overhead and parameter updates in experiments on GUI agents and tool-use agent domains.

Conclusion: The framework successfully addresses the stability-plasticity dilemma by explicitly separating shared and conflicting knowledge, providing both practical performance improvements and theoretical insights into the origins of the problem.

Abstract: Large Language Model (LLM)-based agents significantly extend the utility of LLMs by interacting with dynamic environments. However, enabling agents to continually learn new tasks without catastrophic forgetting remains a critical challenge, known as the stability-plasticity dilemma. In this work, we argue that this dilemma fundamentally arises from the failure to explicitly distinguish between common knowledge shared across tasks and conflicting knowledge introduced by task-specific interference. To address this, we propose Agent-Dice, a parameter fusion framework based on directional consensus evaluation. Concretely, Agent-Dice disentangles knowledge updates through a two-stage process: geometric consensus filtering to prune conflicting gradients, and curvature-based importance weighting to amplify shared semantics. We provide a rigorous theoretical analysis that establishes the validity of the proposed fusion scheme and offers insight into the origins of the stability-plasticity dilemma. Extensive experiments on GUI agents and tool-use agent domains demonstrate that Agent-Dice exhibits outstanding continual learning performance with minimal computational overhead and parameter updates. The codes are available at https://github.com/Wuzheng02/Agent-Dice.

[150] VotIE: Information Extraction from Meeting Minutes

José Pedro Evans, Luís Filipe Cunha, Purificação Silvano, Alípio Jorge, Nuno Guimarães, Sérgio Nunes, Ricardo Campos

Main category: cs.CL

TL;DR: VotIE is a new information extraction task for identifying voting events in municipal meeting minutes, with experiments showing fine-tuned encoders perform best in-domain but few-shot LLMs generalize better across municipalities.

Details

Motivation: Municipal meeting minutes contain voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, posing significant challenges for automated extraction compared to standardized parliamentary proceedings.

Method: Introduced VotIE (Voting Information Extraction) task and established first benchmark using Portuguese municipal minutes from CitiLink corpus. Compared fine-tuned encoders (XLM-R-CRF) with generative approaches under both in-domain and cross-municipality evaluation settings.

Result: 1) In-domain: Fine-tuned XLM-R-CRF achieved 93.2% macro F1, outperforming generative approaches. 2) Cross-municipality: Fine-tuned models suffered substantial performance degradation, while few-shot LLMs showed greater robustness with significantly smaller performance declines.

Conclusion: While few-shot LLMs demonstrate better generalization across municipalities, their high computational cost makes lightweight fine-tuned encoders more practical for large-scale deployment. The benchmark, models, and evaluation framework are publicly released for reproducible research.

Abstract: Municipal meeting minutes record key decisions in local democratic processes. Unlike parliamentary proceedings, which typically adhere to standardized formats, they encode voting outcomes in highly heterogeneous, free-form narrative text that varies widely across municipalities, posing significant challenges for automated extraction. In this paper, we introduce VotIE (Voting Information Extraction), a new information extraction task aimed at identifying structured voting events in narrative deliberative records, and establish the first benchmark for this task using Portuguese municipal minutes, building on the recently introduced CitiLink corpus. Our experiments yield two key findings. First, under standard in-domain evaluation, fine-tuned encoders, specifically XLM-R-CRF, achieve the strongest performance, reaching 93.2% macro F1, outperforming generative approaches. Second, in a cross-municipality setting that evaluates transfer to unseen administrative contexts, these models suffer substantial performance degradation, whereas few-shot LLMs demonstrate greater robustness, with significantly smaller declines in performance. Despite this generalization advantage, the high computational cost of generative models currently constrains their practicality. As a result, lightweight fine-tuned encoders remain a more practical option for large-scale, real-world deployment. To support reproducible research in administrative NLP, we publicly release our benchmark, trained models, and evaluation framework.

[151] All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection

Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Chen Xu, Ziyang Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou

Main category: cs.CL

TL;DR: RFC Bench is a benchmark for evaluating LLMs on financial misinformation detection in realistic news contexts, featuring two tasks: reference-free detection and comparative diagnosis using paired original/perturbed inputs.

Details

Motivation: Financial misinformation in news requires nuanced detection because meaning emerges from dispersed contextual cues, and current models need better evaluation for real-world financial misinformation scenarios.

Method: Created RFC Bench benchmark operating at paragraph level with two tasks: 1) reference-free misinformation detection, and 2) comparison-based diagnosis using paired original and perturbed inputs.

Result: Models perform substantially better with comparative context than in reference-free settings, which expose significant weaknesses including unstable predictions and elevated invalid outputs.

Conclusion: Current models struggle to maintain coherent belief states without external grounding, and RFC Bench provides a structured testbed for studying reference-free reasoning and advancing reliable financial misinformation detection.

Abstract: We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference free misinformation detection and comparison based diagnosis using paired original perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference free reasoning and advancing more reliable financial misinformation detection in real world settings.

cs.CV

[152] Beyond Binary Preference: Aligning Diffusion Models to Fine-grained Criteria by Decoupling Attributes

Chenye Meng, Zejian Li, Zhongni Liu, Yize Li, Changle Xie, Kaixin Jia, Ling Yang, Huanghuang Deng, Shiying Ding, Shengyuan Zhang, Jiayi Li, Lingyun Sun

Main category: cs.CV

TL;DR: CPO (Complex Preference Optimization) aligns diffusion models with hierarchical, fine-grained human expertise using domain-specific criteria and auxiliary diffusion models, improving painting generation quality.

Details

Motivation: Current post-training alignment methods use oversimplified signals (scalar rewards or binary preferences), which fail to capture the hierarchical and fine-grained nature of complex human expertise needed for high-quality image generation.

Method: Two-stage framework: 1) Construct hierarchical evaluation criteria with domain experts, decomposing image quality into tree-structured positive/negative attributes. 2) Inject domain knowledge via Supervised Fine-Tuning of an auxiliary diffusion model. 3) Introduce CPO that extends DPO to align target diffusion model with non-binary, hierarchical criteria by simultaneously maximizing positive attribute probabilities and minimizing negative ones using the auxiliary model.

Result: Extensive experiments in painting generation domain show CPO significantly enhances generation quality and alignment with expertise, using an annotated dataset with fine-grained attributes based on the constructed criteria.

Conclusion: CPO opens new avenues for fine-grained criteria alignment by enabling diffusion models to better capture complex human expertise through hierarchical, multi-attribute evaluation frameworks.

Abstract: Post-training alignment of diffusion models relies on simplified signals, such as scalar rewards or binary preferences. This limits alignment with complex human expertise, which is hierarchical and fine-grained. To address this, we first construct a hierarchical, fine-grained evaluation criteria with domain experts, which decomposes image quality into multiple positive and negative attributes organized in a tree structure. Building on this, we propose a two-stage alignment framework. First, we inject domain knowledge to an auxiliary diffusion model via Supervised Fine-Tuning. Second, we introduce Complex Preference Optimization (CPO) that extends DPO to align the target diffusion to our non-binary, hierarchical criteria. Specifically, we reformulate the alignment problem to simultaneously maximize the probability of positive attributes while minimizing the probability of negative attributes with the auxiliary diffusion. We instantiate our approach in the domain of painting generation and conduct CPO training with an annotated dataset of painting with fine-grained attributes based on our criteria. Extensive experiments demonstrate that CPO significantly enhances generation quality and alignment with expertise, opening new avenues for fine-grained criteria alignment.

[153] CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models

Tobia Poppi, Burak Uzkent, Amanmeet Garg, Lucas Porto, Garin Kessler, Yezhou Yang, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara, Florian Schiffers

Main category: cs.CV

TL;DR: CounterVid: A framework for generating counterfactual videos to address VLM hallucinations in action and temporal reasoning, using synthetic preference pairs and MixDPO fine-tuning.

Details

Motivation: Video-language models suffer from hallucinations, especially in action and temporal reasoning, due to over-reliance on language priors rather than visual dynamics. Existing mitigation strategies fail to address the root cause.

Method: Propose scalable counterfactual video generation framework combining multimodal LLMs for action proposal/editing guidance with diffusion models. Build CounterVid dataset (~26k preference pairs) and introduce MixDPO for joint textual and visual preference optimization.

Result: Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks.

Conclusion: The proposed counterfactual video generation framework and MixDPO approach effectively address VLM hallucinations in action and temporal reasoning, with promising results on benchmark evaluations.

Abstract: Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. Existing mitigation strategies, such as textual filtering or random video perturbations, often fail to address the root cause: over-reliance on language priors rather than fine-grained visual dynamics. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context. Our pipeline combines multimodal LLMs for action proposal and editing guidance with diffusion-based image and video models to generate semantic hard negatives at scale. Using this framework, we build CounterVid, a synthetic dataset of ~26k preference pairs targeting action recognition and temporal reasoning. We further introduce MixDPO, a unified Direct Preference Optimization approach that jointly leverages textual and visual preferences. Fine-tuning Qwen2.5-VL with MixDPO yields consistent improvements, notably in temporal ordering, and transfers effectively to standard video hallucination benchmarks. Code and models will be made publicly available.

[154] Embedding Textual Information in Images Using Quinary Pixel Combinations

A V Uday Kiran Kandala

Main category: cs.CV

TL;DR: A novel steganography method using quinary pixel intensity combinations in RGB space to embed text in images with minimal distortion and high efficiency.

Details

Motivation: Existing text embedding methods (LSB, PVD, transform domain, deep learning) often create noise, are computationally heavy, or require multiple pixels per character. Need for more efficient, less distorting single-pixel encoding.

Method: Uses quinary combinations of pixel intensities in RGB space - five controlled intensity variations per R/G/B channel create 125 distinct combinations. Maps these combinations to textual symbols (letters, numbers, special characters).

Result: No significant distortion in images (verified by MSE, MAE, SNR, PSNR, SSIM, Histogram, Heatmap). Achieves improved embedding efficiency by encoding complete textual symbol within single RGB pixel.

Conclusion: The proposed method offers efficient, low-distortion text embedding using quinary pixel intensity combinations, outperforming traditional LSB/MSB and complex deep learning approaches in computational efficiency and embedding density.

Abstract: This paper presents a novel technique for embedding textual data into images using quinary combinations of pixel intensities in RGB space. Existing methods predominantly rely on least and most significant bit (LSB & MSB) manipulation, Pixel Value Differencing (PVD), spatial perturbations in RGB channels, transform domain based methods, Quantization methods, Edge and Region based methods and more recently through deep learning methods and generative AI techniques for hiding textual information in spatial domain of images. Most of them are dependent on pixel intensity flipping over multiple pixels, such as LSB and combination of LSB based methodologies, and on transform coefficients, often resulting in the form of noise. Encoding and Decoding are deterministic in most of the existing approaches and are computationally heavy in case of higher models such as deep learning and gen AI approaches. The proposed method works on quinary pixel intensity combinations in RGB space, where five controlled different pixel intensity variations in each of the R, G, and B channels formulate up to one hundred and twenty five distinct pixel intensity combinations. These combinations are mapped to textual symbols, enabling the representation of uppercase and lowercase alphabetic characters, numeric digits, whitespace, and commonly used special characters. Different metrics such as MSE, MAE, SNR, PSNR, SSIM, Histogram Comparison and Heatmap analysis, were evaluated for both original and encoded images resulting in no significant distortion in the images. Furthermore, the method achieves improved embedding efficiency by encoding a complete textual symbol within a single RGB pixel, in contrast to LSB and MSB based approaches that typically require multiple pixels or multi-step processes, as well as transform and learning based methods that incur higher computational overhead.

[155] Unified Text-Image Generation with Weakness-Targeted Post-Training

Jiahui Chen, Philippe Hansen-Estruch, Xiaochuang Han, Yushi Hu, Emily Dinan, Amita Kamath, Michal Drozdzal, Reyhane Askari-Hemmat, Luke Zettlemoyer, Marjan Ghazvininejad

Main category: cs.CV

TL;DR: The paper proposes a unified multimodal generation architecture that enables autonomous text-to-image synthesis within a single inference process, using reward-weighted post-training with synthetic data to improve performance across multiple benchmarks.

Details

Motivation: Existing unified multimodal generation systems rely on explicit modality switching (generating reasoning text first, then manually switching to image generation), which limits cross-modal coupling and prohibits automatic multimodal generation. The authors aim to achieve fully unified text-image generation where models autonomously transition from textual reasoning to visual synthesis.

Method: The approach uses offline, reward-weighted post-training with fully self-generated synthetic data. The authors explore different post-training data strategies and examine the impact of joint text-image generation on T2I performance, as well as the relative importance of each modality during post-training.

Result: The method enables improvements in multimodal image generation across four diverse T2I benchmarks. The targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Reward-weighting both modalities and strategically designed post-training data proves effective.

Conclusion: Fully unified text-image generation through post-training with reward-weighted synthetic data is an effective approach that outperforms existing modality-switching methods, enabling autonomous multimodal generation within a single inference process.

Abstract: Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.

[156] ReHyAt: Recurrent Hybrid Attention for Video Diffusion Transformers

Mohsen Ghafoorian, Amirhossein Habibian

Main category: cs.CV

TL;DR: ReHyAt introduces a recurrent hybrid attention mechanism combining softmax and linear attention for efficient video generation with linear complexity instead of quadratic.

Details

Motivation: Current transformer-based video diffusion models have quadratic attention complexity that severely limits scalability for longer video sequences, making them impractical for long-duration and on-device generation.

Method: ReHyAt uses a Recurrent Hybrid Attention mechanism that combines softmax attention fidelity with linear attention efficiency, enabling chunk-wise recurrent reformulation with constant memory usage. It includes a lightweight distillation pipeline to efficiently transfer knowledge from existing softmax-based models.

Result: ReHyAt reduces training cost by two orders of magnitude (~160 GPU hours vs. typical quadratic models) while achieving state-of-the-art video quality on VBench and VBench-2.0 benchmarks. It reduces attention cost from quadratic to linear complexity.

Conclusion: ReHyAt provides a practical solution for scalable long-duration and on-device video generation by combining efficiency with quality, offering a recipe that can be applied to future bidirectional softmax-based models.

Abstract: Recent advances in video diffusion models have shifted towards transformer-based architectures, achieving state-of-the-art video generation but at the cost of quadratic attention complexity, which severely limits scalability for longer sequences. We introduce ReHyAt, a Recurrent Hybrid Attention mechanism that combines the fidelity of softmax attention with the efficiency of linear attention, enabling chunk-wise recurrent reformulation and constant memory usage. Unlike the concurrent linear-only SANA Video, ReHyAt’s hybrid design allows efficient distillation from existing softmax-based models, reducing the training cost by two orders of magnitude to ~160 GPU hours, while being competitive in the quality. Our light-weight distillation and finetuning pipeline provides a recipe that can be applied to future state-of-the-art bidirectional softmax-based models. Experiments on VBench and VBench-2.0, as well as a human preference study, demonstrate that ReHyAt achieves state-of-the-art video quality while reducing attention cost from quadratic to linear, unlocking practical scalability for long-duration and on-device video generation. Project page is available at https://qualcomm-ai-research.github.io/rehyat.

[157] SCAR-GS: Spatial Context Attention for Residuals in Progressive Gaussian Splatting

Diego Revilla, Pooja Suresh, Anand Bhojan, Ooi Wei Tsang

Main category: cs.CV

TL;DR: Progressive 3D Gaussian Splatting compression using Residual Vector Quantization with auto-regressive entropy modeling guided by multi-resolution hash grids.

Details

Motivation: Current 3D Gaussian Splatting models have high storage requirements that hinder cloud/streaming deployment. Existing progressive compression methods use scalar quantization which may not optimally capture correlations in high-dimensional feature vectors, limiting rate-distortion performance.

Method: Introduces a progressive codec that replaces traditional methods with Residual Vector Quantization to compress primitive features. Uses an auto-regressive entropy model guided by a multi-resolution hash grid to predict conditional probabilities of transmitted indices, enabling efficient compression of coarse and refinement layers.

Result: The method provides more efficient compression for 3D Gaussian Splatting models compared to existing progressive compression techniques that use scalar quantization and spatial context models.

Conclusion: Residual Vector Quantization with auto-regressive entropy modeling offers superior compression efficiency for 3D Gaussian Splatting, addressing storage limitations for cloud and streaming deployment.

Abstract: Recent advances in 3D Gaussian Splatting have allowed for real-time, high-fidelity novel view synthesis. Nonetheless, these models have significant storage requirements for large and medium-sized scenes, hindering their deployment over cloud and streaming services. Some of the most recent progressive compression techniques for these models rely on progressive masking and scalar quantization techniques to reduce the bitrate of Gaussian attributes using spatial context models. While effective, scalar quantization may not optimally capture the correlations of high-dimensional feature vectors, which can potentially limit the rate-distortion performance. In this work, we introduce a novel progressive codec for 3D Gaussian Splatting that replaces traditional methods with a more powerful Residual Vector Quantization approach to compress the primitive features. Our key contribution is an auto-regressive entropy model, guided by a multi-resolution hash grid, that accurately predicts the conditional probability of each successive transmitted index, allowing for coarse and refinement layers to be compressed with high efficiency.

[158] Comparative Analysis of Custom CNN Architectures versus Pre-trained Models and Transfer Learning: A Study on Five Bangladesh Datasets

Ibrahim Tanvir, Alif Ruslan, Sartaj Solaiman

Main category: cs.CV

TL;DR: Custom CNNs vs pre-trained models (ResNet-18, VGG-16) tested on 5 Bangladeshi image datasets. Transfer learning with fine-tuning outperforms custom CNNs and feature extraction, achieving up to 76% accuracy improvements and perfect 100% on Road Damage BD dataset.

Details

Motivation: To provide practical guidance for practitioners on selecting appropriate deep learning approaches by comparing custom-built CNNs against popular pre-trained architectures using both feature extraction and transfer learning methods across diverse real-world image classification datasets from Bangladesh.

Method: Comparative analysis of custom CNNs vs pre-trained architectures (ResNet-18, VGG-16) using feature extraction and transfer learning approaches. Evaluated on five Bangladeshi image datasets: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection.

Result: Transfer learning with fine-tuning consistently outperformed both custom CNNs and feature extraction methods, achieving accuracy improvements of 3% to 76% across datasets. ResNet-18 with fine-tuning achieved 100% accuracy on Road Damage BD dataset. Custom CNNs had advantages in model size (3.4M parameters vs 11-134M) and training efficiency on simpler tasks.

Conclusion: Pre-trained models with transfer learning provide superior performance, especially for complex classification tasks with limited data, while custom CNNs offer benefits in model size and efficiency for simpler tasks. The research provides practical insights for selecting approaches based on dataset characteristics, computational resources, and performance requirements.

Abstract: This study presents a comprehensive comparative analysis of custom-built Convolutional Neural Networks (CNNs) against popular pre-trained architectures (ResNet-18 and VGG-16) using both feature extraction and transfer learning approaches. We evaluated these models across five diverse image classification datasets from Bangladesh: Footpath Vision, Auto Rickshaw Detection, Mango Image Classification, Paddy Variety Recognition, and Road Damage Detection. Our experimental results demonstrate that transfer learning with fine-tuning consistently outperforms both custom CNNs built from scratch and feature extraction methods, achieving accuracy improvements ranging from 3% to 76% across different datasets. Notably, ResNet-18 with fine-tuning achieved perfect 100% accuracy on the Road Damage BD dataset. While custom CNNs offer advantages in model size (3.4M parameters vs. 11-134M for pre-trained models) and training efficiency on simpler tasks, pre-trained models with transfer learning provide superior performance, particularly on complex classification tasks with limited training data. This research provides practical insights for practitioners in selecting appropriate deep learning approaches based on dataset characteristics, computational resources, and performance requirements.

[159] PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache

Kunyang Li, Mubarak Shah, Yuzhang Shang

Main category: cs.CV

TL;DR: PackCache: Training-free KV-cache management method for unified autoregressive video generation that dynamically compacts KV cache using condition anchoring, cross-frame decay modeling, and spatially preserving position embedding, achieving 1.7-2.2x acceleration for 48-frame sequences.

Details

Motivation: KV-cache size in unified autoregressive models grows linearly with generated tokens, becoming the dominant bottleneck for inference efficiency and generative length in video generation. Analysis shows KV-cache tokens have distinct spatiotemporal properties that can be exploited for optimization.

Method: PackCache uses three coordinated mechanisms: 1) condition anchoring preserves semantic references (text/conditioning-image tokens), 2) cross-frame decay modeling allocates cache budget based on temporal distance, and 3) spatially preserving position embedding maintains coherent 3D structure under cache removal.

Result: PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame sequences, with the final four frames (most expensive segment) achieving 2.6x acceleration on A40 and 3.7x on H200 for 48-frame videos.

Conclusion: PackCache effectively addresses KV-cache bottlenecks in unified autoregressive video generation by exploiting spatiotemporal token properties, enabling longer-sequence video generation with significant inference acceleration without requiring training.

Abstract: A unified autoregressive model is a Transformer-based framework that addresses diverse multimodal tasks (e.g., text, image, video) as a single sequence modeling problem under a shared token space. Such models rely on the KV-cache mechanism to reduce attention computation from O(T^2) to O(T); however, KV-cache size grows linearly with the number of generated tokens, and it rapidly becomes the dominant bottleneck limiting inference efficiency and generative length. Unified autoregressive video generation inherits this limitation. Our analysis reveals that KV-cache tokens exhibit distinct spatiotemporal properties: (i) text and conditioning-image tokens act as persistent semantic anchors that consistently receive high attention, and (ii) attention to previous frames naturally decays with temporal distance. Leveraging these observations, we introduce PackCache, a training-free KV-cache management method that dynamically compacts the KV cache through three coordinated mechanisms: condition anchoring that preserves semantic references, cross-frame decay modeling that allocates cache budget according to temporal distance, and spatially preserving position embedding that maintains coherent 3D structure under cache removal. In terms of efficiency, PackCache accelerates end-to-end generation by 1.7-2.2x on 48-frame long sequences, showcasing its strong potential for enabling longer-sequence video generation. Notably, the final four frames - the portion most impacted by the progressively expanding KV-cache and thus the most expensive segment of the clip - PackCache delivers a 2.6x and 3.7x acceleration on A40 and H200, respectively, for 48-frame videos.

[160] Combining facial videos and biosignals for stress estimation during driving

Paraskevi Valergaki, Vassilis C. Nicodemou, Iason Oikonomidis, Antonis Argyros, Anastasios Roussos

Main category: cs.CV

TL;DR: The paper proposes a Transformer-based framework for stress recognition using disentangled 3D facial geometry from EMOCA, showing that 41 of 56 facial coefficients respond to stress comparably to physiological markers, with cross-modal attention fusion achieving best performance.

Details

Motivation: Stress recognition from facial videos is challenging due to stress's subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored.

Method: Analyze stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Use paired hypothesis tests between baseline and stressor phases. Propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies.

Result: 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Cross-Modal Attention fusion of EMOCA and physiological signals achieves best performance (AUROC 92%, Accuracy 86.7%), with EMOCA-gaze fusion also competitive (AUROC 91.8%).

Conclusion: This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition, demonstrating that disentangled 3D facial geometry provides valuable stress indicators comparable to physiological markers.

Abstract: Reliable stress recognition from facial videos is challenging due to stress’s subjective nature and voluntary facial control. While most methods rely on Facial Action Units, the role of disentangled 3D facial geometry remains underexplored. We address this by analyzing stress during distracted driving using EMOCA-derived 3D expression and pose coefficients. Paired hypothesis tests between baseline and stressor phases reveal that 41 of 56 coefficients show consistent, phase-specific stress responses comparable to physiological markers. Building on this, we propose a Transformer-based temporal modeling framework and assess unimodal, early-fusion, and cross-modal attention strategies. Cross-Modal Attention fusion of EMOCA and physiological signals achieves best performance (AUROC 92%, Accuracy 86.7%), with EMOCA-gaze fusion also competitive (AUROC 91.8%). This highlights the effectiveness of temporal modeling and cross-modal attention for stress recognition.

[161] Few-Shot LoRA Adaptation of a Flow-Matching Foundation Model for Cross-Spectral Object Detection

Maxim Clouser, Kia Khezeli, John Kalantari

Main category: cs.CV

TL;DR: A single RGB foundation model (FLUX.1) can be adapted with just 100 paired images to translate RGB to IR/SAR, enabling synthetic data generation that improves object detection in non-visible modalities.

Details

Motivation: Safety-critical applications often use non-visible modalities (IR, SAR) but lack foundation models trained on such data. The paper explores whether RGB foundation models can be repurposed for cross-spectral translation with minimal adaptation.

Method: Fine-tune FLUX.1 Kontext with LoRA modules using only 100 paired images per domain (RGB→IR on KAIST, RGB→SAR on M4-SAR). Use LPIPS on 50 held-out pairs as proxy for downstream performance to select best LoRA hyperparameters.

Result: Lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR/SAR and DETR on KAIST IR. Synthetic IR from external RGB datasets improves KAIST pedestrian detection, and synthetic SAR boosts infrastructure detection when combined with limited real SAR.

Conclusion: Few-shot LoRA adaptation of flow-matching foundation models is a promising approach for extending foundation-style support to non-visible modalities, enabling effective cross-spectral translation with minimal data requirements.

Abstract: Foundation models for vision are predominantly trained on RGB data, while many safety-critical applications rely on non-visible modalities such as infrared (IR) and synthetic aperture radar (SAR). We study whether a single flow-matching foundation model pre-trained primarily on RGB images can be repurposed as a cross-spectral translator using only a few co-measured examples, and whether the resulting synthetic data can enhance downstream detection. Starting from FLUX.1 Kontext, we insert low-rank adaptation (LoRA) modules and fine-tune them on just 100 paired images per domain for two settings: RGB to IR on the KAIST dataset and RGB to SAR on the M4-SAR dataset. The adapted model translates RGB images into pixel-aligned IR/SAR, enabling us to reuse existing bounding boxes and train object detection models purely in the target modality. Across a grid of LoRA hyperparameters, we find that LPIPS computed on only 50 held-out pairs is a strong proxy for downstream performance: lower LPIPS consistently predicts higher mAP for YOLOv11n on both IR and SAR, and for DETR on KAIST IR test data. Using the best LPIPS-selected LoRA adapter, synthetic IR from external RGB datasets (LLVIP, FLIR ADAS) improves KAIST IR pedestrian detection, and synthetic SAR significantly boosts infrastructure detection on M4-SAR when combined with limited real SAR. Our results suggest that few-shot LoRA adaptation of flow-matching foundation models is a promising path toward foundation-style support for non-visible modalities.

[162] Towards Real-world Lens Active Alignment with Unlabeled Data via Domain Adaptation

Wenyong Li, Qi Jiang, Weijian Hu, Kailun Yang, Zhanjun Zhang, Wenjun Tian, Kaiwei Wang, Jian Bai

Main category: cs.CV

TL;DR: DA3 uses domain adaptation to bridge simulation-real gap in optical alignment, improving accuracy by 46% with minimal real data, reducing collection time by 98.7%.

Details

Motivation: Active Alignment is crucial for automated optical assembly, but simulation-trained models suffer from domain gap with real-world images, limiting generalization. Traditional per-model calibration is labor-intensive, while purely simulation approaches don't generalize well to real conditions.

Method: Propose Domain Adaptive Active Alignment (DA3) with: 1) autoregressive domain transformation generator, 2) adversarial-based feature alignment strategy, 3) self-supervised learning to distill real-world domain information, extracting domain-invariant image degradation features for robust misalignment prediction.

Result: DA3 improves accuracy by 46% over purely simulation pipeline. Approaches performance of precisely labeled real-world data from 3 lens samples while reducing on-device data collection time by 98.7%. Validated on two lens types.

Conclusion: Domain adaptation effectively enables simulation-trained models to achieve robust real-world performance, validating digital-twin pipeline as practical solution to enhance efficiency of large-scale optical assembly with minimal real data requirements.

Abstract: Active Alignment (AA) is a key technology for the large-scale automated assembly of high-precision optical systems. Compared with labor-intensive per-model on-device calibration, a digital-twin pipeline built on optical simulation offers a substantial advantage in generating large-scale labeled data. However, complex imaging conditions induce a domain gap between simulation and real-world images, limiting the generalization of simulation-trained models. To address this, we propose augmenting a simulation baseline with minimal unlabeled real-world images captured at random misalignment positions, mitigating the gap from a domain adaptation perspective. We introduce Domain Adaptive Active Alignment (DA3), which utilizes an autoregressive domain transformation generator and an adversarial-based feature alignment strategy to distill real-world domain information via self-supervised learning. This enables the extraction of domain-invariant image degradation features to facilitate robust misalignment prediction. Experiments on two lens types reveal that DA3 improves accuracy by 46% over a purely simulation pipeline. Notably, it approaches the performance achieved with precisely labeled real-world data collected on 3 lens samples, while reducing on-device data collection time by 98.7%. The results demonstrate that domain adaptation effectively endows simulation-trained models with robust real-world performance, validating the digital-twin pipeline as a practical solution to significantly enhance the efficiency of large-scale optical assembly.

[163] Performance Analysis of Image Classification on Bangladeshi Datasets

Mohammed Sami Khan, Fabiha Muniat, Rowzatul Zannat

Main category: cs.CV

TL;DR: Custom CNN vs pre-trained models (VGG-16, ResNet-50, MobileNet) for image classification: Pre-trained models outperform custom CNN in accuracy and convergence, especially with limited data, but custom CNN has fewer parameters and lower computational complexity.

Details

Motivation: The paper addresses the practical dilemma in image classification between designing custom CNNs from scratch versus using established pre-trained architectures, aiming to provide empirical evidence on their comparative performance and trade-offs.

Method: Comparative analysis of a custom-designed CNN trained from scratch versus pre-trained architectures (VGG-16, ResNet-50, MobileNet) using transfer learning. All models evaluated under identical experimental settings with standard metrics: accuracy, precision, recall, and F1-score.

Result: Pre-trained CNN architectures consistently outperform custom CNN in classification accuracy and convergence speed, particularly with limited training data. However, custom CNN shows competitive performance with significantly fewer parameters and reduced computational complexity.

Conclusion: The study highlights trade-offs between model complexity, performance, and computational efficiency, providing practical insights for selecting appropriate CNN architectures based on specific constraints like computational resources and data availability.

Abstract: Convolutional Neural Networks (CNNs) have demonstrated remarkable success in image classification tasks; however, the choice between designing a custom CNN from scratch and employing established pre-trained architectures remains an important practical consideration. In this work, we present a comparative analysis of a custom-designed CNN and several widely used deep learning architectures, including VGG-16, ResNet-50, and MobileNet, for an image classification task. The custom CNN is developed and trained from scratch, while the popular architectures are employed using transfer learning under identical experimental settings. All models are evaluated using standard performance metrics such as accuracy, precision, recall, and F1-score. Experimental results show that pre-trained CNN architectures consistently outperform the custom CNN in terms of classification accuracy and convergence speed, particularly when training data is limited. However, the custom CNN demonstrates competitive performance with significantly fewer parameters and reduced computational complexity. This study highlights the trade-offs between model complexity, performance, and computational efficiency, and provides practical insights into selecting appropriate CNN architectures for image classification problems.

Jusheng Zhang, Yijia Fan, Zimo Wen, Jian Wang, Keze Wang

Main category: cs.CV

TL;DR: Tri MARF is a tri-modal framework using 2D images, text, and 3D point clouds with multi-agent collaboration to improve large-scale 3D object annotation, outperforming existing methods in accuracy and efficiency.

Details

Motivation: 3D object annotation is challenging due to spatial complexity, occlusion, and viewpoint inconsistency. Single-model approaches struggle with these issues, especially for applications in autonomous driving, robotics, and augmented reality.

Method: Tri MARF integrates tri-modal inputs (2D multi-view images, textual descriptions, 3D point clouds) using a multi-agent collaborative architecture. It has three specialized agents: 1) vision-language model agent for multi-view descriptions, 2) information aggregation agent for optimal description selection, and 3) gating agent aligning textual semantics with 3D geometry for refined captioning.

Result: Achieves CLIPScore of 88.7 (vs. prior SOTA), retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and throughput of up to 12,000 objects/hour on a single NVIDIA A100 GPU. Tested on Objaverse, LVIS, Objaverse XL, and ABO datasets.

Conclusion: Tri MARF substantially outperforms existing methods for 3D object annotation by effectively leveraging multi-modal inputs through specialized agent collaboration, demonstrating superior accuracy and efficiency for large-scale applications.

Abstract: Driven by applications in autonomous driving robotics and augmented reality 3D object annotation presents challenges beyond 2D annotation including spatial complexity occlusion and viewpoint inconsistency Existing approaches based on single models often struggle to address these issues effectively We propose Tri MARF a novel framework that integrates tri modal inputs including 2D multi view images textual descriptions and 3D point clouds within a multi agent collaborative architecture to enhance large scale 3D annotation Tri MARF consists of three specialized agents a vision language model agent for generating multi view descriptions an information aggregation agent for selecting optimal descriptions and a gating agent that aligns textual semantics with 3D geometry for refined captioning Extensive experiments on Objaverse LVIS Objaverse XL and ABO demonstrate that Tri MARF substantially outperforms existing methods achieving a CLIPScore of 88 point 7 compared to prior state of the art methods retrieval accuracy of 45 point 2 and 43 point 8 on ViLT R at 5 and a throughput of up to 12000 objects per hour on a single NVIDIA A100 GPU

[165] From Preoperative CT to Postmastoidectomy Mesh Construction:1Mastoidectomy Shape Prediction for Cochlear Implant Surgery

Yike Zhang, Eduardo Davalos, Dingjie Su, Ange Lou, Jack Noble

Main category: cs.CV

TL;DR: A hybrid self-supervised and weakly-supervised learning framework predicts mastoidectomy shape from preoperative CT scans without human annotations, achieving 0.72 Dice score for cochlear implant surgical planning.

Details

Motivation: Accurate mastoidectomy shape prediction from preoperative imaging improves cochlear implant surgical planning, reduces risks, and enhances outcomes, but limited deep-learning studies exist due to challenges in acquiring ground-truth labels.

Method: Proposes a hybrid self-supervised and weakly-supervised learning framework that predicts mastoidectomy region directly from preoperative CT scans where the mastoid remains intact, without requiring human annotations.

Result: Achieves mean Dice score of 0.72 for predicting complex, boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance.

Conclusion: First work integrating self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering robust solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.

Abstract: Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are limited deep-learning-based studies regarding this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work that integrating self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.

[166] CRUNet-MR-Univ: A Foundation Model for Diverse Cardiac MRI Reconstruction

Donghang Lyu, Marius Staring, Hildo Lamb, Mariya Doneva

Main category: cs.CV

TL;DR: CRUNet-MR-Univ is a foundation model for Cardiac MRI reconstruction that uses spatio-temporal correlations and prompt-based priors to handle diverse CMR scenarios, outperforming existing methods.

Details

Motivation: Current deep learning methods for CMR reconstruction lack generalizability due to wide variability in image contrast, sampling patterns, scanner vendors, anatomical structures, and disease types. Most models only handle narrow subsets of these variations, leading to performance degradation with distribution shifts.

Method: Proposes CRUNet-MR-Univ, a foundation model that leverages spatio-temporal correlations and prompt-based priors to effectively handle the full diversity of CMR scans.

Result: The approach consistently outperforms baseline methods across a wide range of settings, demonstrating effectiveness and promise.

Conclusion: CRUNet-MR-Univ represents a unified model capable of generalizing across diverse CMR scenarios, addressing the generalizability limitations of current deep learning methods in CMR reconstruction.

Abstract: In recent years, deep learning has attracted increasing attention in the field of Cardiac MRI (CMR) reconstruction due to its superior performance over traditional methods, particularly in handling higher acceleration factors, highlighting its potential for real-world clinical applications. However, current deep learning methods remain limited in generalizability. CMR scans exhibit wide variability in image contrast, sampling patterns, scanner vendors, anatomical structures, and disease types. Most existing models are designed to handle only a single or narrow subset of these variations, leading to performance degradation when faced with distribution shifts. Therefore, it is beneficial to develop a unified model capable of generalizing across diverse CMR scenarios. To this end, we propose CRUNet-MR-Univ, a foundation model that leverages spatio-temporal correlations and prompt-based priors to effectively handle the full diversity of CMR scans. Our approach consistently outperforms baseline methods across a wide range of settings, highlighting its effectiveness and promise.

[167] Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

Xingjian Diao, Zheyuan Liu, Chunhui Zhang, Weiyi Wu, Keyi Kong, Lin Shi, Kaize Ding, Soroush Vosoughi, Jiang Gui

Main category: cs.CV

TL;DR: GPRO is a meta-reasoning controller that dynamically routes computation among three paths (fast, perception, reasoning) to address overthinking in LVLMs by fixing visual perception failures, improving both accuracy and efficiency.

Details

Motivation: Current LVLMs using chain-of-thought reasoning suffer from overthinking - producing verbose responses for simple queries, leading to inefficiency and degraded accuracy. Prior adaptive reasoning methods overlook the fundamental bottleneck of visual perception failures, which often cause reasoning errors rather than insufficient deliberation.

Method: Proposes Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation at each generation step among: 1) lightweight fast path, 2) slow perception path for re-examining visual inputs, and 3) slow reasoning path for internal self-reflection. Uses teacher models to derive failure attribution supervision from ~790k samples to distinguish perceptual hallucinations from reasoning errors, then trains controller with multi-objective reinforcement learning to optimize accuracy-computation trade-off.

Result: Experiments on five benchmarks show GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.

Conclusion: Stable reasoning in LVLMs critically depends on low-level visual grounding, and addressing visual perception failures through dynamic routing of computation paths can effectively mitigate overthinking while improving both accuracy and efficiency.

Abstract: Large Vision-Language Models (LVLMs) have exhibited strong reasoning capabilities through chain-of-thought mechanisms that generate step-by-step rationales. However, such slow-thinking approaches often lead to overthinking, where models produce excessively verbose responses even for simple queries, resulting in test-time inefficiency and even degraded accuracy. Prior work has attempted to mitigate this issue via adaptive reasoning strategies, but these methods largely overlook a fundamental bottleneck: visual perception failures. We argue that stable reasoning critically depends on low-level visual grounding, and that reasoning errors often originate from imperfect perception rather than insufficient deliberation. To address this limitation, we propose Gated Perception-Reasoning Optimization (GPRO), a meta-reasoning controller that dynamically routes computation among three decision paths at each generation step: a lightweight fast path, a slow perception path for re-examining visual inputs, and a slow reasoning path for internal self-reflection. To learn this distinction, we derive large-scale failure attribution supervision from approximately 790k samples, using teacher models to distinguish perceptual hallucinations from reasoning errors. We then train the controller with multi-objective reinforcement learning to optimize the trade-off between task accuracy and computational cost under uncertainty. Experiments on five benchmarks demonstrate that GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods while generating significantly shorter responses.

[168] UniDrive-WM: Unified Understanding, Planning and Generation World Model For Autonomous Driving

Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, Liu Ren

Main category: cs.CV

TL;DR: UniDrive-WM is a unified vision-language model world model that jointly performs driving-scene understanding, trajectory planning, and future image generation in a single architecture, improving autonomous driving performance.

Details

Motivation: Current autonomous driving approaches treat perception, prediction, and planning as separate modules, lacking tight integration. The authors aim to create a unified world model that jointly handles these tasks to improve driving performance.

Method: Proposes UniDrive-WM, a unified VLM-based world model with three key components: 1) trajectory planner that predicts future trajectories, 2) VLM-based image generator conditioned on trajectories to produce future frames, and 3) iterative refinement where predictions enhance scene understanding and trajectory generation. Also compares discrete vs continuous output representations for future image prediction.

Result: On Bench2Drive benchmark, UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate reduction over previous best method.

Conclusion: Tight integration of VLM-driven reasoning, planning, and generative world modeling provides significant advantages for autonomous driving, demonstrating the effectiveness of unified architectures over modular approaches.

Abstract: World models have become central to autonomous driving, where accurate scene understanding and future prediction are crucial for safe control. Recent work has explored using vision-language models (VLMs) for planning, yet existing approaches typically treat perception, prediction, and planning as separate modules. We propose UniDrive-WM, a unified VLM-based world model that jointly performs driving-scene understanding, trajectory planning, and trajectory-conditioned future image generation within a single architecture. UniDrive-WM’s trajectory planner predicts a future trajectory, which conditions a VLM-based image generator to produce plausible future frames. These predictions provide additional supervisory signals that enhance scene understanding and iteratively refine trajectory generation. We further compare discrete and continuous output representations for future image prediction, analyzing their influence on downstream driving performance. Experiments on the challenging Bench2Drive benchmark show that UniDrive-WM produces high-fidelity future images and improves planning performance by 5.9% in L2 trajectory error and 9.2% in collision rate over the previous best method. These results demonstrate the advantages of tightly integrating VLM-driven reasoning, planning, and generative world modeling for autonomous driving. The project page is available at https://unidrive-wm.github.io/UniDrive-WM .

[169] Vision-Language Agents for Interactive Forest Change Analysis

James Brock, Ce Zhang, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: LLM-driven agent for forest change analysis using satellite imagery, combining change detection and semantic captioning with natural language querying.

Details

Motivation: Need to address challenges in pixel-level change detection and semantic change captioning for forest dynamics, and integrate LLMs with vision-language models for remote sensing image change interpretation.

Method: Proposed LLM-driven agent with multi-level change interpretation vision-language backbone and LLM-based orchestration, using Forest-Change dataset with bi-temporal satellite imagery, change masks, and semantic captions.

Result: Achieved mIoU 67.10% and BLEU-4 40.17% on Forest-Change dataset, and 88.13% mIoU and 34.41% BLEU-4 on LEVIR-MCI-Trees subset.

Conclusion: LLM-driven RSICI systems improve accessibility, interpretability, and efficiency of forest change analysis, with publicly available data and code.

Abstract: Modern forest monitoring workflows increasingly benefit from the growing availability of high-resolution satellite imagery and advances in deep learning. Two persistent challenges in this context are accurate pixel-level change detection and meaningful semantic change captioning for complex forest dynamics. While large language models (LLMs) are being adapted for interactive data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored. To address this gap, we introduce an LLM-driven agent for integrated forest change analysis that supports natural language querying across multiple RSICI tasks. The proposed system builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration. To facilitate adaptation and evaluation in forest environments, we further introduce the Forest-Change dataset, which comprises bi-temporal satellite imagery, pixel-level change masks, and multi-granularity semantic change captions generated using a combination of human annotation and rule-based methods. Experimental results show that the proposed system achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on the Forest-Change dataset, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI benchmark for joint change detection and captioning. These results highlight the potential of interactive, LLM-driven RSICI systems to improve accessibility, interpretability, and efficiency of forest change analysis. All data and code are publicly available at https://github.com/JamesBrockUoB/ForestChat.

[170] TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression

Sen Zeng, Hong Zhou, Zheng Zhu, Yang Liu

Main category: cs.CV

TL;DR: TokenSeg: A boundary-aware sparse token representation framework for efficient 3D medical volume segmentation that reduces computation by 64-68% while achieving state-of-the-art performance.

Details

Motivation: 3D medical image segmentation is computationally demanding due to cubic voxel growth and redundant computation on homogeneous regions. There's a need for efficient methods that maintain accuracy while reducing computational overhead.

Method: Three main components: (1) Multi-scale hierarchical encoder extracts 400 candidate tokens across four resolution levels; (2) Boundary-aware tokenizer combines VQ-VAE quantization with importance scoring to select 100 salient tokens (60% near boundaries); (3) Sparse-to-dense decoder reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections.

Result: Achieves 94.49% Dice and 89.61% IoU on 3D breast DCE-MRI dataset (960 cases) while reducing GPU memory by 64% and inference latency by 68%. Generalizes well to MSD cardiac and brain MRI benchmarks, delivering optimal performance across heterogeneous anatomical structures.

Conclusion: TokenSeg demonstrates the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation, offering significant computational savings without compromising performance.

Abstract: Three-dimensional medical image segmentation is a fundamental yet computationally demanding task due to the cubic growth of voxel processing and the redundant computation on homogeneous regions. To address these limitations, we propose \textbf{TokenSeg}, a boundary-aware sparse token representation framework for efficient 3D medical volume segmentation. Specifically, (1) we design a \emph{multi-scale hierarchical encoder} that extracts 400 candidate tokens across four resolution levels to capture both global anatomical context and fine boundary details; (2) we introduce a \emph{boundary-aware tokenizer} that combines VQ-VAE quantization with importance scoring to select 100 salient tokens, over 60% of which lie near tumor boundaries; and (3) we develop a \emph{sparse-to-dense decoder} that reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections. Extensive experiments on a 3D breast DCE-MRI dataset comprising 960 cases demonstrate that TokenSeg achieves state-of-the-art performance with 94.49% Dice and 89.61% IoU, while reducing GPU memory and inference latency by 64% and 68%, respectively. To verify the generalization capability, our evaluations on MSD cardiac and brain MRI benchmark datasets demonstrate that TokenSeg consistently delivers optimal performance across heterogeneous anatomical structures. These results highlight the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation.

[171] FaceRefiner: High-Fidelity Facial Texture Refinement with Differentiable Rendering-based Style Transfer

Chengyang Li, Baoping Cheng, Yao Cheng, Haocheng Zhang, Renshuai Liu, Yinglin Zheng, Jing Liao, Xuan Cheng

Main category: cs.CV

TL;DR: FaceRefiner is a style transfer-based facial texture refinement method that improves 3D facial texture generation by preserving input image details, structures, and identity through multi-level information transfer with differentiable rendering.

Details

Motivation: Current facial texture generation methods use deep networks to synthesize textures and fill UV maps, but these UV maps come from training data or 2D face generator spaces, limiting generalization for in-the-wild images. This leads to inconsistencies in facial details, structures, and identity with the input.

Method: FaceRefiner treats 3D sampled texture as style and texture generation output as content, using style transfer to transfer photo-realistic style. Unlike existing methods that only transfer high/middle level information, it integrates differentiable rendering to also transfer low-level (pixel-level) information in visible face regions, enabling multi-level information transfer.

Result: Extensive experiments on Multi-PIE, CelebA and FFHQ datasets show that FaceRefiner improves texture quality and face identity preservation compared to state-of-the-art methods.

Conclusion: The proposed style transfer-based refinement method with multi-level information transfer through differentiable rendering effectively preserves input details, structures, and semantics while improving facial texture generation quality.

Abstract: Recent facial texture generation methods prefer to use deep networks to synthesize image content and then fill in the UV map, thus generating a compelling full texture from a single image. Nevertheless, the synthesized texture UV map usually comes from a space constructed by the training data or the 2D face generator, which limits the methods’ generalization ability for in-the-wild input images. Consequently, their facial details, structures and identity may not be consistent with the input. In this paper, we address this issue by proposing a style transfer-based facial texture refinement method named FaceRefiner. FaceRefiner treats the 3D sampled texture as style and the output of a texture generation method as content. The photo-realistic style is then expected to be transferred from the style image to the content image. Different from current style transfer methods that only transfer high and middle level information to the result, our style transfer method integrates differentiable rendering to also transfer low level (or pixel level) information in the visible face regions. The main benefit of such multi-level information transfer is that, the details, structures and semantics in the input can thus be well preserved. The extensive experiments on Multi-PIE, CelebA and FFHQ datasets demonstrate that our refinement method can improve the texture quality and the face identity preserving ability, compared with state-of-the-arts.

[172] All Changes May Have Invariant Principles: Improving Ever-Shifting Harmful Meme Detection via Design Concept Reproduction

Ziyou Jiang, Mingyang Li, Junjie Wang, Yuekai Huang, Jie Huang, Zhiyuan Chang, Zhaoyang Li, Qing Wang

Main category: cs.CV

TL;DR: RepMD is a harmful meme detection method that identifies invariant design principles behind shifting memes using Design Concept Graphs derived from historical data, then guides MLLMs for detection.

Details

Motivation: Harmful memes constantly evolve in type and over time, making them difficult to analyze. However, different memes may share underlying invariant design principles used by malicious users, which could help understand why they're harmful.

Method: 1. Define Design Concept Graph (DCG) based on attack trees to describe steps for designing harmful memes. 2. Derive DCG from historical memes using design step reproduction and graph pruning. 3. Use DCG to guide Multimodal Large Language Models for harmful meme detection.

Result: Achieves 81.1% accuracy (highest), maintains slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows improved efficiency: 15-30 seconds per meme for human discovery.

Conclusion: RepMD successfully addresses ever-shifting harmful memes by capturing invariant design principles through DCG, enabling effective detection even as memes evolve over time and type.

Abstract: Harmful memes are ever-shifting in the Internet communities, which are difficult to analyze due to their type-shifting and temporal-evolving nature. Although these memes are shifting, we find that different memes may share invariant principles, i.e., the underlying design concept of malicious users, which can help us analyze why these memes are harmful. In this paper, we propose RepMD, an ever-shifting harmful meme detection method based on the design concept reproduction. We first refer to the attack tree to define the Design Concept Graph (DCG), which describes steps that people may take to design a harmful meme. Then, we derive the DCG from historical memes with design step reproduction and graph pruning. Finally, we use DCG to guide the Multimodal Large Language Model (MLLM) to detect harmful memes. The evaluation results show that RepMD achieves the highest accuracy with 81.1% and has slight accuracy decreases when generalized to type-shifting and temporal-evolving memes. Human evaluation shows that RepMD can improve the efficiency of human discovery on harmful memes, with 15$\sim$30 seconds per meme.

[173] 3D Conditional Image Synthesis of Left Atrial LGE MRI from Composite Semantic Masks

Yusri Al-Sanaani, Rebecca Thornhill, Sreeraman Rajan

Main category: cs.CV

TL;DR: 3D conditional generative models (Pix2Pix GAN, SPADE-GAN, SPADE-LDM) synthesize LGE MRI from semantic labels to augment scarce training data, improving left atrial segmentation performance.

Details

Motivation: Segmentation of left atrial wall and endocardium from LGE MRI is essential for quantifying atrial fibrosis, but challenging due to limited data availability and complex anatomy. Need to augment scarce training data to improve segmentation accuracy.

Method: Developed pipeline to synthesize 3D LGE MRI volumes from composite semantic label maps combining expert annotations with unsupervised tissue clusters. Evaluated three 3D conditional generators: Pix2Pix GAN, SPADE-GAN, and SPADE-LDM. Used synthetic images to augment training data for downstream 3D U-Net segmentation model.

Result: SPADE-LDM generated most realistic images with FID of 4.063, outperforming Pix2Pix GAN (40.821) and SPADE-GAN (7.652). Augmentation with synthetic images improved LA cavity segmentation Dice score from 0.908 to 0.936 (statistically significant, p < 0.05).

Conclusion: Label-conditioned 3D synthesis effectively enhances segmentation of underrepresented cardiac structures by generating realistic training data, with SPADE-LDM showing superior performance over GAN-based approaches.

Abstract: Segmentation of the left atrial (LA) wall and endocardium from late gadolinium-enhanced (LGE) MRI is essential for quantifying atrial fibrosis in patients with atrial fibrillation. The development of accurate machine learning-based segmentation models remains challenging due to the limited availability of data and the complexity of anatomical structures. In this work, we investigate 3D conditional generative models as potential solution for augmenting scarce LGE training data and improving LA segmentation performance. We develop a pipeline to synthesize high-fidelity 3D LGE MRI volumes from composite semantic label maps combining anatomical expert annotations with unsupervised tissue clusters, using three 3D conditional generators (Pix2Pix GAN, SPADE-GAN, and SPADE-LDM). The synthetic images are evaluated for realism and their impact on downstream LA segmentation. SPADE-LDM generates the most realistic and structurally accurate images, achieving an FID of 4.063 and surpassing GAN models, which have FIDs of 40.821 and 7.652 for Pix2Pix and SPADE-GAN, respectively. When augmented with synthetic LGE images, the Dice score for LA cavity segmentation with a 3D U-Net model improved from 0.908 to 0.936, showing a statistically significant improvement (p < 0.05) over the baseline.These findings demonstrate the potential of label-conditioned 3D synthesis to enhance the segmentation of under-represented cardiac structures.

[174] MiLDEdit: Reasoning-Based Multi-Layer Design Document Editing

Zihao Lin, Wanrong Zhu, Jiuxiang Gu, Jihyung Kil, Christopher Tensmeyer, Lin Zhang, Shilong Liu, Ruiyi Zhang, Lifu Huang, Vlad I. Morariu, Tong Sun

Main category: cs.CV

TL;DR: MiLDEAgent is a reasoning-based framework for multi-layer document editing that combines RL-trained multimodal reasoning with targeted image editing, outperforming existing approaches on the new MiLDEBench benchmark.

Details

Motivation: Real-world design documents (posters, etc.) are multi-layered with decoration, text, and images, but existing approaches focus on single-layer image editing or multi-layer generation, lacking the layer-aware reasoning needed for precise editing from natural language instructions.

Method: MiLDEAgent combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. The framework is evaluated on MiLDEBench (20K+ design documents with editing instructions) using MiLDEEval protocol across four dimensions: instruction following, layout consistency, aesthetics, and text rendering.

Result: Experiments on 14 open-source and 2 closed-source models show existing approaches fail to generalize: open-source models can’t complete multi-layer editing tasks, while closed-source models suffer from format violations. MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming open-source baselines and matching closed-source model performance.

Conclusion: MiLDEAgent establishes the first strong baseline for multi-layer document editing, demonstrating effective layer-aware reasoning and precise modifications that address the limitations of prior approaches in this challenging real-world task.

Abstract: Real-world design documents (e.g., posters) are inherently multi-layered, combining decoration, text, and images. Editing them from natural-language instructions requires fine-grained, layer-aware reasoning to identify relevant layers and coordinate modifications. Prior work largely overlooks multi-layer design document editing, focusing instead on single-layer image editing or multi-layer generation, which assume a flat canvas and lack the reasoning needed to determine what and where to modify. To address this gap, we introduce the Multi-Layer Document Editing Agent (MiLDEAgent), a reasoning-based framework that combines an RL-trained multimodal reasoner for layer-wise understanding with an image editor for targeted modifications. To systematically benchmark this setting, we introduce the MiLDEBench, a human-in-the-loop corpus of over 20K design documents paired with diverse editing instructions. The benchmark is complemented by a task-specific evaluation protocol, MiLDEEval, which spans four dimensions including instruction following, layout consistency, aesthetics, and text rendering. Extensive experiments on 14 open-source and 2 closed-source models reveal that existing approaches fail to generalize: open-source models often cannot complete multi-layer document editing tasks, while closed-source models suffer from format violations. In contrast, MiLDEAgent achieves strong layer-aware reasoning and precise editing, significantly outperforming all open-source baselines and attaining performance comparable to closed-source models, thereby establishing the first strong baseline for multi-layer document editing.

[175] Detection of Deployment Operational Deviations for Safety and Security of AI-Enabled Human-Centric Cyber Physical Systems

Bernard Ngabonziza, Ayan Banerjee, Sandeep K. S. Gupta

Main category: cs.CV

TL;DR: Paper proposes a framework for evaluating safety/security strategies in AI-enabled human-centric cyber-physical systems, with a case study on meal detection for diabetes control.

Details

Motivation: AI-enabled human-centric systems (medical monitoring, autonomous cars) face operational uncertainties when interacting with humans, potentially violating safety/security requirements.

Method: 1) Discuss operational deviations leading to unknown conditions; 2) Create framework to evaluate safety/security strategies; 3) Demonstrate with personalized image-based technique for detecting unannounced meals in diabetes control.

Result: Proposed framework for evaluating safety strategies and demonstrated a novel image-based technique for detecting non-announcement of meals in closed-loop blood glucose control systems.

Conclusion: Need systematic approaches to handle operational uncertainties in AI-enabled human-centric systems, with proposed framework and case study showing practical application for safety assurance.

Abstract: In recent years, Human-centric cyber-physical systems have increasingly involved artificial intelligence to enable knowledge extraction from sensor-collected data. Examples include medical monitoring and control systems, as well as autonomous cars. Such systems are intended to operate according to the protocols and guidelines for regular system operations. However, in many scenarios, such as closed-loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoring systems for stroke diagnosis. The operations of such AI-enabled human-centric applications can expose them to cases for which their operational mode may be uncertain, for instance, resulting from the interactions with a human with the system. Such cases, in which the system is in uncertain conditions, can violate the system’s safety and security requirements. This paper will discuss operational deviations that can lead these systems to operate in unknown conditions. We will then create a framework to evaluate different strategies for ensuring the safety and security of AI-enabled human-centric cyber-physical systems in operation deployment. Then, as an example, we show a personalized image-based novel technique for detecting the non-announcement of meals in closed-loop blood glucose control for Type 1 diabetics.

[176] HUR-MACL: High-Uncertainty Region-Guided Multi-Architecture Collaborative Learning for Head and Neck Multi-Organ Segmentation

Xiaoyu Liu, Siwen Wei, Linhao Qu, Mingyuan Pan, Chengsheng Zhang, Yonghong Shi, Zhijian Song

Main category: cs.CV

TL;DR: HUR-MACL model improves head & neck organ segmentation by adaptively identifying high-uncertainty regions and using Vision Mamba + Deformable CNN collaboration with feature distillation.

Details

Motivation: Deep learning models struggle with small, complexly shaped organs in head & neck segmentation. Existing hybrid architectures simply concatenate features without exploiting component strengths, leading to functional overlap and limited accuracy.

Method: Proposes HUR-MACL: 1) CNN adaptively identifies high uncertainty regions, 2) Vision Mamba and Deformable CNN jointly improve segmentation in these regions, 3) Heterogeneous feature distillation loss promotes collaborative learning between architectures.

Result: Achieves state-of-the-art results on two public datasets and one private dataset for multi-organ segmentation in head and neck.

Conclusion: The proposed high uncertainty region-guided multi-architecture collaborative learning effectively addresses segmentation challenges for small, complex organs by leveraging complementary strengths of different architectures.

Abstract: Accurate segmentation of organs at risk in the head and neck is essential for radiation therapy, yet deep learning models often fail on small, complexly shaped organs. While hybrid architectures that combine different models show promise, they typically just concatenate features without exploiting the unique strengths of each component. This results in functional overlap and limited segmentation accuracy. To address these issues, we propose a high uncertainty region-guided multi-architecture collaborative learning (HUR-MACL) model for multi-organ segmentation in the head and neck. This model adaptively identifies high uncertainty regions using a convolutional neural network, and for these regions, Vision Mamba as well as Deformable CNN are utilized to jointly improve their segmentation accuracy. Additionally, a heterogeneous feature distillation loss was proposed to promote collaborative learning between the two architectures in high uncertainty regions to further enhance performance. Our method achieves SOTA results on two public datasets and one private dataset.

[177] HyperAlign: Hyperbolic Entailment Cones for Adaptive Text-to-Image Alignment Assessment

Wenzhi Chen, Bo Hu, Leida Li, Lihuo He, Wen Lu, Xinbo Gao

Main category: cs.CV

TL;DR: HyperAlign: A hyperbolic geometry-based framework for adaptive assessment of text-to-image alignment, outperforming existing Euclidean methods through dynamic entailment modeling and sample-level calibration.

Details

Motivation: Existing text-to-image alignment assessment methods rely on Euclidean space metrics, which neglect the structured nature of semantic alignment and lack adaptive capabilities for different samples. There's a need for a more sophisticated approach that can better capture semantic relationships and adapt to sample variations.

Method: 1) Extract Euclidean features using CLIP and map them to hyperbolic space; 2) Design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision; 3) Propose an adaptive modulation regressor that uses hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict final alignment scores.

Result: HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, demonstrating superior effectiveness compared to existing methods.

Conclusion: The hyperbolic geometric modeling approach effectively addresses the limitations of Euclidean methods for text-to-image alignment assessment, providing better semantic structure capture and adaptive capabilities that lead to improved performance and generalization.

Abstract: With the rapid development of text-to-image generation technology, accurately assessing the alignment between generated images and text prompts has become a critical challenge. Existing methods rely on Euclidean space metrics, neglecting the structured nature of semantic alignment, while lacking adaptive capabilities for different samples. To address these limitations, we propose HyperAlign, an adaptive text-to-image alignment assessment framework based on hyperbolic entailment geometry. First, we extract Euclidean features using CLIP and map them to hyperbolic space. Second, we design a dynamic-supervision entailment modeling mechanism that transforms discrete entailment logic into continuous geometric structure supervision. Finally, we propose an adaptive modulation regressor that utilizes hyperbolic geometric features to generate sample-level modulation parameters, adaptively calibrating Euclidean cosine similarity to predict the final score. HyperAlign achieves highly competitive performance on both single database evaluation and cross-database generalization tasks, fully validating the effectiveness of hyperbolic geometric modeling for image-text alignment assessment.

[178] Agri-R1: Empowering Generalizable Agricultural Reasoning in Vision-Language Models with Reinforcement Learning

Wentao Zhang, Lifei Wang, Lina Lu, MingKun Xu, Shangyang Li, Yanchao Yang, Tao Fang

Main category: cs.CV

TL;DR: Agri-R1 is a 3B-parameter reasoning-enhanced vision-language model for agricultural disease diagnosis that uses automated reasoning data generation and Group Relative Policy Optimization to achieve competitive performance with larger models while requiring only 19% of training samples.

Details

Motivation: Agricultural disease diagnosis faces challenges with VLMs: conventional fine-tuning needs extensive labels, lacks interpretability, and generalizes poorly. Existing reasoning methods rely on costly expert annotations and don't handle the open-ended, diverse nature of agricultural queries.

Method: 1) Automated high-quality reasoning data generation via vision-language synthesis and LLM-based filtering (using only 19% of available samples). 2) Training with Group Relative Policy Optimization (GRPO) with novel reward function integrating domain-specific lexicons and fuzzy matching to assess correctness and linguistic flexibility in open-ended responses.

Result: The 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines on CDDMBench: +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and +26.10-point improvement in cross-domain generalization over standard fine-tuning.

Conclusion: The synergy between structured reasoning data and GRPO-driven exploration underpins performance gains, with benefits scaling as question complexity increases. The approach addresses agricultural VLM limitations while requiring minimal labeled data.

Abstract: Agricultural disease diagnosis challenges VLMs, as conventional fine-tuning requires extensive labels, lacks interpretability, and generalizes poorly. While reasoning improves model robustness, existing methods rely on costly expert annotations and rarely address the open-ended, diverse nature of agricultural queries. To address these limitations, we propose \textbf{Agri-R1}, a reasoning-enhanced large model for agriculture. Our framework automates high-quality reasoning data generation via vision-language synthesis and LLM-based filtering, using only 19% of available samples. Training employs Group Relative Policy Optimization (GRPO) with a novel proposed reward function that integrates domain-specific lexicons and fuzzy matching to assess both correctness and linguistic flexibility in open-ended responses. Evaluated on CDDMBench, our resulting 3B-parameter model achieves performance competitive with 7B- to 13B-parameter baselines, showing a +23.2% relative gain in disease recognition accuracy, +33.3% in agricultural knowledge QA, and a +26.10-point improvement in cross-domain generalization over standard fine-tuning. Ablation studies confirm that the synergy between structured reasoning data and GRPO-driven exploration underpins these gains, with benefits scaling as question complexity increases.

[179] DB-MSMUNet:Dual Branch Multi-scale Mamba UNet for Pancreatic CT Scans Segmentation

Qiu Guan, Zhiqiang Yang, Dezhang Ye, Yang Chen, Xinli Xu, Ying Tang

Main category: cs.CV

TL;DR: DB-MSMUNet: A dual-branch multi-scale Mamba UNet for robust pancreatic segmentation in CT scans, achieving state-of-the-art performance across multiple datasets.

Details

Motivation: Pancreatic segmentation in CT scans is challenging due to low tissue contrast, blurry boundaries, irregular organ shapes, and small lesion sizes, which hinder accurate diagnosis and treatment of pancreatic cancer.

Method: Proposes DB-MSMUNet with encoder using Multi-scale Mamba Module (MSMM) combining deformable convolutions and multi-scale state space modeling. Features dual decoders: edge decoder with Edge Enhancement Path for boundary refinement, and area decoder with Multi-layer Decoder for fine detail preservation. Includes Auxiliary Deep Supervision heads at multiple scales.

Result: Achieves Dice Similarity Coefficients of 89.47% (NIH Pancreas), 87.59% (MSD), and 89.02% (clinical tumor dataset), outperforming most existing state-of-the-art methods in segmentation accuracy, edge preservation, and robustness.

Conclusion: DB-MSMUNet demonstrates effectiveness and generalizability for real-world pancreatic CT segmentation tasks, addressing key challenges through its innovative architecture design.

Abstract: Accurate segmentation of the pancreas and its lesions in CT scans is crucial for the precise diagnosis and treatment of pancreatic cancer. However, it remains a highly challenging task due to several factors such as low tissue contrast with surrounding organs, blurry anatomical boundaries, irregular organ shapes, and the small size of lesions. To tackle these issues, we propose DB-MSMUNet (Dual-Branch Multi-scale Mamba UNet), a novel encoder-decoder architecture designed specifically for robust pancreatic segmentation. The encoder is constructed using a Multi-scale Mamba Module (MSMM), which combines deformable convolutions and multi-scale state space modeling to enhance both global context modeling and local deformation adaptation. The network employs a dual-decoder design: the edge decoder introduces an Edge Enhancement Path (EEP) to explicitly capture boundary cues and refine fuzzy contours, while the area decoder incorporates a Multi-layer Decoder (MLD) to preserve fine-grained details and accurately reconstruct small lesions by leveraging multi-scale deep semantic features. Furthermore, Auxiliary Deep Supervision (ADS) heads are added at multiple scales to both decoders, providing more accurate gradient feedback and further enhancing the discriminative capability of multi-scale features. We conduct extensive experiments on three datasets: the NIH Pancreas dataset, the MSD dataset, and a clinical pancreatic tumor dataset provided by collaborating hospitals. DB-MSMUNet achieves Dice Similarity Coefficients of 89.47%, 87.59%, and 89.02%, respectively, outperforming most existing state-of-the-art methods in terms of segmentation accuracy, edge preservation, and robustness across different datasets. These results demonstrate the effectiveness and generalizability of the proposed method for real-world pancreatic CT segmentation tasks.

[180] HATIR: Heat-Aware Diffusion for Turbulent Infrared Video Super-Resolution

Yang Zou, Xingyue Zhu, Kaiqi Han, Jun Ma, Xingyuan Li, Zhiying Jiang, Jinyuan Liu

Main category: cs.CV

TL;DR: HATIR is a diffusion-based method for joint turbulence mitigation and super-resolution of infrared videos, using heat-aware deformation priors and phasor-guided flow estimation.

Details

Motivation: Infrared videos suffer from severe atmospheric turbulence and compression degradation, but existing methods either ignore the modality gap between infrared/visible images or fail to restore turbulence-induced distortions. Cascading turbulence mitigation with VSR leads to error propagation due to decoupled degradation modeling.

Method: HATIR injects heat-aware deformation priors into diffusion sampling to jointly model inverse processes of turbulent degradation and detail loss. Uses Phasor-Guided Flow Estimator (based on thermal phasor consistency) for turbulence-aware flow guidance, and Turbulence-Aware Decoder with turbulence gating and structure-aware attention for stable feature aggregation.

Result: Authors built FLIR-IVSR, the first dataset for turbulent infrared VSR with 640 diverse scenes from FLIR T1050sc camera (1024×768). The method enables joint turbulence mitigation and super-resolution.

Conclusion: HATIR provides an effective diffusion-based framework for joint turbulence mitigation and super-resolution of infrared videos, with a new dataset to encourage future infrared VSR research.

Abstract: Infrared video has been of great interest in visual tasks under challenging environments, but often suffers from severe atmospheric turbulence and compression degradation. Existing video super-resolution (VSR) methods either neglect the inherent modality gap between infrared and visible images or fail to restore turbulence-induced distortions. Directly cascading turbulence mitigation (TM) algorithms with VSR methods leads to error propagation and accumulation due to the decoupled modeling of degradation between turbulence and resolution. We introduce HATIR, a Heat-Aware Diffusion for Turbulent InfraRed Video Super-Resolution, which injects heat-aware deformation priors into the diffusion sampling path to jointly model the inverse process of turbulent degradation and structural detail loss. Specifically, HATIR constructs a Phasor-Guided Flow Estimator, rooted in the physical principle that thermally active regions exhibit consistent phasor responses over time, enabling reliable turbulence-aware flow to guide the reverse diffusion process. To ensure the fidelity of structural recovery under nonuniform distortions, a Turbulence-Aware Decoder is proposed to selectively suppress unstable temporal cues and enhance edge-aware feature aggregation via turbulence gating and structure-aware attention. We built FLIR-IVSR, the first dataset for turbulent infrared VSR, comprising paired LR-HR sequences from a FLIR T1050sc camera (1024 X 768) spanning 640 diverse scenes with varying camera and object motion conditions. This encourages future research in infrared VSR. Project page: https://github.com/JZ0606/HATIR

[181] WebCryptoAgent: Agentic Crypto Trading with Web Informatics

Ali Kurban, Wei Luo, Liangyu Zuo, Zeyu Zhang, Renda Han, Zhaolu Kang, Hao Tang

Main category: cs.CV

TL;DR: WebCryptoAgent: An agentic trading framework that decomposes web-informed cryptocurrency trading into modality-specific agents with a decoupled control architecture separating strategic reasoning from real-time risk management.

Details

Motivation: Cryptocurrency trading requires timely integration of heterogeneous web information and market microstructure signals for short-horizon decisions under extreme volatility. Existing systems struggle with: 1) jointly reasoning over noisy multi-source web evidence while maintaining robustness to rapid price shocks, and 2) risk control as slow reasoning pipelines are ill-suited for handling abrupt market shocks requiring immediate defensive responses.

Method: Proposes WebCryptoAgent with two key innovations: 1) decomposes web-informed decision making into modality-specific agents and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning, and 2) introduces a decoupled control architecture that separates strategic hourly reasoning from a real-time second-level risk model for fast shock detection and protective intervention independent of the trading loop.

Result: Extensive experiments on real-world cryptocurrency markets demonstrate that WebCryptoAgent improves trading stability, reduces spurious activity, and enhances tail-risk handling compared to existing baselines.

Conclusion: WebCryptoAgent addresses the dual challenges of synthesizing noisy multi-source web evidence and maintaining robustness to rapid price shocks through its agentic framework and decoupled control architecture, resulting in more stable and risk-aware cryptocurrency trading systems.

Abstract: Cryptocurrency trading increasingly depends on timely integration of heterogeneous web information and market microstructure signals to support short-horizon decision making under extreme volatility. However, existing trading systems struggle to jointly reason over noisy multi-source web evidence while maintaining robustness to rapid price shocks at sub-second timescales. The first challenge lies in synthesizing unstructured web content, social sentiment, and structured OHLCV signals into coherent and interpretable trading decisions without amplifying spurious correlations, while the second challenge concerns risk control, as slow deliberative reasoning pipelines are ill-suited for handling abrupt market shocks that require immediate defensive responses. To address these challenges, we propose WebCryptoAgent, an agentic trading framework that decomposes web-informed decision making into modality-specific agents and consolidates their outputs into a unified evidence document for confidence-calibrated reasoning. We further introduce a decoupled control architecture that separates strategic hourly reasoning from a real-time second-level risk model, enabling fast shock detection and protective intervention independent of the trading loop. Extensive experiments on real-world cryptocurrency markets demonstrate that WebCryptoAgent improves trading stability, reduces spurious activity, and enhances tail-risk handling compared to existing baselines. Code will be available at https://github.com/AIGeeksGroup/WebCryptoAgent.

[182] Forge-and-Quench: Enhancing Image Generation for Higher Fidelity in Unified Multimodal Models

Yanbing Zeng, Jia Wang, Hanghang Ma, Junqiang Wu, Jie Zhu, Xiaoming Wei, Jie Hu

Main category: cs.CV

TL;DR: Forge-and-Quench is a unified framework that leverages multimodal understanding models to enhance image generation fidelity and detail richness by creating a Bridge Feature that transfers insights from understanding to generation.

Details

Motivation: While integrating image generation and understanding is important, previous works haven't fully explored how understanding can effectively assist generation, particularly in enhancing image fidelity and detail richness rather than just leveraging reasoning abilities.

Method: Proposes Forge-and-Quench framework where: 1) MLLM reasons over conversational context to produce enhanced text instruction, 2) Bridge Adapter maps this to a virtual visual representation called Bridge Feature, 3) This feature is injected into T2I backbone as visual guidance alongside enhanced text instruction.

Result: Framework shows exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant training overhead savings. Experiments show significant improvements in image fidelity and detail across multiple models while maintaining instruction-following accuracy and enhancing world knowledge application.

Conclusion: Forge-and-Quench successfully demonstrates how understanding models can enhance image generation fidelity and detail through a novel bridging mechanism, offering a flexible and extensible framework that preserves MLLM capabilities while improving T2I performance.

Abstract: Integrating image generation and understanding into a single framework has become a pivotal goal in the multimodal domain. However, how understanding can effectively assist generation has not been fully explored. Unlike previous works that focus on leveraging reasoning abilities and world knowledge from understanding models, this paper introduces a novel perspective: leveraging understanding to enhance the fidelity and detail richness of generated images. To this end, we propose Forge-and-Quench, a new unified framework that puts this principle into practice. In the generation process of our framework, an MLLM first reasons over the entire conversational context, including text instructions, to produce an enhanced text instruction. This refined instruction is then mapped to a virtual visual representation, termed the Bridge Feature, via a novel Bridge Adapter. This feature acts as a crucial link, forging insights from the understanding model to quench and refine the generation process. It is subsequently injected into the T2I backbone as a visual guidance signal, alongside the enhanced text instruction that replaces the original input. To validate this paradigm, we conduct comprehensive studies on the design of the Bridge Feature and Bridge Adapter. Our framework demonstrates exceptional extensibility and flexibility, enabling efficient migration across different MLLM and T2I models with significant savings in training overhead, all without compromising the MLLM’s inherent multimodal understanding capabilities. Experiments show that Forge-and-Quench significantly improves image fidelity and detail across multiple models, while also maintaining instruction-following accuracy and enhancing world knowledge application. Models and codes are available at https://github.com/YanbingZeng/Forge-and-Quench.

[183] On the Holistic Approach for Detecting Human Image Forgery

Xiao Guo, Jie Zhu, Anil Jain, Xiaoming Liu

Main category: cs.CV

TL;DR: HuForDet is a holistic human image forgery detection framework with dual-branch architecture for both facial and full-body manipulations, achieving state-of-the-art performance across diverse forgery types.

Details

Motivation: Existing deepfake detection methods are fragmented, specializing either in facial forgeries or full-body synthetic images, failing to generalize across the full spectrum of human image manipulations as AIGC threats escalate.

Method: Dual-branch architecture: (1) Face forgery detection branch with heterogeneous experts in RGB and frequency domains, including adaptive Laplacian-of-Gaussian module; (2) Contextualized forgery detection branch using Multi-Modal Large Language Model for full-body semantic consistency analysis with confidence estimation for dynamic feature fusion weighting.

Result: HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries, validated through extensive experiments on the curated HuFor dataset.

Conclusion: The proposed holistic framework effectively addresses the fragmentation in existing detection methods by combining specialized face analysis with full-body semantic consistency checking, demonstrating comprehensive detection capabilities for the evolving AIGC threat landscape.

Abstract: The rapid advancement of AI-generated content (AIGC) has escalated the threat of deepfakes, from facial manipulations to the synthesis of entire photorealistic human bodies. However, existing detection methods remain fragmented, specializing either in facial-region forgeries or full-body synthetic images, and consequently fail to generalize across the full spectrum of human image manipulations. We introduce HuForDet, a holistic framework for human image forgery detection, which features a dual-branch architecture comprising: (1) a face forgery detection branch that employs heterogeneous experts operating in both RGB and frequency domains, including an adaptive Laplacian-of-Gaussian (LoG) module designed to capture artifacts ranging from fine-grained blending boundaries to coarse-scale texture irregularities; and (2) a contextualized forgery detection branch that leverages a Multi-Modal Large Language Model (MLLM) to analyze full-body semantic consistency, enhanced with a confidence estimation mechanism that dynamically weights its contribution during feature fusion. We curate a human image forgery (HuFor) dataset that unifies existing face forgery data with a new corpus of full-body synthetic humans. Extensive experiments show that our HuForDet achieves state-of-the-art forgery detection performance and superior robustness across diverse human image forgeries.

[184] Training a Custom CNN on Five Heterogeneous Image Datasets

Anika Tabassum, Tasnuva Mahazabin Tuba, Nafisa Naznin

Main category: cs.CV

TL;DR: This paper evaluates CNN architectures across five diverse visual classification tasks, comparing custom lightweight CNNs with established models like ResNet-18 and VGG-16, analyzing when transfer learning provides advantages in data-constrained environments.

Details

Motivation: To investigate the effectiveness of CNN-based architectures across heterogeneous real-world visual classification tasks, particularly in resource-limited environments with varying challenges like illumination differences, resolution variations, environmental complexity, and class imbalance.

Method: Systematic evaluation of a lightweight custom CNN alongside established deep architectures (ResNet-18, VGG-16) across five diverse datasets spanning agricultural and urban domains. Models were trained both from scratch and using transfer learning, with systematic preprocessing and augmentation techniques.

Result: The custom CNN achieved competitive performance across multiple application domains, and the comparative analysis revealed when transfer learning and deep architectures provide substantial advantages, particularly in data-constrained environments.

Conclusion: The findings offer practical insights for deploying deep learning models in resource-limited yet high-impact real-world visual classification tasks, demonstrating that lightweight custom CNNs can be effective while transfer learning provides advantages in data-scarce scenarios.

Abstract: Deep learning has transformed visual data analysis, with Convolutional Neural Networks (CNNs) becoming highly effective in learning meaningful feature representations directly from images. Unlike traditional manual feature engineering methods, CNNs automatically extract hierarchical visual patterns, enabling strong performance across diverse real-world contexts. This study investigates the effectiveness of CNN-based architectures across five heterogeneous datasets spanning agricultural and urban domains: mango variety classification, paddy variety identification, road surface condition assessment, auto-rickshaw detection, and footpath encroachment monitoring. These datasets introduce varying challenges, including differences in illumination, resolution, environmental complexity, and class imbalance, necessitating adaptable and robust learning models. We evaluate a lightweight, task-specific custom CNN alongside established deep architectures, including ResNet-18 and VGG-16, trained both from scratch and using transfer learning. Through systematic preprocessing, augmentation, and controlled experimentation, we analyze how architectural complexity, model depth, and pre-training influence convergence, generalization, and performance across datasets of differing scale and difficulty. The key contributions of this work are: (1) the development of an efficient custom CNN that achieves competitive performance across multiple application domains, and (2) a comprehensive comparative analysis highlighting when transfer learning and deep architectures provide substantial advantages, particularly in data-constrained environments. These findings offer practical insights for deploying deep learning models in resource-limited yet high-impact real-world visual classification tasks.

[185] AIVD: Adaptive Edge-Cloud Collaboration for Accurate and Efficient Industrial Visual Detection

Yunqing Hu, Zheming Yang, Chang Zhao, Qi Guo, Meng Gao, Pengcheng Li, Wen Ji

Main category: cs.CV

TL;DR: AIVD framework enables precise object localization and semantic generation by combining lightweight edge detectors with cloud MLLMs, featuring noise-robust fine-tuning and resource-aware scheduling for edge-cloud deployment.

Details

Motivation: MLLMs have strong semantic understanding but struggle with precise object localization and face deployment challenges on resource-constrained edge devices. There's a need to bridge the gap between edge detection capabilities and cloud-based MLLM reasoning while maintaining efficiency.

Method: Proposes AIVD framework with: 1) Collaboration between lightweight edge detectors and cloud MLLMs, 2) Visual-semantic collaborative augmentation fine-tuning to handle edge cropped-box noise, 3) Heterogeneous resource-aware dynamic scheduling algorithm for edge devices and network conditions.

Result: AIVD reduces resource consumption while improving MLLM classification accuracy and semantic generation quality. The scheduling strategy achieves higher throughput and lower latency across diverse scenarios.

Conclusion: The AIVD framework successfully addresses MLLM limitations in edge-cloud deployment by enabling precise localization and high-quality semantic generation through collaborative edge-cloud architecture, robust fine-tuning, and efficient resource scheduling.

Abstract: Multimodal large language models (MLLMs) demonstrate exceptional capabilities in semantic understanding and visual reasoning, yet they still face challenges in precise object localization and resource-constrained edge-cloud deployment. To address this, this paper proposes the AIVD framework, which achieves unified precise localization and high-quality semantic generation through the collaboration between lightweight edge detectors and cloud-based MLLMs. To enhance the cloud MLLM’s robustness against edge cropped-box noise and scenario variations, we design an efficient fine-tuning strategy with visual-semantic collaborative augmentation, significantly improving classification accuracy and semantic consistency. Furthermore, to maintain high throughput and low latency across heterogeneous edge devices and dynamic network conditions, we propose a heterogeneous resource-aware dynamic scheduling algorithm. Experimental results demonstrate that AIVD substantially reduces resource consumption while improving MLLM classification performance and semantic generation quality. The proposed scheduling strategy also achieves higher throughput and lower latency across diverse scenarios.

[186] Skeletonization-Based Adversarial Perturbations on Large Vision Language Model’s Mathematical Text Recognition

Masatomo Yoshida, Haruto Namura, Nicola Adami, Masahiro Okuda

Main category: cs.CV

TL;DR: Novel adversarial attack using skeletonization to target foundation models’ visual capabilities, especially on mathematical formula images, with evaluation of character/semantic changes and demonstration on ChatGPT.

Details

Motivation: To explore the visual capabilities and limitations of foundation models by developing an effective adversarial attack method, particularly focusing on challenging cases like mathematical formulas with LaTeX conversion and complex structures.

Method: Introduces a novel adversarial attack method utilizing skeletonization to reduce search space effectively. Targets images containing text, especially mathematical formulas, and evaluates both character and semantic changes between original and perturbed outputs.

Result: The method effectively demonstrates vulnerabilities in foundation models’ visual interpretation. Application to ChatGPT shows practical implications in real-world scenarios, revealing limitations in models’ reasoning about complex visual structures.

Conclusion: Skeletonization-based adversarial attacks provide insights into foundation models’ visual interpretation capabilities, highlighting vulnerabilities particularly in handling complex structured content like mathematical formulas, with important implications for real-world applications.

Abstract: This work explores the visual capabilities and limitations of foundation models by introducing a novel adversarial attack method utilizing skeletonization to reduce the search space effectively. Our approach specifically targets images containing text, particularly mathematical formula images, which are more challenging due to their LaTeX conversion and intricate structure. We conduct a detailed evaluation of both character and semantic changes between original and adversarially perturbed outputs to provide insights into the models’ visual interpretation and reasoning abilities. The effectiveness of our method is further demonstrated through its application to ChatGPT, which shows its practical implications in real-world scenarios.

[187] ProFuse: Efficient Cross-View Context Fusion for Open-Vocabulary 3D Gaussian Splatting

Yen-Jen Chiou, Wei-Tse Cheng, Yuan-Fu Yang

Main category: cs.CV

TL;DR: ProFuse is an efficient context-aware framework for open-vocabulary 3D scene understanding using 3D Gaussian Splatting that achieves semantic attachment in about 5 minutes per scene (2x faster than SOTA) without render-supervised fine-tuning.

Details

Motivation: Current open-vocabulary 3D scene understanding methods often suffer from cross-view inconsistency and intra-mask cohesion issues, while requiring significant computational overhead and fine-tuning. There's a need for more efficient approaches that maintain language coherence across views without sacrificing geometric accuracy.

Method: ProFuse introduces a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, which is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. The approach requires no additional optimization beyond standard reconstruction and retains geometric refinement without densification.

Result: ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than state-of-the-art methods. The framework enhances cross-view consistency and intra-mask cohesion within a direct registration setup with minimal overhead.

Conclusion: ProFuse presents an efficient and effective framework for open-vocabulary 3D scene understanding that addresses key challenges in cross-view consistency and computational efficiency, enabling rapid semantic attachment without compromising geometric accuracy or requiring extensive fine-tuning.

Abstract: We present ProFuse, an efficient context-aware framework for open-vocabulary 3D scene understanding with 3D Gaussian Splatting (3DGS). The pipeline enhances cross-view consistency and intra-mask cohesion within a direct registration setup, adding minimal overhead and requiring no render-supervised fine-tuning. Instead of relying on a pretrained 3DGS scene, we introduce a dense correspondence-guided pre-registration phase that initializes Gaussians with accurate geometry while jointly constructing 3D Context Proposals via cross-view clustering. Each proposal carries a global feature obtained through weighted aggregation of member embeddings, and this feature is fused onto Gaussians during direct registration to maintain per-primitive language coherence across views. With associations established in advance, semantic fusion requires no additional optimization beyond standard reconstruction, and the model retains geometric refinement without densification. ProFuse achieves strong open-vocabulary 3DGS understanding while completing semantic attachment in about five minutes per scene, which is two times faster than SOTA.

[188] Segmentation-Driven Monocular Shape from Polarization based on Physical Model

Jinyu Zhang, Xu Ma, Weili Chen, Gonzalo R. Arce

Main category: cs.CV

TL;DR: A new monocular shape-from-polarization method uses segmentation to break global reconstruction into local convex regions, solving azimuth ambiguity and improving 3D reconstruction accuracy.

Details

Motivation: Existing monocular shape-from-polarization methods suffer from azimuth angle ambiguity, which severely compromises reconstruction accuracy and stability. This inherent limitation of polarization analysis needs to be addressed.

Method: Proposes a segmentation-driven monocular SfP framework that reformulates global shape recovery into local reconstructions over adaptively segmented convex sub-regions. Uses polarization-aided adaptive region growing (PARG) segmentation to decompose global convexity into locally convex regions, and multi-scale fusion convexity prior (MFCP) constraint for local surface consistency.

Result: Extensive experiments on synthetic and real-world datasets show significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.

Conclusion: The segmentation-driven approach effectively suppresses azimuth ambiguities and preserves surface continuity, providing a more accurate and stable monocular shape-from-polarization method for 3D reconstruction.

Abstract: Monocular shape-from-polarization (SfP) leverages the intrinsic relationship between light polarization properties and surface geometry to recover surface normals from single-view polarized images, providing a compact and robust approach for three-dimensional (3D) reconstruction. Despite its potential, existing monocular SfP methods suffer from azimuth angle ambiguity, an inherent limitation of polarization analysis, that severely compromises reconstruction accuracy and stability. This paper introduces a novel segmentation-driven monocular SfP (SMSfP) framework that reformulates global shape recovery into a set of local reconstructions over adaptively segmented convex sub-regions. Specifically, a polarization-aided adaptive region growing (PARG) segmentation strategy is proposed to decompose the global convexity assumption into locally convex regions, effectively suppressing azimuth ambiguities and preserving surface continuity. Furthermore, a multi-scale fusion convexity prior (MFCP) constraint is developed to ensure local surface consistency and enhance the recovery of fine textural and structural details. Extensive experiments on both synthetic and real-world datasets validate the proposed approach, showing significant improvements in disambiguation accuracy and geometric fidelity compared with existing physics-based monocular SfP techniques.

[189] GeM-VG: Towards Generalized Multi-image Visual Grounding with Multimodal Large Language Models

Shurong Zheng, Yousong Zhu, Hongyin Zhao, Fan Yang, Yufei Zhan, Ming Tang, Jinqiao Wang

Main category: cs.CV

TL;DR: GeM-VG is a Multimodal LLM for Generalized Multi-image Visual Grounding that outperforms previous models on multi-image tasks while maintaining strong single-image grounding and general multi-image understanding capabilities.

Details

Motivation: Existing multi-image grounding methods are limited to single-target localization and few practical tasks due to lack of unified modeling for generalized grounding tasks. There's a need for a model that can handle diverse multi-image grounding scenarios with robust cross-image reasoning.

Method: 1) Systematically categorize multi-image grounding tasks by cross-image cue reliance; 2) Introduce MG-Data-240K dataset to address target quantity and image relation limitations; 3) Propose hybrid reinforcement finetuning strategy combining chain-of-thought reasoning and direct answering with R1-like algorithm guided by rule-based rewards.

Result: Outperforms previous leading MLLMs by 2.0% on MIG-Bench and 9.7% on MC-Bench for multi-image grounding. Achieves 9.1% improvement over base model on ODINW for single-image grounding. Maintains strong general multi-image understanding capabilities.

Conclusion: GeM-VG demonstrates superior generalized grounding capabilities across diverse multi-image tasks through systematic task categorization, comprehensive dataset creation, and hybrid reinforcement finetuning strategy that enhances both perception and reasoning.

Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive progress in single-image grounding and general multi-image understanding. Recently, some methods begin to address multi-image grounding. However, they are constrained by single-target localization and limited types of practical tasks, due to the lack of unified modeling for generalized grounding tasks. Therefore, we propose GeM-VG, an MLLM capable of Generalized Multi-image Visual Grounding. To support this, we systematically categorize and organize existing multi-image grounding tasks according to their reliance of cross-image cues and reasoning, and introduce the MG-Data-240K dataset, addressing the limitations of existing datasets regarding target quantity and image relation. To tackle the challenges of robustly handling diverse multi-image grounding tasks, we further propose a hybrid reinforcement finetuning strategy that integrates chain-of-thought (CoT) reasoning and direct answering, considering their complementary strengths. This strategy adopts an R1-like algorithm guided by a carefully designed rule-based reward, effectively enhancing the model’s overall perception and reasoning capabilities. Extensive experiments demonstrate the superior generalized grounding capabilities of our model. For multi-image grounding, it outperforms the previous leading MLLMs by 2.0% and 9.7% on MIG-Bench and MC-Bench, respectively. In single-image grounding, it achieves a 9.1% improvement over the base model on ODINW. Furthermore, our model retains strong capabilities in general multi-image understanding.

[190] Defocus Aberration Theory Confirms Gaussian Model in Most Imaging Devices

Akbar Saadat

Main category: cs.CV

TL;DR: The paper validates that Gaussian model accurately approximates defocus blur in conventional imaging devices, with less than 1% error, making it suitable for depth estimation.

Details

Motivation: Depth estimation from 2D images remains challenging due to the ill-posed nature of inferring spatial variant defocus blur. While Gaussian model offers mathematical simplicity for real-time applications, its applicability to actual imaging devices needs validation.

Method: Analyzes defocus within geometric optics framework and uses defocus aberration theory in diffraction-limited optics to evaluate accuracy of Gaussian approximation. Tests with typical focused depths (1-100 meters) and maximum depth variation of 10% at focused depth.

Result: Gaussian model shows maximum Mean Absolute Error (MAE) less than 1% for defocus operators in most imaging devices, confirming its accuracy and reliability for depth estimation applications.

Conclusion: The Gaussian model is validated as an accurate approximation for defocus operators in conventional imaging devices, supporting its use for real-time depth estimation from defocus information.

Abstract: Over the past three decades, defocus has consistently provided groundbreaking depth information in scene images. However, accurately estimating depth from 2D images continues to be a persistent and fundamental challenge in the field of 3D recovery. Heuristic approaches involve with the ill-posed problem for inferring the spatial variant defocusing blur, as the desired blur cannot be distinguished from the inherent blur. Given a prior knowledge of the defocus model, the problem become well-posed with an analytic solution for the relative blur between two images, taken at the same viewpoint with different camera settings for the focus. The Gaussian model stands out as an optimal choice for real-time applications, due to its mathematical simplicity and computational efficiency. And theoretically, it is the only model can be applied at the same time to both the absolute blur caused by depth in a single image and the relative blur resulting from depth differences between two images. This paper introduces the settings, for conventional imaging devices, to ensure that the defocusing operator adheres to the Gaussian model. Defocus analysis begins within the framework of geometric optics and is conducted by defocus aberration theory in diffraction-limited optics to obtain the accuracy of fitting the actual model to its Gaussian approximation. The results for a typical set of focused depths between $1$ and $100$ meters, with a maximum depth variation of $10%$ at the focused depth, confirm the Gaussian model’s applicability for defocus operators in most imaging devices. The findings demonstrate a maximum Mean Absolute Error $(!M!A!E)$ of less than $1%$, underscoring the model’s accuracy and reliability.

[191] SRU-Pix2Pix: A Fusion-Driven Generator Network for Medical Image Translation with Few-Shot Learning

Xihe Qiu, Yang Dai, Xiaoyu Tan, Sijia Li, Fenghao Sun, Lu Gan, Liang Liu

Main category: cs.CV

TL;DR: Enhanced Pix2Pix framework with SEResNet and U-Net++ improves MRI image translation quality and structural fidelity under few-shot conditions.

Details

Motivation: MRI has limitations in acquisition time, cost, and resolution. Image translation can address these limitations, but existing Pix2Pix methods haven't been fully optimized for medical imaging tasks.

Method: Proposed enhanced Pix2Pix framework integrating Squeeze-and-Excitation Residual Networks (SEResNet) for channel attention and U-Net++ for multi-scale feature fusion, with a simplified PatchGAN discriminator for training stability.

Result: The method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks under few-shot conditions (fewer than 500 images), demonstrating strong generalization ability.

Conclusion: The proposed framework provides an effective extension of Pix2Pix for medical image translation, addressing MRI limitations through improved image generation quality and structural fidelity.

Abstract: Magnetic Resonance Imaging (MRI) provides detailed tissue information, but its clinical application is limited by long acquisition time, high cost, and restricted resolution. Image translation has recently gained attention as a strategy to address these limitations. Although Pix2Pix has been widely applied in medical image translation, its potential has not been fully explored. In this study, we propose an enhanced Pix2Pix framework that integrates Squeeze-and-Excitation Residual Networks (SEResNet) and U-Net++ to improve image generation quality and structural fidelity. SEResNet strengthens critical feature representation through channel attention, while U-Net++ enhances multi-scale feature fusion. A simplified PatchGAN discriminator further stabilizes training and refines local anatomical realism. Experimental results demonstrate that under few-shot conditions with fewer than 500 images, the proposed method achieves consistent structural fidelity and superior image quality across multiple intra-modality MRI translation tasks, showing strong generalization ability. These results suggest an effective extension of Pix2Pix for medical image translation.

[192] Measurement-Consistent Langevin Corrector: A Remedy for Latent Diffusion Inverse Solvers

Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh

Main category: cs.CV

TL;DR: MCLC is a plug-and-play correction module that stabilizes latent diffusion model-based inverse solvers by reducing the discrepancy between solver’s and true reverse diffusion dynamics through measurement-consistent Langevin updates.

Details

Motivation: Existing latent diffusion inverse solvers suffer from instability, exhibiting undesirable artifacts and degraded quality. The authors identify this instability as a discrepancy between the solver's and true reverse diffusion dynamics.

Method: Introduces Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies LDM-based inverse solvers through measurement-consistent Langevin updates. Unlike prior approaches that rely on linear manifold assumptions (which often don’t hold in latent space), MCLC operates without this assumption.

Result: MCLC demonstrates effectiveness and compatibility with existing solvers across diverse image restoration tasks. The method leads to more stable and reliable behavior, and the authors also analyze blob artifacts and provide insights into their underlying causes.

Conclusion: MCLC is a key step toward more robust zero-shot inverse problem solvers, offering a theoretically grounded approach that addresses the instability issues in existing latent diffusion inverse solvers.

Abstract: With recent advances in generative models, diffusion models have emerged as powerful priors for solving inverse problems in each domain. Since Latent Diffusion Models (LDMs) provide generic priors, several studies have explored their potential as domain-agnostic zero-shot inverse solvers. Despite these efforts, existing latent diffusion inverse solvers suffer from their instability, exhibiting undesirable artifacts and degraded quality. In this work, we first identify the instability as a discrepancy between the solver’s and true reverse diffusion dynamics, and show that reducing this gap stabilizes the solver. Building on this, we introduce Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play correction module that remedies the LDM-based inverse solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often do not hold in latent space, MCLC operates without this assumption, leading to more stable and reliable behavior. We experimentally demonstrate the effectiveness of MCLC and its compatibility with existing solvers across diverse image restoration tasks. Additionally, we analyze blob artifacts and offer insights into their underlying causes. We highlight that MCLC is a key step toward more robust zero-shot inverse problem solvers.

[193] PyramidalWan: On Making Pretrained Video Model Pyramidal for Efficient Inference

Denis Korzhenkov, Adil Karjauv, Animesh Karnewar, Mohsen Ghafoorian, Amirhossein Habibian

Main category: cs.CV

TL;DR: A pipeline to convert pretrained diffusion models into pyramidal ones via low-cost finetuning, maintaining quality while improving computational efficiency through hierarchical multi-resolution processing.

Details

Motivation: Existing pyramidal video models trained from scratch underperform compared to state-of-the-art systems in visual quality, and there's a need to leverage pretrained diffusion models more effectively while reducing computational costs.

Method: Developed a pipeline to convert pretrained diffusion models into pyramidal models through low-cost finetuning, plus investigated various step distillation strategies to enhance inference efficiency in the hierarchical multi-resolution framework.

Result: Successfully transformed pretrained diffusion models into pyramidal ones without degradation in output video quality, while exploring step distillation methods to further improve inference efficiency.

Conclusion: The proposed approach enables efficient pyramidal video generation by leveraging existing pretrained models, offering a practical solution that maintains quality while reducing computational costs compared to training from scratch.

Abstract: Recently proposed pyramidal models decompose the conventional forward and backward diffusion processes into multiple stages operating at varying resolutions. These models handle inputs with higher noise levels at lower resolutions, while less noisy inputs are processed at higher resolutions. This hierarchical approach significantly reduces the computational cost of inference in multi-step denoising models. However, existing open-source pyramidal video models have been trained from scratch and tend to underperform compared to state-of-the-art systems in terms of visual plausibility. In this work, we present a pipeline that converts a pretrained diffusion model into a pyramidal one through low-cost finetuning, achieving this transformation without degradation in quality of output videos. Furthermore, we investigate and compare various strategies for step distillation within pyramidal models, aiming to further enhance the inference efficiency. Our results are available at https://qualcomm-ai-research.github.io/PyramidalWan.

[194] Detector-Augmented SAMURAI for Long-Duration Drone Tracking

Tamara R. Lenhard, Andreas Weinmann, Hichem Snoussi, Tobias Koch

Main category: cs.CV

TL;DR: First systematic evaluation of SAMURAI foundation model for drone tracking, with detector-augmented extension that improves robustness in urban surveillance scenarios.

Details

Motivation: Drone tracking is critical for surveillance but current RGB-based approaches are limited and rely on conventional motion models. Foundation models like SAMURAI show strong category-agnostic tracking performance but haven't been evaluated for drone-specific scenarios.

Method: Systematic evaluation of SAMURAI for drone tracking, plus introduction of a detector-augmented extension to mitigate sensitivity to bounding-box initialization and sequence length.

Result: Proposed extension significantly improves robustness in complex urban environments, especially for long-duration sequences and drone exit-re-entry events. Incorporation of detector cues yields consistent gains with success rate improvements up to +0.393 and FNR reductions up to -0.475.

Conclusion: SAMURAI shows promising potential for drone tracking, and detector-augmented extension effectively addresses its limitations, making it suitable for robust long-term drone surveillance in urban settings.

Abstract: Robust long-term tracking of drone is a critical requirement for modern surveillance systems, given their increasing threat potential. While detector-based approaches typically achieve strong frame-level accuracy, they often suffer from temporal inconsistencies caused by frequent detection dropouts. Despite its practical relevance, research on RGB-based drone tracking is still limited and largely reliant on conventional motion models. Meanwhile, foundation models like SAMURAI have established their effectiveness across other domains, exhibiting strong category-agnostic tracking performance. However, their applicability in drone-specific scenarios has not been investigated yet. Motivated by this gap, we present the first systematic evaluation of SAMURAI’s potential for robust drone tracking in urban surveillance settings. Furthermore, we introduce a detector-augmented extension of SAMURAI to mitigate sensitivity to bounding-box initialization and sequence length. Our findings demonstrate that the proposed extension significantly improves robustness in complex urban environments, with pronounced benefits in long-duration sequences - especially under drone exit-re-entry events. The incorporation of detector cues yields consistent gains over SAMURAI’s zero-shot performance across datasets and metrics, with success rate improvements of up to +0.393 and FNR reductions of up to -0.475.

[195] Integrated Framework for Selecting and Enhancing Ancient Marathi Inscription Images from Stone, Metal Plate, and Paper Documents

Bapu D. Chendage, Rajivkumar S. Mente

Main category: cs.CV

TL;DR: Proposes binarization and complementary preprocessing techniques to enhance degraded ancient script images by removing stains and improving unclear text, evaluated on stone, metal plate, and document scripts with K-NN and SVM classifiers.

Details

Motivation: Ancient script images suffer from severe background noise, low contrast, and degradation from aging/environmental effects, making inscriptions difficult to read due to similar visual characteristics between foreground text and background.

Method: Image enhancement approach based on binarization and complementary preprocessing techniques for removing stains and enhancing unclear ancient text, evaluated on different types of ancient scripts (stone, metal plates, historical documents).

Result: Using K-NN classifier: 55.7% accuracy for stone, 62% for metal plate, 65.6% for document scripts. Using SVM classifier: 53.2% for stone, 59.5% for metal plate, 67.8% for document scripts.

Conclusion: The proposed enhancement method effectively improves readability of ancient Marathi inscription images, demonstrating practical value for historical document preservation and analysis.

Abstract: Ancient script images often suffer from severe background noise, low contrast, and degradation caused by aging and environmental effects. In many cases, the foreground text and background exhibit similar visual characteristics, making the inscriptions difficult to read. The primary objective of image enhancement is to improve the readability of such degraded ancient images. This paper presents an image enhancement approach based on binarization and complementary preprocessing techniques for removing stains and enhancing unclear ancient text. The proposed methods are evaluated on different types of ancient scripts, including inscriptions on stone, metal plates, and historical documents. Experimental results show that the proposed approach achieves classification accuracies of 55.7%, 62%, and 65.6% for stone, metal plate, and document scripts, respectively, using the K-Nearest Neighbor (K-NN) classifier. Using the Support Vector Machine (SVM) classifier, accuracies of 53.2%, 59.5%, and 67.8% are obtained. The results demonstrate the effectiveness of the proposed enhancement method in improving the readability of ancient Marathi inscription images.

[196] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models

Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, Sergio Escalera

Main category: cs.CV

TL;DR: SOVABench: A surveillance video retrieval benchmark for vehicle action discrimination, with a training-free MLLM framework for interpretable embeddings.

Details

Motivation: Existing video retrieval benchmarks focus on scene-level similarity but lack evaluation of action discrimination needed for surveillance applications. There's a gap in benchmarks that specifically test cross-action discrimination and temporal understanding in surveillance contexts.

Method: 1) Created SOVABench from real surveillance footage with vehicle-related actions; 2) Defined two evaluation protocols (inter-pair for cross-action discrimination, intra-pair for temporal direction); 3) Developed training-free framework using MLLMs to generate interpretable embeddings from MLLM-generated descriptions.

Result: Action discrimination remains challenging for state-of-the-art vision and multimodal models despite being intuitive for humans. The proposed MLLM framework achieves strong performance on SOVABench and outperforms contrastive Vision-Language Models on spatial and counting benchmarks where they often fail.

Conclusion: SOVABench addresses a critical gap in surveillance video evaluation, and the MLLM-based framework provides an effective training-free solution for interpretable embeddings that handles challenging action discrimination tasks in surveillance contexts.

Abstract: Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.

[197] Character Detection using YOLO for Writer Identification in multiple Medieval books

Alessandra Scotto di Freca, Tiziana D Alessandro, Francesco Fontanella, Filippo Sarria, Claudio De Stefano

Main category: cs.CV

TL;DR: This paper presents a YOLO-based approach for scribe identification in medieval manuscripts, replacing previous template matching and CNN methods to improve letter detection and writer attribution accuracy.

Details

Motivation: The study aims to advance paleography by developing more reliable digital methods for identifying individual scribes in medieval manuscripts, which is crucial for dating documents and understanding script evolution. Previous template matching approaches had limitations requiring manual threshold setting.

Method: The authors replaced their previous template matching and CNN system with YOLOv5 object detection to identify the letter “a” in manuscript pages. This approach eliminates the need for manual threshold setting and provides confidence scores for detected letters, enabling a two-stage classification system with rejection thresholds.

Result: YOLO effectively extracts more letter instances than template matching, leading to improved second-stage classification accuracy. The confidence scores allow for developing a rejection threshold system that enables reliable writer identification even in unseen manuscripts.

Conclusion: YOLO-based object detection represents a significant improvement over previous template matching approaches for scribe identification in paleography, offering better letter detection, more accurate classification, and the potential for reliable application to unseen manuscripts through confidence-based rejection thresholds.

Abstract: Paleography is the study of ancient and historical handwriting, its key objectives include the dating of manuscripts and understanding the evolution of writing. Estimating when a document was written and tracing the development of scripts and writing styles can be aided by identifying the individual scribes who contributed to a medieval manuscript. Although digital technologies have made significant progress in this field, the general problem remains unsolved and continues to pose open challenges. … We previously proposed an approach focused on identifying specific letters or abbreviations that characterize each writer. In that study, we considered the letter “a”, as it was widely present on all pages of text and highly distinctive, according to the suggestions of expert paleographers. We used template matching techniques to detect the occurrences of the character “a” on each page and the convolutional neural network (CNN) to attribute each instance to the correct scribe. Moving from the interesting results achieved from this previous system and being aware of the limitations of the template matching technique, which requires an appropriate threshold to work, we decided to experiment in the same framework with the use of the YOLO object detection model to identify the scribe who contributed to the writing of different medieval books. We considered the fifth version of YOLO to implement the YOLO object detection model, which completely substituted the template matching and CNN used in the previous work. The experimental results demonstrate that YOLO effectively extracts a greater number of letters considered, leading to a more accurate second-stage classification. Furthermore, the YOLO confidence score provides a foundation for developing a system that applies a rejection threshold, enabling reliable writer identification even in unseen manuscripts.

[198] DivAS: Interactive 3D Segmentation of NeRFs via Depth-Weighted Voxel Aggregation

Ayush Pande

Main category: cs.CV

TL;DR: DivAS is an optimization-free, interactive framework for segmenting Neural Radiance Fields (NeRFs) using 2D SAM masks refined with NeRF depth priors, achieving real-time performance with a custom CUDA kernel.

Details

Motivation: Existing NeRF segmentation methods are optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models like SAM.

Method: Fast GUI-based workflow where 2D SAM masks from user point prompts are refined using NeRF-derived depth priors, then aggregated into a unified 3D voxel grid using a custom CUDA kernel in under 200ms.

Result: Achieves segmentation quality comparable to optimization-based methods, 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time on Mip-NeRF 360° and LLFF datasets.

Conclusion: DivAS provides an optimization-free, fully interactive framework for NeRF segmentation that maintains zero-shot capabilities of 2D models while enabling real-time feedback through efficient voxel aggregation.

Abstract: Existing methods for segmenting Neural Radiance Fields (NeRFs) are often optimization-based, requiring slow per-scene training that sacrifices the zero-shot capabilities of 2D foundation models. We introduce DivAS (Depth-interactive Voxel Aggregation Segmentation), an optimization-free, fully interactive framework that addresses these limitations. Our method operates via a fast GUI-based workflow where 2D SAM masks, generated from user point prompts, are refined using NeRF-derived depth priors to improve geometric accuracy and foreground-background separation. The core of our contribution is a custom CUDA kernel that aggregates these refined multi-view masks into a unified 3D voxel grid in under 200ms, enabling real-time visual feedback. This optimization-free design eliminates the need for per-scene training. Experiments on Mip-NeRF 360° and LLFF show that DivAS achieves segmentation quality comparable to optimization-based methods, while being 2-2.5x faster end-to-end, and up to an order of magnitude faster when excluding user prompting time.

[199] Scaling Vision Language Models for Pharmaceutical Long Form Video Reasoning on Industrial GenAI Platform

Suyash Mishra, Qiang Li, Srikanth Patil, Satyanarayan Pati, Baddu Narendra

Main category: cs.CV

TL;DR: Industrial GenAI framework for pharmaceutical video understanding that processes 200K+ PDFs, 25K+ videos, and 888 multilingual audio files, analyzing 40+ VLMs under real-world constraints with 3-8x efficiency gains.

Details

Motivation: Most VLM evaluations focus on short videos with unlimited resources, but industrial applications like pharmaceutical content understanding require processing long-form videos under strict GPU, latency, and cost constraints where existing approaches fail to scale.

Method: Developed an industrial large-scale architecture for multimodal reasoning, empirically analyzed over 40 VLMs on Video-MME and MMBench benchmarks plus proprietary dataset of 25,326 videos across 14 disease areas, focusing on practical deployment constraints.

Result: Achieved 3-8x efficiency gains using SDPA attention on commodity GPUs, multimodality improved performance in 8/12 task domains (especially length-dependent tasks), and identified bottlenecks in temporal alignment and keyframe detection across both open- and closed-source VLMs.

Conclusion: Rather than proposing new models, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, providing actionable guidance for designing scalable multimodal systems for long-form video understanding in industrial domains.

Abstract: Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4, M4V, etc.), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) an industrial large-scale architecture for multimodal reasoning in pharmaceutical domains; (ii) empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3-8 times efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8/12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new “A+B” model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provide actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.

[200] Rotation-Robust Regression with Convolutional Model Trees

Hongyi Li, William Ward Armstrong, Jun Xu

Main category: cs.CV

TL;DR: The paper studies rotation-robust learning using Convolutional Model Trees with geometry-aware inductive biases and deployment-time orientation search to improve robustness under image rotations.

Details

Motivation: To develop rotation-robust learning methods for image inputs that can handle geometric transformations at deployment time, particularly in-plane rotations, using structured model architectures.

Method: Uses Convolutional Model Trees (CMTs) with three geometry-aware inductive biases: convolutional smoothing, tilt dominance constraint, and importance-based pruning. Also implements deployment-time orientation search that selects discrete rotations maximizing forest-level confidence without updating model parameters.

Result: Orientation search improves robustness under severe rotations but can be harmful near canonical orientation when confidence is misaligned with correctness. Consistent trends observed on MNIST digit recognition implemented as one-vs-rest regression.

Conclusion: The study highlights both the promise and limitations of confidence-based orientation selection for model-tree ensembles, showing that structured inductive biases and deployment-time orientation search can enhance rotation robustness but require careful handling of confidence-calibration issues.

Abstract: We study rotation-robust learning for image inputs using Convolutional Model Trees (CMTs) [1], whose split and leaf coefficients can be structured on the image grid and transformed geometrically at deployment time. In a controlled MNIST setting with a rotation-invariant regression target, we introduce three geometry-aware inductive biases for split directions – convolutional smoothing, a tilt dominance constraint, and importance-based pruning – and quantify their impact on robustness under in-plane rotations. We further evaluate a deployment-time orientation search that selects a discrete rotation maximizing a forest-level confidence proxy without updating model parameters. Orientation search improves robustness under severe rotations but can be harmful near the canonical orientation when confidence is misaligned with correctness. Finally, we observe consistent trends on MNIST digit recognition implemented as one-vs-rest regression, highlighting both the promise and limitations of confidence-based orientation selection for model-tree ensembles.

[201] Prototypicality Bias Reveals Blindspots in Multimodal Evaluation Metrics

Subhadeep Roy, Gagan Bhatia, Steffen Eger

Main category: cs.CV

TL;DR: The paper identifies prototypicality bias in multimodal evaluation metrics and introduces ProtoBias benchmark to test this bias, showing current metrics often misrank semantically correct but non-prototypical images vs. incorrect but prototypical ones. They propose ProtoScore, a more robust 7B-parameter metric.

Details

Motivation: Automatic metrics are widely used to evaluate text-to-image models but may prioritize visually/socially prototypical images from biased data rather than semantic correctness. The paper aims to study systematic prototypicality bias in multimodal evaluation.

Method: Introduces ProtoBias benchmark with controlled contrastive pairs across Animals, Objects, and Demography categories. Each pair contains semantically correct but non-prototypical images vs. subtly incorrect yet prototypical adversarial counterparts. Tests metrics’ ability to follow textual semantics vs. default to prototypes.

Result: Widely used metrics (CLIPScore, PickScore, VQA-based scores) frequently misrank pairs, favoring prototypical but incorrect images. LLM-as-Judge systems show uneven robustness. Human evaluations consistently favor semantic correctness with larger margins. ProtoScore (7B-parameter metric) substantially reduces failure rates and misranking while being much faster than GPT-5 inference.

Conclusion: Current multimodal evaluation metrics suffer from prototypicality bias, prioritizing visual/social prototypes over semantic correctness. ProtoScore offers a more robust alternative that better aligns with human judgment while being computationally efficient.

Abstract: Automatic metrics are now central to evaluating text-to-image models, often substituting for human judgment in benchmarking and large-scale filtering. However, it remains unclear whether these metrics truly prioritize semantic correctness or instead favor visually and socially prototypical images learned from biased data distributions. We identify and study \emph{prototypicality bias} as a systematic failure mode in multimodal evaluation. We introduce a controlled contrastive benchmark \textsc{\textbf{ProtoBias}} (\textit{\textbf{Proto}typical \textbf{Bias}}), spanning Animals, Objects, and Demography images, where semantically correct but non-prototypical images are paired with subtly incorrect yet prototypical adversarial counterparts. This setup enables a directional evaluation of whether metrics follow textual semantics or default to prototypes. Our results show that widely used metrics, including CLIPScore, PickScore, and VQA-based scores, frequently misrank these pairs, while even LLM-as-Judge systems exhibit uneven robustness in socially grounded cases. Human evaluations consistently favour semantic correctness with larger decision margins. Motivated by these findings, we propose \textbf{\textsc{ProtoScore}}, a robust 7B-parameter metric that substantially reduces failure rates and suppresses misranking, while running at orders of magnitude faster than the inference time of GPT-5, approaching the robustness of much larger closed-source judges.

[202] TEA: Temporal Adaptive Satellite Image Semantic Segmentation

Juyuan Kang, Hao Zhu, Yan Zhu, Wei Zhang, Jianing Chen, Tianxiang Xiao, Yike Ma, Hao Jiang, Feng Dai

Main category: cs.CV

TL;DR: TEA: A temporal adaptive method for satellite image time-series semantic segmentation that improves model generalization across varying sequence lengths through teacher-student knowledge transfer and reconstruction tasks.

Details

Motivation: Existing SITS segmentation methods work well with predetermined sequence lengths but fail to generalize across scenarios with varying temporal lengths, leading to poor segmentation results when sequence lengths differ.

Method: Proposes TEA (TEmporal Adaptive SITS segmentation) with: 1) Teacher model encapsulating global sequence knowledge to guide student model with adaptive temporal inputs, 2) Knowledge transfer via intermediate embeddings, prototypes, and soft labels, 3) Dynamic student model aggregation to mitigate knowledge forgetting, 4) Full-sequence reconstruction as auxiliary task to enhance representation quality across varying temporal lengths.

Result: Extensive experiments demonstrate remarkable improvements across inputs of different temporal lengths on common benchmarks compared to existing approaches.

Conclusion: TEA effectively addresses the generalization problem in SITS segmentation across varying temporal lengths, providing a robust solution for agricultural parcel segmentation with satellite image time-series data.

Abstract: Crop mapping based on satellite images time-series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlooked the generalization capability of models across scenarios with varying temporal length, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method to enhance the model’s resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, teacher shapes the student’s feature space via intermediate embedding, prototypes and soft label perspectives to realize knowledge transfer, while dynamically aggregating student model to mitigate knowledge forgetting. Finally, we introduce full-sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.

[203] SparseLaneSTP: Leveraging Spatio-Temporal Priors with Sparse Transformers for 3D Lane Detection

Maximilian Pittner, Joel Janai, Mario Faigle, Alexandru Paul Condurache

Main category: cs.CV

TL;DR: SparseLaneSTP: A sparse lane transformer that integrates lane geometric priors and temporal information for improved 3D lane detection, with a new auto-labeled dataset.

Details

Motivation: Existing 3D lane detection methods have limitations: dense BEV approaches suffer from erroneous transformations and poor feature alignment; sparse detectors ignore valuable lane-specific priors; and no methods utilize historical lane observations to resolve visibility ambiguities.

Method: SparseLaneSTP integrates geometric lane structure and temporal information into a sparse lane transformer. It introduces: 1) lane-specific spatio-temporal attention mechanism, 2) continuous lane representation for sparse architectures, and 3) temporal regularization. Also creates a new precise 3D lane dataset using auto-labeling.

Result: Achieves state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks and on the novel dataset. Demonstrates benefits of integrating lane priors and temporal information.

Conclusion: SparseLaneSTP effectively addresses limitations of existing 3D lane detection methods by incorporating lane geometric priors and temporal information, while the new dataset provides more precise and consistent ground truth for evaluation.

Abstract: 3D lane detection has emerged as a critical challenge in autonomous driving, encompassing identification and localization of lane markings and the 3D road surface. Conventional 3D methods detect lanes from dense birds-eye-viewed (BEV) features, though erroneous transformations often result in a poor feature representation misaligned with the true 3D road surface. While recent sparse lane detectors have surpassed dense BEV approaches, they completely disregard valuable lane-specific priors. Furthermore, existing methods fail to utilize historic lane observations, which yield the potential to resolve ambiguities in situations of poor visibility. To address these challenges, we present SparseLaneSTP, a novel method that integrates both geometric properties of the lane structure and temporal information into a sparse lane transformer. It introduces a new lane-specific spatio-temporal attention mechanism, a continuous lane representation tailored for sparse architectures as well as temporal regularization. Identifying weaknesses of existing 3D lane datasets, we also introduce a precise and consistent 3D lane dataset using a simple yet effective auto-labeling strategy. Our experimental section proves the benefits of our contributions and demonstrates state-of-the-art performance across all detection and error metrics on existing 3D lane detection benchmarks as well as on our novel dataset.

[204] OceanSplat: Object-aware Gaussian Splatting with Trinocular View Consistency for Underwater Scene Reconstruction

Minseong Kweon, Jinsun Park

Main category: cs.CV

TL;DR: OceanSplat improves 3D Gaussian Splatting for underwater scenes by enforcing trinocular view consistency, using synthetic epipolar depth priors, and implementing depth-aware alpha adjustment to reduce medium artifacts.

Details

Motivation: Underwater scenes suffer from optical degradation causing multi-view inconsistencies and floating artifacts in 3D reconstruction, making it challenging to accurately represent object geometry separate from the scattering medium.

Method: 1) Enforces trinocular view consistency by rendering horizontally/vertically translated camera views and aligning via inverse warping. 2) Derives synthetic epipolar depth prior through triangulation as self-supervised depth regularizer. 3) Implements depth-aware alpha adjustment to modulate 3D Gaussian opacity based on z-component and viewing direction during early training.

Result: OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media on real-world underwater and simulated scenes, disentangling 3D Gaussians from scattering medium and significantly reducing floating artifacts.

Conclusion: The proposed geometric constraints and depth-aware regularization enable robust representation of object geometry in underwater environments by preventing medium-induced primitive formation and preserving scene structure.

Abstract: We introduce OceanSplat, a novel 3D Gaussian Splatting-based approach for accurately representing 3D geometry in underwater scenes. To overcome multi-view inconsistencies caused by underwater optical degradation, our method enforces trinocular view consistency by rendering horizontally and vertically translated camera views relative to each input view and aligning them via inverse warping. Furthermore, these translated camera views are used to derive a synthetic epipolar depth prior through triangulation, which serves as a self-supervised depth regularizer. These geometric constraints facilitate the spatial optimization of 3D Gaussians and preserve scene structure in underwater environments. We also propose a depth-aware alpha adjustment that modulates the opacity of 3D Gaussians during early training based on their $z$-component and viewing direction, deterring the formation of medium-induced primitives. With our contributions, 3D Gaussians are disentangled from the scattering medium, enabling robust representation of object geometry and significantly reducing floating artifacts in reconstructed underwater scenes. Experiments on real-world underwater and simulated scenes demonstrate that OceanSplat substantially outperforms existing methods for both scene reconstruction and restoration in scattering media.

[205] Higher-Order Adversarial Patches for Real-Time Object Detectors

Jens Bayer, Stefan Becker, David Münch, Michael Arens, Jürgen Beyerer

Main category: cs.CV

TL;DR: Higher-order adversarial attacks on object detectors show stronger generalization than lower-order attacks, and adversarial training alone is insufficient for effective defense.

Details

Motivation: To investigate the impact of higher-order adversarial attacks on object detectors, examining the cat-and-mouse dynamic between attack patterns and adversarial training defenses.

Method: Successively train adversarial attack patterns and harden object detectors with adversarial training, using YOLOv10 as representative and adversarial patches in evasion attacks.

Result: Higher-order adversarial patches demonstrate stronger generalization capacity compared to lower-order patches, and adversarial training alone is insufficient to effectively harden object detectors against such attacks.

Conclusion: Higher-order adversarial attacks pose a significant threat to object detectors with enhanced generalization, requiring more robust defense strategies beyond standard adversarial training.

Abstract: Higher-order adversarial attacks can directly be considered the result of a cat-and-mouse game – an elaborate action involving constant pursuit, near captures, and repeated escapes. This idiom describes the enduring circular training of adversarial attack patterns and adversarial training the best. The following work investigates the impact of higher-order adversarial attacks on object detectors by successively training attack patterns and hardening object detectors with adversarial training. The YOLOv10 object detector is chosen as a representative, and adversarial patches are used in an evasion attack manner. Our results indicate that higher-order adversarial patches are not only affecting the object detector directly trained on but rather provide a stronger generalization capacity compared to lower-order adversarial patches. Moreover, the results highlight that solely adversarial training is not sufficient to harden an object detector efficiently against this kind of adversarial attack. Code: https://github.com/JensBayer/HigherOrder

[206] Patch-based Representation and Learning for Efficient Deformation Modeling

Ruochen Chen, Thuy Tran, Shaifali Parashar

Main category: cs.CV

TL;DR: PolyFit: A patch-based surface representation using local jet functions that enables efficient surface deformation by updating compact jet coefficients rather than per-vertex optimization.

Details

Motivation: Current surface deformation methods often require optimizing per-vertex degrees of freedom, which is computationally expensive. There's a need for more efficient representations that can handle various surface types and enable faster deformation for computer vision and graphics applications.

Method: PolyFit learns a patch-based surface representation by fitting jet functions locally on surface patches. This supervised learning approach works with both analytic functions and real data. The learned representation allows surface deformation through updating compact jet coefficients rather than per-vertex optimization.

Result: The method demonstrates competitive performance in two applications: 1) Shape-from-template with test-time optimization that’s faster than offline physics-based solvers and more accurate than physics-guided neural simulators, 2) Garment draping with a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, offering up to 10x faster inference than baselines.

Conclusion: PolyFit provides an efficient patch-based surface representation that enables fast and accurate surface deformation for various computer vision and graphics tasks, outperforming existing methods in both speed and accuracy.

Abstract: In this paper, we present a patch-based representation of surfaces, PolyFit, which is obtained by fitting jet functions locally on surface patches. Such a representation can be learned efficiently in a supervised fashion from both analytic functions and real data. Once learned, it can be generalized to various types of surfaces. Using PolyFit, the surfaces can be efficiently deformed by updating a compact set of jet coefficients rather than optimizing per-vertex degrees of freedom for many downstream tasks in computer vision and graphics. We demonstrate the capabilities of our proposed methodologies with two applications: 1) Shape-from-template (SfT): where the goal is to deform the input 3D template of an object as seen in image/video. Using PolyFit, we adopt test-time optimization that delivers competitive accuracy while being markedly faster than offline physics-based solvers, and outperforms recent physics-guided neural simulators in accuracy at modest additional runtime. 2) Garment draping. We train a self-supervised, mesh- and garment-agnostic model that generalizes across resolutions and garment types, delivering up to an order-of-magnitude faster inference than strong baselines.

[207] From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)

Suyash Mishra, Qiang Li, Srikanth Patil, Anubhav Girdhar

Main category: cs.CV

TL;DR: Domain-adapted Video to Video Clip Generation framework using ALMs and VLMs for pharmaceutical industry, achieving 3-4x speedup, 4x cost reduction, and improved clip quality over SOTA baselines.

Details

Motivation: Traditional manual annotation of heterogeneous pharmaceutical data (text, images, video, audio, web links) is inconsistent, inefficient, and struggles with long-form content like clinical trial interviews and educational seminars.

Method: Three-fold approach: (1) Cut & Merge algorithm with fade in/out and timestamp normalization for smooth transitions; (2) Personalization via role definition and prompt injection for marketing/training/regulatory outputs; (3) Cost-efficient end-to-end pipeline balancing ALM/VLM enhanced processing.

Result: 3-4x speedup, 4x cost reduction on 16,159 pharmacy videos across 14 disease areas. Improved clip coherence (0.348) and informativeness (0.721) scores over SOTA VLM baselines like Gemini 2.5 Pro.

Conclusion: The framework enables transparent, custom extractive, and compliance-supporting video summarization for life sciences, demonstrating the potential of domain-adapted ALM/VLM integration for pharmaceutical content processing.

Abstract: Vision Language Models (VLMs) are poised to revolutionize the digital transformation of pharmacyceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links), is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data further exacerbates these challenges, (e.g. long clinical trial interviews and educational seminars). Here, we introduce a domain adapted Video to Video Clip Generation framework that integrates Audio Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut & Merge algorithm with fade in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost efficient e2e pipeline strategy balancing ALM/VLM enhanced processing. Evaluations on Video MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3 to 4 times speedup, 4 times cost reduction, and competitive clip quality. Beyond efficiency gains, we also report our methods improved clip coherence scores (0.348) and informativeness scores (0.721) over state of the art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance supporting video summarization for life sciences.

[208] Driving on Registers

Ellington Kirby, Alexandre Boulch, Yihong Xu, Yuan Yin, Gilles Puy, Éloi Zablocki, Andrei Bursuc, Spyros Gidaris, Renaud Marlet, Florent Bartoccioni, Anh-Quan Cao, Nermin Samet, Tuan-Hung VU, Matthieu Cord

Main category: cs.CV

TL;DR: DrivoR is a transformer-based autonomous driving system that uses camera-aware register tokens to compress multi-camera features, with lightweight decoders for trajectory generation and scoring based on safety, comfort, and efficiency metrics.

Details

Motivation: To create an efficient end-to-end autonomous driving system that reduces computational overhead while maintaining accuracy, and enables behavior-conditioned driving through interpretable scoring metrics.

Method: Uses pretrained Vision Transformers with camera-aware register tokens to compress multi-camera features, then employs two lightweight transformer decoders: one for trajectory generation and another for scoring trajectories based on safety, comfort, and efficiency metrics learned from an oracle.

Result: Outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and HUGSIM benchmarks, demonstrating that pure-transformer architecture with token compression enables accurate, efficient, and adaptive driving.

Conclusion: A pure-transformer architecture with targeted token compression is sufficient for accurate, efficient, and adaptive end-to-end autonomous driving, offering interpretable behavior-conditioned driving through learned sub-scores.

Abstract: We present DrivoR, a simple and efficient transformer-based architecture for end-to-end autonomous driving. Our approach builds on pretrained Vision Transformers (ViTs) and introduces camera-aware register tokens that compress multi-camera features into a compact scene representation, significantly reducing downstream computation without sacrificing accuracy. These tokens drive two lightweight transformer decoders that generate and then score candidate trajectories. The scoring decoder learns to mimic an oracle and predicts interpretable sub-scores representing aspects such as safety, comfort, and efficiency, enabling behavior-conditioned driving at inference. Despite its minimal design, DrivoR outperforms or matches strong contemporary baselines across NAVSIM-v1, NAVSIM-v2, and the photorealistic closed-loop HUGSIM benchmark. Our results show that a pure-transformer architecture, combined with targeted token compression, is sufficient for accurate, efficient, and adaptive end-to-end driving. Code and checkpoints will be made available via the project page.

[209] UniLiPs: Unified LiDAR Pseudo-Labeling with Geometry-Grounded Dynamic Scene Decomposition

Filippo Ghilotti, Samuel Brucker, Nahku Saidy, Matteo Matteucci, Mario Bijelic, Felix Heide

Main category: cs.CV

TL;DR: Unsupervised 3D pseudo-labeling method that lifts text and 2D vision foundation model cues into 3D using temporal-geometric consistency across LiDAR sweeps, producing semantic labels, bounding boxes, and dense scans without manual supervision.

Details

Motivation: Unlabeled LiDAR logs in autonomous driving are abundant but useless without expensive human labels, creating a major cost barrier for perception research. The paper aims to overcome this by leveraging temporal-geometric consistency to extract value from unlabeled data.

Method: Uses unsupervised multi-modal pseudo-labeling with geometric priors from temporally accumulated LiDAR maps. Features a novel iterative update rule enforcing joint geometric-semantic consistency and detects moving objects from inconsistencies. Lifts cues from text and 2D vision foundation models directly into 3D.

Result: Method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans with robust generalization across three datasets. Outperforms existing pseudo-labeling methods that require manual supervision. Improves depth prediction by 51.5% and 22.0% MAE in 80-150m and 150-250m ranges respectively.

Conclusion: The approach successfully addresses the labeling bottleneck in autonomous perception by leveraging temporal-geometric consistency to create valuable 3D annotations from unlabeled LiDAR data without human intervention, demonstrating significant improvements in downstream tasks.

Abstract: Unlabeled LiDAR logs, in autonomous driving applications, are inherently a gold mine of dense 3D geometry hiding in plain sight - yet they are almost useless without human labels, highlighting a dominant cost barrier for autonomous-perception research. In this work we tackle this bottleneck by leveraging temporal-geometric consistency across LiDAR sweeps to lift and fuse cues from text and 2D vision foundation models directly into 3D, without any manual input. We introduce an unsupervised multi-modal pseudo-labeling method relying on strong geometric priors learned from temporally accumulated LiDAR maps, alongside with a novel iterative update rule that enforces joint geometric-semantic consistency, and vice-versa detecting moving objects from inconsistencies. Our method simultaneously produces 3D semantic labels, 3D bounding boxes, and dense LiDAR scans, demonstrating robust generalization across three datasets. We experimentally validate that our method compares favorably to existing semantic segmentation and object detection pseudo-labeling methods, which often require additional manual supervision. We confirm that even a small fraction of our geometrically consistent, densified LiDAR improves depth prediction by 51.5% and 22.0% MAE in the 80-150 and 150-250 meters range, respectively.

[210] From Rays to Projections: Better Inputs for Feed-Forward View Synthesis

Zirui Wu, Zeren Jiang, Martin R. Oswald, Jie Song

Main category: cs.CV

TL;DR: Proposes projective conditioning for view synthesis, replacing camera parameters with stable 2D projective cues to improve geometric consistency and robustness.

Details

Motivation: Existing feed-forward view synthesis models use Plücker ray maps that tie predictions to arbitrary world coordinates, making them sensitive to small camera transformations and undermining geometric consistency. The paper seeks better conditioning inputs for robust and consistent view synthesis.

Method: Introduces projective conditioning which replaces raw camera parameters with target-view projective cues (stable 2D inputs), reframing the task from geometric regression in ray space to well-conditioned image-to-image translation. Also proposes masked autoencoding pretraining tailored to these cues for large-scale uncalibrated data.

Result: Shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on view-consistency benchmark. Achieves state-of-the-art quality on standard novel view synthesis benchmarks.

Conclusion: Projective conditioning with stable 2D cues provides more robust and geometrically consistent view synthesis than ray-based conditioning, enabling better performance through reframing the problem as image-to-image translation with effective pretraining.

Abstract: Feed-forward view synthesis models predict a novel view in a single pass with minimal 3D inductive bias. Existing works encode cameras as Plücker ray maps, which tie predictions to the arbitrary world coordinate gauge and make them sensitive to small camera transformations, thereby undermining geometric consistency. In this paper, we ask what inputs best condition a model for robust and consistent view synthesis. We propose projective conditioning, which replaces raw camera parameters with a target-view projective cue that provides a stable 2D input. This reframes the task from a brittle geometric regression problem in ray space to a well-conditioned target-view image-to-image translation problem. Additionally, we introduce a masked autoencoding pretraining strategy tailored to this cue, enabling the use of large-scale uncalibrated data for pretraining. Our method shows improved fidelity and stronger cross-view consistency compared to ray-conditioned baselines on our view-consistency benchmark. It also achieves state-of-the-art quality on standard novel view synthesis benchmarks.

[211] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Runze He, Yiji Cheng, Tiankai Hang, Zhimin Li, Yu Xu, Zijin Yin, Shiyi Zhang, Wenxun Dai, Penghui Du, Ao Ma, Chunyu Wang, Qinglin Lu, Jizhong Han, Jiao Dai

Main category: cs.CV

TL;DR: Re-Align is a unified framework that bridges understanding and generation in in-context image tasks through structured reasoning-guided alignment and RL training.

Details

Motivation: Current unified multimodal models have strong understanding capabilities but fail to effectively transfer these strengths to image generation tasks, creating a gap between understanding and generation in in-context image generation and editing.

Method: Introduces In-Context Chain-of-Thought (IC-CoT) to decouple semantic guidance and reference association, plus an RL training scheme using surrogate rewards to align structured reasoning text with generated images.

Result: Extensive experiments show Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

Conclusion: Re-Align successfully bridges the gap between understanding and generation through structured reasoning-guided alignment, improving performance on ICGE tasks.

Abstract: In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model’s overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

[212] VERSE: Visual Embedding Reduction and Space Exploration. Clustering-Guided Insights for Training Data Enhancement in Visually-Rich Document Understanding

Ignacio de Rodrigo, Alvaro J. Lopez-Lopez, Jaime Boal

Main category: cs.CV

TL;DR: VERSE is a methodology for analyzing and improving Vision-Language Models for document understanding by visualizing latent representations, identifying problematic regions, and generating synthetic data to enhance performance.

Details

Motivation: To improve Vision-Language Models for Visually-rich Document Understanding by enabling better analysis of their visual embedding space and addressing performance gaps in specific visual feature clusters.

Method: VERSE methodology involves: 1) visualizing latent representations of visual embeddings, 2) identifying problematic regions/clusters in the embedding space, 3) generating synthetic data targeting those problematic clusters, and 4) retraining models with the enhanced dataset.

Result: VERSE successfully uncovered visual features associated with error-prone clusters. Retraining with synthetic data containing these features substantially boosted F1 performance without degrading generalization. On-premise models (Donut, Idefics2) optimized with VERSE matched or surpassed SaaS solutions (GPT-4, Pixtral).

Conclusion: VERSE provides an effective methodology for analyzing and improving Vision-Language Models for document understanding through targeted synthetic data generation, enabling on-premise models to achieve state-of-the-art performance comparable to commercial SaaS solutions.

Abstract: This work introduces VERSE, a methodology for analyzing and improving Vision-Language Models applied to Visually-rich Document Understanding by exploring their visual embedding space. VERSE enables the visualization of latent representations, supporting the assessment of model feasibility. It also facilitates the identification of problematic regions and guides the generation of synthetic data to enhance performance in those clusters. We validate the methodology by training on the synthetic MERIT Dataset and evaluating on its real-world counterpart, MERIT Secret. Results show that VERSE helps uncover the visual features associated with error-prone clusters, and that retraining with samples containing these features substantially boosts F1 performance without degrading generalization. Furthermore, we demonstrate that on-premise models such as Donut and Idefics2, when optimized with VERSE, match or even surpass the performance of SaaS solutions like GPT-4 and Pixtral.

[213] VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control

Sixiao Zheng, Minghao Yin, Wenbo Hu, Xiaoyu Li, Ying Shan, Yanwei Fu

Main category: cs.CV

TL;DR: VerseCrafter is a 4D-aware video world model that enables explicit control over camera and object dynamics using a novel 4D Geometric Control representation, trained on automatically extracted 4D data from in-the-wild videos.

Details

Motivation: Existing video world models struggle with unified and precise control over camera and multi-object motion since videos operate in 2D image space, lacking explicit 4D geometric understanding.

Method: Introduces 4D Geometric Control representation using static background point clouds and per-object 3D Gaussian trajectories, rendered as conditioning signals for a pretrained video diffusion model. Uses automatic data engine to extract 4D controls from in-the-wild videos for training.

Result: Enables generation of high-fidelity, view-consistent videos that precisely adhere to specified camera and object dynamics, overcoming data scarcity through automatic 4D annotation extraction.

Conclusion: VerseCrafter bridges the gap between 2D video generation and 4D geometric control, providing a flexible, category-agnostic approach to video world modeling with explicit control over both camera and object motion.

Abstract: Video world models aim to simulate dynamic, real-world environments, yet existing methods struggle to provide unified and precise control over camera and multi-object motion, as videos inherently operate dynamics in the projected 2D image plane. To bridge this gap, we introduce VerseCrafter, a 4D-aware video world model that enables explicit and coherent control over both camera and object dynamics within a unified 4D geometric world state. Our approach is centered on a novel 4D Geometric Control representation, which encodes the world state through a static background point cloud and per-object 3D Gaussian trajectories. This representation captures not only an object’s path but also its probabilistic 3D occupancy over time, offering a flexible, category-agnostic alternative to rigid bounding boxes or parametric models. These 4D controls are rendered into conditioning signals for a pretrained video diffusion model, enabling the generation of high-fidelity, view-consistent videos that precisely adhere to the specified dynamics. Unfortunately, another major challenge lies in the scarcity of large-scale training data with explicit 4D annotations. We address this by developing an automatic data engine that extracts the required 4D controls from in-the-wild videos, allowing us to train our model on a massive and diverse dataset.

[214] A Lightweight and Explainable Vision-Language Framework for Crop Disease Visual Question Answering

Md. Zahid Hossain, Most. Sharmin Sultana Samu, Md. Rakibul Islam, Md. Siam Ansary

Main category: cs.CV

TL;DR: A lightweight vision-language framework for crop disease VQA using Swin Transformer encoder with sequence decoders achieves high accuracy with fewer parameters than large-scale baselines.

Details

Motivation: Crop disease visual question answering requires accurate visual understanding and reliable language generation, but existing large-scale vision-language models are parameter-heavy and may not be optimized for this specific agricultural domain.

Method: Combines Swin Transformer vision encoder with sequence-to-sequence language decoders using two-stage training strategy for improved visual representation learning and cross-modal alignment.

Result: High accuracy for both crop and disease identification, strong performance on BLEU, ROUGE and BERTScore metrics, outperforms large-scale vision-language baselines with significantly fewer parameters.

Conclusion: Task-specific visual pretraining is effective for crop disease VQA, and the lightweight framework demonstrates robust performance under diverse user queries while maintaining explainability through Grad-CAM and token-level attribution.

Abstract: Visual question answering for crop disease analysis requires accurate visual understanding and reliable language generation. This work presents a lightweight vision-language framework for crop and disease identification from leaf images. The proposed approach combines a Swin Transformer vision encoder with sequence-to-sequence language decoders. A two-stage training strategy is adopted to improve visual representation learning and cross-modal alignment. The model is evaluated on a large-scale crop disease dataset using classification and natural language generation metrics. Experimental results show high accuracy for both crop and disease identification. The framework also achieves strong performance on BLEU, ROUGE and BERTScore. Our proposed models outperform large-scale vision-language baselines while using significantly fewer parameters. Explainability is assessed using Grad-CAM and token-level attribution. Qualitative results demonstrate robust performance under diverse user-driven queries. These findings highlight the effectiveness of task-specific visual pretraining for crop disease visual question answering.

[215] Atlas 2 – Foundation models for clinical deployment

Maximilian Alber, Timo Milbich, Alexandra Carpen-Amarie, Stephan Tietz, Jonas Dippel, Lukas Muttenthaler, Beatriz Perez Cancer, Alessandro Benetti, Panos Korfiatis, Elias Eulig, Jérôme Lüscher, Jiasen Wu, Sayed Abid Hashimi, Gabriel Dernbach, Simon Schallenberg, Neelay Shah, Moritz Krügener, Aniruddh Jammoria, Jake Matras, Patrick Duffy, Matt Redlon, Philipp Jurmeister, David Horst, Lukas Ruff, Klaus-Robert Müller, Frederick Klauschen, Andrew Norgan

Main category: cs.CV

TL;DR: Atlas 2 series pathology foundation models achieve SOTA performance, robustness, and efficiency across 80 benchmarks, trained on largest pathology dataset (5.5M WSIs from 3 institutions).

Details

Motivation: Existing pathology foundation models have tradeoffs in performance, robustness, and computational requirements that limit clinical deployment.

Method: Developed three pathology vision foundation models (Atlas 2, Atlas 2-B, Atlas 2-S) trained on largest pathology dataset to date (5.5 million whole slide images from Charité Berlin, LMU Munich, and Mayo Clinic).

Result: Models show state-of-the-art performance in prediction performance, robustness, and resource efficiency across eighty public benchmarks.

Conclusion: Atlas 2 series bridges shortcomings of previous models, advancing computational pathology toward clinical deployment.

Abstract: Pathology foundation models substantially advanced the possibilities in computational pathology – yet tradeoffs in terms of performance, robustness, and computational requirements remained, which limited their clinical deployment. In this report, we present Atlas 2, Atlas 2-B, and Atlas 2-S, three pathology vision foundation models which bridge these shortcomings by showing state-of-the-art performance in prediction performance, robustness, and resource efficiency in a comprehensive evaluation across eighty public benchmarks. Our models were trained on the largest pathology foundation model dataset to date comprising 5.5 million histopathology whole slide images, collected from three medical institutions Charité - Universtätsmedizin Berlin, LMU Munich, and Mayo Clinic.

[216] Multi-Scale Local Speculative Decoding for Image Generation

Elia Peruzzo, Guillaume Sautière, Amirhossein Habibian

Main category: cs.CV

TL;DR: MuLo-SD accelerates autoregressive image generation using multi-resolution drafting with local rejection/resampling, achieving 1.7× speedup while maintaining quality.

Details

Motivation: Autoregressive models have high latency due to sequential nature. Existing speculative decoding approaches suffer from token-level ambiguity and lack spatial awareness, limiting acceleration potential.

Method: Multi-scale local speculative decoding with low-resolution drafter + learned up-samplers to propose candidate tokens, parallel verification by high-resolution target model, and local rejection/resampling mechanism focusing on spatial neighborhoods rather than raster-scan resampling.

Result: Achieves up to 1.7× speedup, outperforming EAGLE-2 and LANTERN baselines while maintaining comparable semantic alignment and perceptual quality on MS-COCO 5k validation split using GenEval, DPG-Bench, and FID/HPSv2 metrics.

Conclusion: MuLo-SD sets new SOTA in speculative decoding for image synthesis, bridging efficiency-fidelity gap through multi-resolution drafting with spatially-aware local correction mechanisms.

Abstract: Autoregressive (AR) models have achieved remarkable success in image synthesis, yet their sequential nature imposes significant latency constraints. Speculative Decoding offers a promising avenue for acceleration, but existing approaches are limited by token-level ambiguity and lack of spatial awareness. In this work, we introduce Multi-Scale Local Speculative Decoding (MuLo-SD), a novel framework that combines multi-resolution drafting with spatially informed verification to accelerate AR image generation. Our method leverages a low-resolution drafter paired with learned up-samplers to propose candidate image tokens, which are then verified in parallel by a high-resolution target model. Crucially, we incorporate a local rejection and resampling mechanism, enabling efficient correction of draft errors by focusing on spatial neighborhoods rather than raster-scan resampling after the first rejection. We demonstrate that MuLo-SD achieves substantial speedups - up to $\mathbf{1.7\times}$ - outperforming strong speculative decoding baselines such as EAGLE-2 and LANTERN in terms of acceleration, while maintaining comparable semantic alignment and perceptual quality. These results are validated using GenEval, DPG-Bench, and FID/HPSv2 on the MS-COCO 5k validation split. Extensive ablations highlight the impact of up-sampling design, probability pooling, and local rejection and resampling with neighborhood expansion. Our approach sets a new state-of-the-art in speculative decoding for image synthesis, bridging the gap between efficiency and fidelity.

[217] Vision-Language Introspection: Mitigating Overconfident Hallucinations in MLLMs via Interpretable Bi-Causal Steering

Shuliang Liu, Songbo Yang, Dong Fang, Sihang Jia, Yuqi Tang, Lingfeng Su, Ruoshui Peng, Yibo Yan, Xin Zou, Xuming Hu

Main category: cs.CV

TL;DR: Vision-Language Introspection (VLI) is a training-free inference framework that reduces object hallucination in Multimodal Large Language Models by simulating metacognitive self-correction through attributive introspection and interpretable bi-causal steering.

Details

Motivation: Object hallucination critically undermines MLLM reliability due to models blindly trusting linguistic priors over visual evidence. Existing mitigations are limited: contrastive decoding is superficial, and latent steering methods use static vectors lacking instance-specific precision.

Method: VLI performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize causal visual anchors. Then uses Interpretable Bi-Causal Steering to actively modulate inference, dynamically isolating visual evidence from noise while neutralizing blind confidence through adaptive calibration.

Result: VLI achieves state-of-the-art performance, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE benchmark.

Conclusion: VLI provides an effective training-free framework for reducing object hallucination in MLLMs through metacognitive self-correction, addressing fundamental cognitive introspection failures while maintaining interpretability and instance-specific precision.

Abstract: Object hallucination critically undermines the reliability of Multimodal Large Language Models, often stemming from a fundamental failure in cognitive introspection, where models blindly trust linguistic priors over specific visual evidence. Existing mitigations remain limited: contrastive decoding approaches operate superficially without rectifying internal semantic misalignments, while current latent steering methods rely on static vectors that lack instance-specific precision. We introduce Vision-Language Introspection (VLI), a training-free inference framework that simulates a metacognitive self-correction process. VLI first performs Attributive Introspection to diagnose hallucination risks via probabilistic conflict detection and localize the causal visual anchors. It then employs Interpretable Bi-Causal Steering to actively modulate the inference process, dynamically isolating visual evidence from background noise while neutralizing blind confidence through adaptive calibration. VLI achieves state-of-the-art performance on advanced models, reducing object hallucination rates by 12.67% on MMHal-Bench and improving accuracy by 5.8% on POPE.

[218] CoV: Chain-of-View Prompting for Spatial Reasoning

Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang

Main category: cs.CV

TL;DR: Chain-of-View (CoV) prompting enables VLMs to actively explore 3D environments for embodied question answering by selecting and adjusting views through iterative reasoning.

Details

Motivation: Current VLMs are limited by fixed input views, which restricts their ability to gather distributed context and perform complex spatial reasoning in 3D environments for embodied question answering.

Method: CoV prompting transforms VLMs into active viewpoint reasoners through a coarse-to-fine exploration process: 1) View Selection agent filters redundant frames and identifies anchor views, 2) Fine-grained view adjustment via iterative reasoning with discrete camera actions, obtaining new observations from 3D scene representations.

Result: +11.56% average improvement in LLM-Match on OpenEQA across four VLMs (max +13.62% on Qwen3-VL-Flash); test-time scaling yields additional +2.51% improvement with increased action budget; strong performance on ScanQA (116 CIDEr/31.9 EM@1) and SQA3D (51.1 EM@1).

Conclusion: Question-aligned view selection with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D embodied question answering without additional training.

Abstract: Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision–language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.

[219] VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice

Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, Ernie Chang, Saksham Suri, Hongyu Xu, Qi Qian, Wei Wen, Balakrishnan Varadarajan, Zhuang Liu, Hu Xu, Florian Bordes, Raghuraman Krishnamoorthi, Bernard Ghanem, Vikas Chandra, Yunyang Xiong

Main category: cs.CV

TL;DR: VideoAuto-R1 is a video understanding framework that uses a “reason-when-necessary” strategy, where models only perform chain-of-thought reasoning when confidence in initial answers is low, achieving state-of-the-art accuracy with 3.3x efficiency gains.

Details

Motivation: The paper challenges the assumption that chain-of-thought reasoning is always necessary for video understanding. The authors found that direct answering often matches or surpasses CoT performance for RL-trained video models, despite CoT's higher computational cost. This motivates developing a more efficient approach that only uses reasoning when truly needed.

Method: VideoAuto-R1 uses a “Thinking Once, Answering Twice” paradigm during training: 1) model generates initial answer, 2) performs reasoning, 3) outputs reviewed answer (both answers supervised via verifiable rewards). During inference, the model uses confidence scores of initial answers to determine whether to proceed with reasoning - only activating thinking mode when confidence is low.

Result: Achieves state-of-the-art accuracy on video QA and grounding benchmarks with significantly improved efficiency: reduces average response length by ~3.3x (from 149 to 44 tokens). Shows low thinking-mode activation on perception tasks but higher activation on reasoning-intensive tasks, suggesting explicit reasoning is beneficial but not always necessary.

Conclusion: Explicit language-based reasoning is generally beneficial for video understanding but not always necessary. The proposed VideoAuto-R1 framework demonstrates that selective reasoning based on confidence scores can achieve state-of-the-art performance while dramatically improving efficiency, making video understanding models more practical for real-world applications.

Abstract: Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. However, its necessity and advantages over direct answering remain underexplored. In this paper, we first demonstrate that for RL-trained video models, direct answering often matches or even surpasses CoT performance, despite CoT producing step-by-step analyses at a higher computational cost. Motivated by this, we propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy. During training, our approach follows a Thinking Once, Answering Twice paradigm: the model first generates an initial answer, then performs reasoning, and finally outputs a reviewed answer. Both answers are supervised via verifiable rewards. During inference, the model uses the confidence score of the initial answer to determine whether to proceed with reasoning. Across video QA and grounding benchmarks, VideoAuto-R1 achieves state-of-the-art accuracy with significantly improved efficiency, reducing the average response length by ~3.3x, e.g., from 149 to just 44 tokens. Moreover, we observe a low rate of thinking-mode activation on perception-oriented tasks, but a higher rate on reasoning-intensive tasks. This suggests that explicit language-based reasoning is generally beneficial but not always necessary.

[220] Cutting AI Research Costs: How Task-Aware Compression Makes Large Language Model Agents Affordable

Zuhair Ahmed Khan Taha, Mohammed Mudassir Uddin, Shahnawaz Alam

Main category: cs.CV

TL;DR: AgentCompress reduces LLM computational costs by 68.3% while maintaining 96.2% success rate by routing tasks to appropriately compressed model variants based on difficulty assessment.

Details

Motivation: High computational costs of large language models (up to $127 per session for 70B parameter models) make them inaccessible to many academic labs, limiting their use for autonomous research tasks like literature review and hypothesis generation.

Method: Uses a small neural network to assess task difficulty from opening words, then routes tasks to suitably compressed model variants in under a millisecond. Differentiates between complex tasks (like hypothesis generation) and simpler ones (like bibliography reformatting).

Result: Tested across 500 research workflows in four scientific fields, achieving 68.3% reduction in compute costs while maintaining 96.2% of original success rate.

Conclusion: AgentCompress makes LLM-powered research tools financially accessible to academic labs by significantly reducing computational costs without substantial performance loss, enabling more labs to conduct experiments rather than being sidelined by budget constraints.

Abstract: When researchers deploy large language models for autonomous tasks like reviewing literature or generating hypotheses, the computational bills add up quickly. A single research session using a 70-billion parameter model can cost around $127 in cloud fees, putting these tools out of reach for many academic labs. We developed AgentCompress to tackle this problem head-on. The core idea came from a simple observation during our own work: writing a novel hypothesis clearly demands more from the model than reformatting a bibliography. Why should both tasks run at full precision? Our system uses a small neural network to gauge how hard each incoming task will be, based only on its opening words, then routes it to a suitably compressed model variant. The decision happens in under a millisecond. Testing across 500 research workflows in four scientific fields, we cut compute costs by 68.3% while keeping 96.2% of the original success rate. For labs watching their budgets, this could mean the difference between running experiments and sitting on the sidelines

[221] Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

William Rudman, Michal Golovanevsky, Dana Arad, Yonatan Belinkov, Ritambhara Singh, Carsten Eickhoff, Kyle Mahowald

Main category: cs.CV

TL;DR: VLMs often hallucinate by favoring text over visual evidence. In object-counting tasks, they increasingly conform to incorrect prompts as object counts rise. Researchers identified specific attention heads causing this and reduced hallucinations by 40+% through ablation without retraining.

Details

Motivation: Large vision-language models frequently hallucinate by prioritizing textual prompts over visual evidence, which undermines their reliability. This paper investigates this failure mode systematically to understand and mitigate prompt-induced hallucinations.

Method: Used controlled object-counting experiments where prompts overstate object counts. Conducted mechanistic analysis of three VLMs to identify specific attention heads responsible for prompt-induced hallucinations. Applied ablation to these heads without additional training.

Result: Found that as object counts increase, models increasingly conform to incorrect prompts. Identified a small set of attention heads whose ablation reduces prompt-induced hallucinations by at least 40%. Discovered model-specific differences in how these heads mediate prompt copying.

Conclusion: Prompt-induced hallucinations in VLMs are mediated by specific attention heads that can be targeted through ablation. The findings provide mechanistic insights into how VLMs prioritize text over vision and offer a training-free intervention to reduce hallucinations.

Abstract: Large vision-language models (VLMs) are highly capable, yet often hallucinate by favoring textual prompts over visual evidence. We study this failure mode in a controlled object-counting setting, where the prompt overstates the number of objects in the image (e.g., asking a model to describe four waterlilies when only three are present). At low object counts, models often correct the overestimation, but as the number of objects increases, they increasingly conform to the prompt regardless of the discrepancy. Through mechanistic analysis of three VLMs, we identify a small set of attention heads whose ablation substantially reduces prompt-induced hallucinations (PIH) by at least 40% without additional training. Across models, PIH-heads mediate prompt copying in model-specific ways. We characterize these differences and show that PIH ablation increases correction toward visual evidence. Our findings offer insights into the internal mechanisms driving prompt-induced hallucinations, revealing model-specific differences in how these behaviors are implemented.

[222] MoE3D: A Mixture-of-Experts Module for 3D Reconstruction

Zichen Wang, Ang Cao, Liam J. Wang, Jeong Joon Park

Main category: cs.CV

TL;DR: MoE3D is a mixture-of-experts module that improves 3D reconstruction by predicting multiple depth maps and fusing them with dynamic weighting to sharpen depth boundaries and reduce artifacts.

Details

Motivation: Existing feed-forward 3D reconstruction models suffer from blurry depth boundaries and flying-point artifacts, which degrade reconstruction quality.

Method: MoE3D predicts multiple candidate depth maps and fuses them using dynamic weighting (mixture-of-experts approach) to sharpen depth boundaries and mitigate artifacts.

Result: When integrated with pre-trained 3D reconstruction backbones like VGGT, MoE3D substantially enhances reconstruction quality while adding minimal computational overhead.

Conclusion: MoE3D effectively improves 3D reconstruction by addressing boundary sharpness and artifact issues through a mixture-of-experts fusion approach.

Abstract: MoE3D is a mixture-of-experts module designed to sharpen depth boundaries and mitigate flying-point artifacts (highlighted in red) of existing feed-forward 3D reconstruction models (left side). MoE3D predicts multiple candidate depth maps and fuses them via dynamic weighting (visualized by MoE weights on the right side). When integrated with a pre-trained 3D reconstruction backbone such as VGGT, it substantially enhances reconstruction quality with minimal additional computational overhead. Best viewed digitally.

[223] FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia

Main category: cs.CV

TL;DR: FlowLet: A conditional generative framework using flow matching in invertible 3D wavelet domain to synthesize age-conditioned 3D MRIs for improving Brain Age Prediction fairness and performance.

Details

Motivation: Existing 3D MRI datasets for Brain Age Prediction are demographically skewed, limiting fairness and generalizability. Current generative methods (latent diffusion models) are slow, may introduce artifacts, and rarely condition on age, affecting BAP performance.

Method: FlowLet uses flow matching within an invertible 3D wavelet domain to synthesize age-conditioned 3D MRIs, avoiding reconstruction artifacts and reducing computational demands compared to latent diffusion models.

Result: FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with FlowLet-generated data improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.

Conclusion: FlowLet provides an effective conditional generative framework for synthesizing age-conditioned 3D MRIs that improves Brain Age Prediction fairness and performance while addressing limitations of existing generative methods.

Abstract: Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual’s biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.

[224] ObjectForesight: Predicting Future 3D Object Trajectories from Human Videos

Rustin Soraki, Homanga Bharadhwaj, Ali Farhadi, Roozbeh Mottaghi

Main category: cs.CV

TL;DR: ObjectForesight is a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric videos, using explicit 3D object representations for geometrically grounded predictions.

Details

Motivation: Humans can effortlessly anticipate object motions through interaction, but computational systems lack similar predictive abilities from passive visual observation. The goal is to enable AI systems to predict plausible future object motions directly from visual input.

Method: ObjectForesight uses explicit 3D object-level representations rather than pixel or latent space approaches. It leverages segmentation, mesh reconstruction, and 3D pose estimation to create a dataset of 2M+ clips with pseudo-ground-truth 3D object trajectories for training.

Result: The model achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes. It establishes a scalable framework for learning physically grounded, object-centric dynamics models directly from observation.

Conclusion: ObjectForesight provides a novel approach to predicting object motions that captures object affordances and trajectories, offering geometrically grounded and temporally coherent predictions through explicit 3D object representations.

Abstract: Humans can effortlessly anticipate how objects might move or change through interaction–imagining a cup being lifted, a knife slicing, or a lid being closed. We aim to endow computational systems with a similar ability to predict plausible future object motions directly from passive visual observation. We introduce ObjectForesight, a 3D object-centric dynamics model that predicts future 6-DoF poses and trajectories of rigid objects from short egocentric video sequences. Unlike conventional world or dynamics models that operate in pixel or latent space, ObjectForesight represents the world explicitly in 3D at the object level, enabling geometrically grounded and temporally coherent predictions that capture object affordances and trajectories. To train such a model at scale, we leverage recent advances in segmentation, mesh reconstruction, and 3D pose estimation to curate a dataset of 2 million plus short clips with pseudo-ground-truth 3D object trajectories. Through extensive experiments, we show that ObjectForesight achieves significant gains in accuracy, geometric consistency, and generalization to unseen objects and scenes, establishing a scalable framework for learning physically grounded, object-centric dynamics models directly from observation. objectforesight.github.io

[225] Plenoptic Video Generation

Xiao Fu, Shitao Tang, Min Shi, Xian Liu, Jinwei Gu, Ming-Yu Liu, Dahua Lin, Chen-Hsuan Lin

Main category: cs.CV

TL;DR: PlenopticDreamer is a framework for multi-view video re-rendering that maintains spatio-temporal consistency in hallucinated regions through synchronized generative hallucinations and camera-guided video retrieval.

Details

Motivation: Existing camera-controlled generative video re-rendering methods struggle with maintaining consistency across multi-view scenarios and ensuring spatio-temporal coherence in hallucinated regions due to generative model stochasticity.

Method: Trains a multi-in-single-out video-conditioned model autoregressively with camera-guided video retrieval to select salient previous generations as conditional inputs. Incorporates progressive context-scaling, self-conditioning for robustness against error accumulation, and long-video conditioning for extended generation.

Result: Achieves state-of-the-art video re-rendering on Basic and Agibot benchmarks, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations including third-person to third-person and robotic manipulation views.

Conclusion: PlenopticDreamer effectively addresses multi-view consistency challenges in video re-rendering through synchronized hallucinations and adaptive conditioning strategies, enabling robust and diverse view transformations.

Abstract: Camera-controlled generative video re-rendering methods, such as ReCamMaster, have achieved remarkable progress. However, despite their success in single-view setting, these works often struggle to maintain consistency across multi-view scenarios. Ensuring spatio-temporal coherence in hallucinated regions remains challenging due to the inherent stochasticity of generative models. To address it, we introduce PlenopticDreamer, a framework that synchronizes generative hallucinations to maintain spatio-temporal memory. The core idea is to train a multi-in-single-out video-conditioned model in an autoregressive manner, aided by a camera-guided video retrieval strategy that adaptively selects salient videos from previous generations as conditional inputs. In addition, Our training incorporates progressive context-scaling to improve convergence, self-conditioning to enhance robustness against long-range visual degradation caused by error accumulation, and a long-video conditioning mechanism to support extended video generation. Extensive experiments on the Basic and Agibot benchmarks demonstrate that PlenopticDreamer achieves state-of-the-art video re-rendering, delivering superior view synchronization, high-fidelity visuals, accurate camera control, and diverse view transformations (e.g., third-person to third-person, and head-view to gripper-view in robotic manipulation). Project page: https://research.nvidia.com/labs/dir/plenopticdreamer/

[226] RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, Jiangmiao Pang

Main category: cs.CV

TL;DR: The paper introduces visual identity prompting for data augmentation in robotics, using exemplar images to guide diffusion models for generating multi-view, temporally coherent manipulation data, leading to improved policy performance.

Details

Motivation: Collecting large-scale real-world manipulation data is difficult due to hardware and setup constraints. Existing text-prompt conditioned diffusion models overlook the need for multi-view and temporally coherent observations, and text prompts alone cannot reliably specify scene setups.

Method: Introduces visual identity prompting, which supplies exemplar images as conditioning inputs to guide diffusion models in generating desired scene setups. Also builds a scalable pipeline to curate a visual identity pool from large robotics datasets.

Result: Using the augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

Conclusion: Visual identity prompting provides effective visual guidance for diffusion models to generate realistic manipulation data, addressing limitations of text-only conditioning and improving robot policy training.

Abstract: The diversity, quantity, and quality of manipulation data are critical for training effective robot policies. However, due to hardware and physical setup constraints, collecting large-scale real-world manipulation data remains difficult to scale across diverse environments. Recent work uses text-prompt conditioned image diffusion models to augment manipulation data by altering the backgrounds and tabletop objects in the visual observations. However, these approaches often overlook the practical need for multi-view and temporally coherent observations required by state-of-the-art policy models. Further, text prompts alone cannot reliably specify the scene setup. To provide the diffusion model with explicit visual guidance, we introduce visual identity prompting, which supplies exemplar images as conditioning inputs to guide the generation of the desired scene setup. To this end, we also build a scalable pipeline to curate a visual identity pool from large robotics datasets. Using our augmented manipulation data to train downstream vision-language-action and visuomotor policy models yields consistent performance gains in both simulation and real-robot settings.

[227] GREx: Generalized Referring Expression Segmentation, Comprehension, and Generation

Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Yu-Gang Jiang

Main category: cs.CV

TL;DR: This paper introduces GREx (Generalized Referring Expression Segmentation/Comprehension/Generation) which extends classic REx to handle multi-target and no-target expressions, not just single-target ones. They create gRefCOCO dataset and propose ReLA baseline method.

Details

Motivation: Existing REx (RES/REC/REG) datasets and methods only support single-target expressions (one expression refers to one object), which limits real-world applications. Real expressions often refer to multiple objects or no objects.

Method: 1) Create gRefCOCO dataset with multi-target, no-target, and single-target expressions. 2) Propose ReLA baseline that adaptively divides images into regions with sub-instance clues and explicitly models region-region and region-language dependencies.

Result: ReLA achieves state-of-the-art results on both GRES and GREC tasks. The gRefCOCO dataset enables studying performance gaps of existing REx methods on generalized tasks.

Conclusion: GREx extends REx to handle arbitrary numbers of objects, making it more practical for real applications. The proposed dataset and method provide foundations for future research in generalized referring expression tasks.

Abstract: Referring Expression Segmentation (RES) and Comprehension (REC) respectively segment and detect the object described by an expression, while Referring Expression Generation (REG) generates an expression for the selected object. Existing datasets and methods commonly support single-target expressions only, i.e., one expression refers to one object, not considering multi-target and no-target expressions. This greatly limits the real applications of REx (RES/REC/REG). This paper introduces three new benchmarks called Generalized Referring Expression Segmentation (GRES), Comprehension (GREC), and Generation (GREG), collectively denoted as GREx, which extend the classic REx to allow expressions to identify an arbitrary number of objects. We construct the first large-scale GREx dataset gRefCOCO that contains multi-target, no-target, and single-target expressions and their corresponding images with labeled targets. GREx and gRefCOCO are designed to be backward-compatible with REx, facilitating extensive experiments to study the performance gap of the existing REx methods on GREx tasks. One of the challenges of GRES/GREC is complex relationship modeling, for which we propose a baseline ReLA that adaptively divides the image into regions with sub-instance clues and explicitly models the region-region and region-language dependencies. The proposed ReLA achieves the state-of-the-art results on the both GRES and GREC tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GREx.

[228] Pixel-Perfect Visual Geometry Estimation

Gangwei Xu, Haotong Lin, Hongcheng Luo, Haiyang Sun, Bing Wang, Guang Chen, Sida Peng, Hangjun Ye, Xin Yang

Main category: cs.CV

TL;DR: Pixel-perfect visual geometry models (PPD/PPVD) use pixel-space diffusion transformers with semantic prompting and cascade architecture to generate high-quality, flying-pixel-free depth maps and point clouds from images/videos.

Details

Motivation: Existing geometry foundation models suffer from flying pixels and loss of fine details, which is problematic for robotics and augmented reality applications requiring clean and accurate geometry.

Method: 1) Pixel-Perfect Depth (PPD): Monocular depth model using pixel-space diffusion transformers (DiT) with Semantics-Prompted DiT (incorporates semantic representations from vision foundation models) and Cascade DiT architecture (progressively increases image tokens). 2) Pixel-Perfect Video Depth (PPVD): Extends PPD with Semantics-Consistent DiT (extracts temporally consistent semantics from multi-view geometry foundation model) and reference-guided token propagation for temporal coherence.

Result: Achieves best performance among all generative monocular and video depth estimation models, producing significantly cleaner point clouds than all other models.

Conclusion: The proposed pixel-perfect visual geometry models successfully address flying pixel issues and preserve fine details through generative modeling in pixel space with semantic prompting and efficient cascade architectures.

Abstract: Recovering clean and accurate geometry from images is essential for robotics and augmented reality. However, existing geometry foundation models still suffer severely from flying pixels and the loss of fine details. In this paper, we present pixel-perfect visual geometry models that can predict high-quality, flying-pixel-free point clouds by leveraging generative modeling in the pixel space. We first introduce Pixel-Perfect Depth (PPD), a monocular depth foundation model built upon pixel-space diffusion transformers (DiT). To address the high computational complexity associated with pixel-space diffusion, we propose two key designs: 1) Semantics-Prompted DiT, which incorporates semantic representations from vision foundation models to prompt the diffusion process, preserving global semantics while enhancing fine-grained visual details; and 2) Cascade DiT architecture that progressively increases the number of image tokens, improving both efficiency and accuracy. To further extend PPD to video (PPVD), we introduce a new Semantics-Consistent DiT, which extracts temporally consistent semantics from a multi-view geometry foundation model. We then perform reference-guided token propagation within the DiT to maintain temporal coherence with minimal computational and memory overhead. Our models achieve the best performance among all generative monocular and video depth estimation models and produce significantly cleaner point clouds than all other models.

[229] RL-AWB: Deep Reinforcement Learning for Auto White Balance Correction in Low-Light Night-time Scenes

Yuan-Kang Lee, Kuan-Lin Chen, Chia-Che Chang, Yu-Lun Liu

Main category: cs.CV

TL;DR: RL-AWB combines statistical methods with deep reinforcement learning for nighttime white balance, achieving superior generalization across lighting conditions.

Details

Motivation: Nighttime color constancy is challenging due to low-light noise and complex illumination conditions, requiring better solutions than existing approaches.

Method: Combines statistical algorithm (salient gray pixel detection + novel illumination estimation) with deep reinforcement learning that uses the statistical method as its core, mimicking professional AWB tuning experts to dynamically optimize parameters per image.

Result: Achieves superior generalization capability across both low-light and well-illuminated images, validated on the first multi-sensor nighttime dataset they introduced.

Conclusion: RL-AWB presents an effective framework for nighttime white balance that combines statistical and learning-based approaches, with strong cross-sensor generalization performance.

Abstract: Nighttime color constancy remains a challenging problem in computational photography due to low-light noise and complex illumination conditions. We present RL-AWB, a novel framework combining statistical methods with deep reinforcement learning for nighttime white balance. Our method begins with a statistical algorithm tailored for nighttime scenes, integrating salient gray pixel detection with novel illumination estimation. Building on this foundation, we develop the first deep reinforcement learning approach for color constancy that leverages the statistical algorithm as its core, mimicking professional AWB tuning experts by dynamically optimizing parameters for each image. To facilitate cross-sensor evaluation, we introduce the first multi-sensor nighttime dataset. Experiment results demonstrate that our method achieves superior generalization capability across low-light and well-illuminated images. Project page: https://ntuneillee.github.io/research/rl-awb/

[230] QNeRF: Neural Radiance Fields on a Simulated Gate-Based Quantum Computer

Daniele Lizzio Bosco, Shuteng Wang, Giuseppe Serra, Vladislav Golyanik

Main category: cs.CV

TL;DR: QNeRF is the first hybrid quantum-classical model for novel-view synthesis that uses parameterized quantum circuits to encode spatial and view-dependent information, achieving comparable performance to classical NeRF with less than half the parameters.

Details

Motivation: While Neural Radiance Fields (NeRFs) have advanced novel-view synthesis, they suffer from large model sizes and intensive training requirements. Quantum Visual Fields (QVFs) have shown promise in model compactness and convergence speed, suggesting quantum approaches could address NeRF's limitations for 3D representation learning from 2D images.

Method: QNeRF introduces two architectural variants: 1) Full QNeRF that maximizes quantum amplitude usage for enhanced representation, and 2) Dual-Branch QNeRF with task-informed inductive bias that branches spatial and view-dependent quantum state preparations to reduce complexity and improve scalability/hardware compatibility.

Result: When trained on moderate-resolution images, QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters, demonstrating quantum superposition and entanglement can create more compact models for 3D scene representation.

Conclusion: Quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level computer vision tasks like 3D representation learning from 2D observations, offering model compactness advantages over classical approaches.

Abstract: Recently, Quantum Visual Fields (QVFs) have shown promising improvements in model compactness and convergence speed for learning the provided 2D or 3D signals. Meanwhile, novel-view synthesis has seen major advances with Neural Radiance Fields (NeRFs), where models learn a compact representation from 2D images to render 3D scenes, albeit at the cost of larger models and intensive training. In this work, we extend the approach of QVFs by introducing QNeRF, the first hybrid quantum-classical model designed for novel-view synthesis from 2D images. QNeRF leverages parameterised quantum circuits to encode spatial and view-dependent information via quantum superposition and entanglement, resulting in more compact models compared to the classical counterpart. We present two architectural variants. Full QNeRF maximally exploits all quantum amplitudes to enhance representational capabilities. In contrast, Dual-Branch QNeRF introduces a task-informed inductive bias by branching spatial and view-dependent quantum state preparations, drastically reducing the complexity of this operation and ensuring scalability and potential hardware compatibility. Our experiments demonstrate that – when trained on images of moderate resolution – QNeRF matches or outperforms classical NeRF baselines while using less than half the number of parameters. These results suggest that quantum machine learning can serve as a competitive alternative for continuous signal representation in mid-level tasks in computer vision, such as 3D representation learning from 2D observations.

[231] Mesh4D: 4D Mesh Reconstruction and Tracking from Monocular Video

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, Andrea Vedaldi

Main category: cs.CV

TL;DR: Mesh4D: Feed-forward model for monocular 4D mesh reconstruction from video, using skeletal-guided autoencoder and latent diffusion for one-shot animation prediction.

Details

Motivation: To reconstruct complete 3D shape and motion of dynamic objects from monocular videos, addressing the challenge of representing complex deformations over time.

Method: 1) Autoencoder with skeletal guidance during training to learn compact latent space encoding entire animation sequences. 2) Spatio-temporal attention for stable deformation representation. 3) Latent diffusion model conditioned on input video and first-frame mesh for one-shot animation prediction.

Result: Outperforms prior methods on reconstruction and novel view synthesis benchmarks, achieving accurate 3D shape and deformation recovery without requiring skeletal information at inference.

Conclusion: Mesh4D enables efficient monocular 4D mesh reconstruction through skeletal-guided latent learning and diffusion-based animation prediction, demonstrating state-of-the-art performance.

Abstract: We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object’s complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object’s overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.

[232] Controllable Generation with Text-to-Image Diffusion Models: A Survey

Pu Cao, Feng Zhou, Qing Song, Lu Yang

Main category: cs.CV

TL;DR: Survey paper reviewing controllable generation techniques for text-to-image diffusion models, covering theoretical foundations, practical methods, and organizing approaches by condition types.

Details

Motivation: Text-only conditioning in diffusion models is insufficient for diverse application requirements, creating a need for methods to control pre-trained T2I models with additional conditions beyond text.

Method: Comprehensive literature review organized by condition types: generation with specific conditions, multiple conditions, and universal controllable generation. Includes theoretical analysis of how novel conditions are introduced into the denoising process.

Result: Systematic categorization of controllable generation research with theoretical foundations and practical advancements, accompanied by a curated repository of surveyed literature.

Conclusion: Controllable generation with T2I diffusion models is an important research direction that addresses limitations of text-only conditioning, with diverse approaches emerging for different condition types and applications.

Abstract: In the rapidly advancing realm of visual generation, diffusion models have revolutionized the landscape, marking a significant shift in capabilities with their impressive text-guided generative functions. However, relying solely on text for conditioning these models does not fully cater to the varied and complex requirements of different applications and scenarios. Acknowledging this shortfall, a variety of studies aim to control pre-trained text-to-image (T2I) models to support novel conditions. In this survey, we undertake a thorough review of the literature on controllable generation with T2I diffusion models, covering both the theoretical foundations and practical advancements in this domain. Our review begins with a brief introduction to the basics of denoising diffusion probabilistic models (DDPMs) and widely used T2I diffusion models. We then reveal the controlling mechanisms of diffusion models, theoretically analyzing how novel conditions are introduced into the denoising process for conditional generation. Additionally, we offer a detailed overview of research in this area, organizing it into distinct categories from the condition perspective: generation with specific conditions, generation with multiple conditions, and universal controllable generation. For an exhaustive list of the controllable generation literature surveyed, please refer to our curated repository at https://github.com/PRIV-Creation/Awesome-Controllable-T2I-Diffusion-Models.

[233] BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities

Boris Meden, Asma Brazi, Fabrice Mayran de Chamisso, Steve Bourgeois, Vincent Lepetit

Main category: cs.CV

TL;DR: The paper proposes a new evaluation framework for 6D pose estimation that accounts for viewpoint-dependent visual ambiguities, re-annotates datasets with image-specific pose distributions, re-evaluates single-pose methods, and creates the first benchmark for pose distribution methods on real images.

Details

Motivation: Current 6D pose estimation methods are benchmarked on datasets that only consider global object symmetries for ground truth annotations, ignoring viewpoint-dependent visual ambiguities that occur when symmetry-breaking elements are occluded. This leads to inaccurate evaluation of methods.

Method: 1) Proposes automatic method to re-annotate datasets with 6D pose distribution specific to each image, considering object surface visibility to determine visual ambiguities. 2) Re-evaluates state-of-the-art single pose methods using improved ground truth. 3) Derives precision/recall formulation to evaluate pose distribution methods against image-wise distribution ground truth.

Result: The re-annotation and re-evaluation significantly modifies the ranking of state-of-the-art single pose methods. The paper creates the first benchmark for evaluating pose distribution methods on real images using the proposed precision/recall formulation.

Conclusion: The paper addresses a critical limitation in current 6D pose evaluation by introducing viewpoint-aware ground truth annotations and evaluation metrics, providing more accurate benchmarking for both single-pose and pose-distribution methods.

Abstract: 6D pose estimation aims at determining the object pose that best explains the camera observation. The unique solution for non-ambiguous objects can turn into a multi-modal pose distribution for symmetrical objects or when occlusions of symmetry-breaking elements happen, depending on the viewpoint. Currently, 6D pose estimation methods are benchmarked on datasets that consider, for their ground truth annotations, visual ambiguities as only related to global object symmetries, whereas they should be defined per-image to account for the camera viewpoint. We thus first propose an automatic method to re-annotate those datasets with a 6D pose distribution specific to each image, taking into account the object surface visibility in the image to correctly determine the visual ambiguities. Second, given this improved ground truth, we re-evaluate the state-of-the-art single pose methods and show that this greatly modifies the ranking of these methods. Third, as some recent works focus on estimating the complete set of solutions, we derive a precision/recall formulation to evaluate them against our image-wise distribution ground truth, making it the first benchmark for pose distribution methods on real images.

[234] Beyond Fixed Topologies: Unregistered Training and Comprehensive Evaluation Metrics for 3D Talking Heads

Federico Nocentini, Thomas Besnier, Claudio Ferrari, Sylvain Arguillere, Mohamed Daoudi, Stefano Berretti

Main category: cs.CV

TL;DR: First framework for speech-driven 3D talking heads that works with arbitrary mesh topologies including real scanned data, using heat diffusion for topology-robust features.

Details

Motivation: Previous methods assume fixed mesh structures with point-wise correspondence, but real-world applications need to handle varying mesh topologies where no such correspondence exists across meshes.

Method: Uses heat diffusion to predict features robust to mesh topology. Explores two training settings: registered (fixed topology during training but arbitrary at test) and fully unregistered (varying topologies during training). Also proposes new lip-syncing evaluation metrics.

Result: Approach performs favorably compared to fixed topology techniques, sets new benchmark for versatile 3D talking heads without topology constraints. Code and pre-trained model are available.

Conclusion: Presents first framework capable of animating 3D faces with arbitrary topologies, offering versatile and high-fidelity solution while addressing limitations of current evaluation metrics.

Abstract: Generating speech-driven 3D talking heads presents numerous challenges; among those is dealing with varying mesh topologies where no point-wise correspondence exists across the meshes the model can animate. While previous literature works assume fixed mesh structures, in this work we present the first framework capable of animating 3D faces in arbitrary topologies, including real scanned data. Our approach leverages heat diffusion to predict features that are robust to the mesh topology. We explore two training settings: a registered one, in which meshes in a training sequences share a fixed topology but any mesh can be animated at test time, and an fully unregistered one, which allows effective training with varying mesh structures. Additionally, we highlight the limitations of current evaluation metrics and propose new metrics for better lip-syncing evaluation. An extensive evaluation shows our approach performs favorably compared to fixed topology techniques, setting a new benchmark by offering a versatile and high-fidelity solution for 3D talking heads where the topology constraint is dropped. The code along with the pre-trained model are available.

[235] Explainable Binary Classification of Separable Shape Ensembles

Zachary Grey, Nicholas Fisher, Andrew Glaws

Main category: cs.CV

TL;DR: Novel pattern recognition formalisms for analyzing large ensembles of segmented curves from images, using separable shape tensors for explainable classification and discrepancy detection without labeled data.

Details

Motivation: Scientists and engineers need to analyze large ensembles of segmented curves from images to extract patterns and detect important changes, but existing methods lack explainable, efficient approaches for comparing thousands of curves without labeled data.

Method: Approximating eigenspaces of composite integral operators to create discrete dual representations of curves at quadrature nodes, projecting onto matrix manifolds to obtain separable shape tensors that decompose curves into linear scale variations and nonlinear undulations.

Result: Demonstrated explainable binary classification using product maximum mean discrepancy on thousands of segmented curves, building interpretable feature spaces in seconds without high-performance computation, and detecting discrepancies below visual inspection thresholds.

Conclusion: The proposed formalism enables efficient, explainable analysis of large curve ensembles from images, providing interpretable shape decompositions and discrepancy detection capabilities that outperform visual inspection, with applications across scientific and engineering domains.

Abstract: Scientists, engineers, biologists, and technology specialists universally leverage image segmentation to extract shape ensembles containing many thousands of curves representing patterns in observations and measurements. These large curve ensembles facilitate inferences about important changes when comparing and contrasting images. We introduce novel pattern recognition formalisms combined with inference methods over large ensembles of segmented curves. Our formalism involves accurately approximating eigenspaces of composite integral operators to motivate discrete, dual representations of curves collocated at quadrature nodes. Approximations are projected onto underlying matrix manifolds and the resulting separable shape tensors constitute rigid-invariant decompositions of curves into generalized (linear) scale variations and complementary (nonlinear) undulations. With thousands of curves segmented from pairs of images, we demonstrate how data-driven features of separable shape tensors inform explainable binary classification utilizing a product maximum mean discrepancy; absent labeled data, building interpretable feature spaces in seconds without high performance computation, and detecting discrepancies below cursory visual inspections.

[236] SynDroneVision: A Synthetic Dataset for Image-Based Drone Detection

Tamara R. Lenhard, Andreas Weinmann, Kai Franke, Tobias Koch

Main category: cs.CV

TL;DR: SynDroneVision is a synthetic drone detection dataset created using game engine simulations to address limited real-world training data, showing improved model performance when used for data enrichment.

Details

Motivation: Limited availability of large-scale annotated drone detection data and high costs of real-world data collection motivate the creation of synthetic alternatives to overcome training data constraints.

Method: Created SynDroneVision synthetic dataset using game engine-based simulations, featuring diverse backgrounds, lighting conditions, and drone models. Evaluated effectiveness through comparative analysis of recent YOLO detection models.

Result: SynDroneVision demonstrates value as a resource for real-world data enrichment, achieving notable enhancements in model performance and robustness while significantly reducing time and costs of real-world data acquisition.

Conclusion: Synthetic data generation via game engines provides a promising, cost-effective solution for drone detection systems, with SynDroneVision offering a comprehensive training foundation that will be publicly released.

Abstract: Developing robust drone detection systems is often constrained by the limited availability of large-scale annotated training data and the high costs associated with real-world data collection. However, leveraging synthetic data generated via game engine-based simulations provides a promising and cost-effective solution to overcome this issue. Therefore, we present SynDroneVision, a synthetic dataset specifically designed for RGB-based drone detection in surveillance applications. Featuring diverse backgrounds, lighting conditions, and drone models, SynDroneVision offers a comprehensive training foundation for deep learning algorithms. To evaluate the dataset’s effectiveness, we perform a comparative analysis across a selection of recent YOLO detection models. Our findings demonstrate that SynDroneVision is a valuable resource for real-world data enrichment, achieving notable enhancements in model performance and robustness, while significantly reducing the time and costs of real-world data acquisition. SynDroneVision will be publicly released upon paper acceptance.

[237] Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations?

Yifan Zhang, Junhui Hou

Main category: cs.CV

TL;DR: CMCR is a cross-modal comprehensive representation learning framework that improves 3D representation by better integrating both modality-shared and modality-specific features, addressing limitations of current contrastive methods.

Details

Motivation: Existing cross-modal contrastive distillation methods focus primarily on modality-shared features while neglecting modality-specific features during pre-training, leading to suboptimal 3D representations.

Method: Introduces masked image modeling and occupancy estimation tasks to learn comprehensive modality-specific features, proposes a novel multi-modal unified codebook for shared embedding space, and adds geometry-enhanced masked image modeling to boost 3D representation learning.

Result: Extensive experiments show the method mitigates challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks.

Conclusion: CMCR provides a more comprehensive approach to 3D representation learning by effectively integrating both modality-shared and modality-specific features, advancing cross-modal representation learning.

Abstract: Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. However, existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process, which leads to suboptimal representations. In this paper, we theoretically analyze the limitations of current contrastive methods for 3D representation learning and propose a new framework, namely CMCR (Cross-Modal Comprehensive Representation Learning), to address these shortcomings. Our approach improves upon traditional methods by better integrating both modality-shared and modality-specific features. Specifically, we introduce masked image modeling and occupancy estimation tasks to guide the network in learning more comprehensive modality-specific features. Furthermore, we propose a novel multi-modal unified codebook that learns an embedding space shared across different modalities. Besides, we introduce geometry-enhanced masked image modeling to further boost 3D representation learning. Extensive experiments demonstrate that our method mitigates the challenges faced by traditional approaches and consistently outperforms existing image-to-LiDAR contrastive distillation methods in downstream tasks. Code will be available at https://github.com/Eaphan/CMCR.

[238] Spontaneous emergence of linguistic statistical laws in images via artificial neural networks

Ping-Rui Tsai, Chi-hsiang Wang, Yu-Cheng Liao, Hong-Yue Huang, Tzay-Ming Hong

Main category: cs.CV

TL;DR: Images exhibit statistical regularities similar to language (Zipf’s, Heaps’, Benford’s laws) when processed through neural networks, suggesting structured representations emerge naturally from perceptual processing without explicit symbols.

Details

Motivation: To investigate whether images follow statistical regularities similar to linguistic systems, given that visual input constitutes 60% of human sensory experience and guided by symbol-grounding theory which suggests meaningful symbols originate from perception.

Method: Treat images as vision-centric artifacts, employ pre-trained neural networks to model visual processing, detect kernel activations, extract pixels to obtain text-like units, then analyze these representations for statistical patterns.

Result: Image-derived representations adhere to statistical laws (Zipf’s, Heaps’, Benford’s laws) analogous to linguistic data. These regularities emerge spontaneously without explicit symbols or hybrid architectures.

Conclusion: Connectionist networks can automatically develop structured, quasi-symbolic units through perceptual processing alone, suggesting text- and symbol-like properties naturally emerge from neural networks, providing a novel perspective for interpretation.

Abstract: As a core element of culture, images transform perception into structured representations and undergo evolution similar to natural languages. Given that visual input accounts for 60% of human sensory experience, it is natural to ask whether images follow statistical regularities similar to those in linguistic systems. Guided by symbol-grounding theory, which posits that meaningful symbols originate from perception, we treat images as vision-centric artifacts and employ pre-trained neural networks to model visual processing. By detecting kernel activations and extracting pixels, we obtain text-like units, which reveal that these image-derived representations adhere to statistical laws such as Zipf’s, Heaps’, and Benford’s laws, analogous to linguistic data. Notably, these statistical regularities emerge spontaneously, without the need for explicit symbols or hybrid architectures. Our results indicate that connectionist networks can automatically develop structured, quasi-symbolic units through perceptual processing alone, suggesting that text- and symbol-like properties can naturally emerge from neural networks and providing a novel perspective for interpretation.

[239] Two-Stream Thermal Imaging Fusion for Enhanced Time of Birth Detection in Neonatal Care

Jorge García-Torres, Øyvind Meinich-Bache, Sara Brunner, Siren Rettedal, Vilde Kolstad, Kjersti Engan

Main category: cs.CV

TL;DR: Two-stream fusion system combining image and video analysis achieves accurate Time of Birth detection from thermal recordings with 95.7% precision and 84.8% recall.

Details

Motivation: Accurate Time of Birth documentation is crucial for neonatal resuscitation, but current manual methods are prone to inaccuracies. Around 10% of newborns need breathing assistance and 5% require ventilation, making timely interventions vital.

Method: Novel two-stream fusion system combining static (image) and dynamic (video) streams from thermal recordings to capture richer spatiotemporal birth-related features. Includes a score aggregation module for precise ToB estimation.

Result: System achieves 95.7% precision and 84.8% recall in detecting birth within short video clips. With score aggregation, identifies ToB in 100% of test cases with median absolute error of 2 seconds and absolute mean deviation of 4.5 seconds compared to manual annotations.

Conclusion: The fusion of image and video analysis modalities enhances performance over single-stream approaches, providing robust and precise Time of Birth detection for optimizing neonatal care in delivery rooms and operating theaters.

Abstract: Around 10% of newborns require some help to initiate breathing, and 5% need ventilation assistance. Accurate Time of Birth (ToB) documentation is essential for optimizing neonatal care, as timely interventions are vital for proper resuscitation. However, current clinical methods for recording ToB often rely on manual processes, which can be prone to inaccuracies. In this study, we present a novel two-stream fusion system that combines the power of image and video analysis to accurately detect the ToB from thermal recordings in the delivery room and operating theater. By integrating static and dynamic streams, our approach captures richer birth-related spatiotemporal features, leading to more robust and precise ToB estimation. We demonstrate that this synergy between data modalities enhances performance over single-stream approaches. Our system achieves 95.7% precision and 84.8% recall in detecting birth within short video clips. Additionally, with the help of a score aggregation module, it successfully identifies ToB in 100% of test cases, with a median absolute error of 2 seconds and an absolute mean deviation of 4.5 seconds compared to manual annotations.

[240] From Dataset to Real-world: General 3D Object Detection via Generalized Cross-domain Few-shot Learning

Shuangzhi Li, Junlong Shen, Lei Ma, Xingyu Li

Main category: cs.CV

TL;DR: A unified framework for cross-domain few-shot 3D object detection that bridges 2D open-set semantics with 3D spatial reasoning to adapt to new domains with limited annotations.

Details

Motivation: LiDAR-based 3D object detection models struggle to generalize to real-world environments due to limited object diversity in existing datasets, creating a need for adaptation to new domains with only few-shot annotations.

Method: 1) Image-guided multi-modal fusion using vision-language models to inject transferable 2D semantic cues into 3D pipeline; 2) Physically-aware box search using LiDAR priors to enhance 2D-to-3D alignment; 3) Contrastive-enhanced prototype learning to encode few-shot instances into discriminative semantic anchors and stabilize representation learning.

Result: Extensive experiments on generalized cross-domain few-shot (GCFS) benchmarks demonstrate the effectiveness and generality of the approach in realistic deployment settings.

Conclusion: The proposed unified framework successfully addresses the GCFS task in 3D object detection by learning stable target semantics under limited supervision through multi-modal fusion and prototype learning, enabling better adaptation to new domains with few-shot annotations.

Abstract: LiDAR-based 3D object detection models often struggle to generalize to real-world environments due to limited object diversity in existing datasets. To tackle it, we introduce the first generalized cross-domain few-shot (GCFS) task in 3D object detection, aiming to adapt a source-pretrained model to both common and novel classes in a new domain with only few-shot annotations. We propose a unified framework that learns stable target semantics under limited supervision by bridging 2D open-set semantics with 3D spatial reasoning. Specifically, an image-guided multi-modal fusion injects transferable 2D semantic cues into the 3D pipeline via vision-language models, while a physically-aware box search enhances 2D-to-3D alignment via LiDAR priors. To capture class-specific semantics from sparse data, we further introduce contrastive-enhanced prototype learning, which encodes few-shot instances into discriminative semantic anchors and stabilizes representation learning. Extensive experiments on GCFS benchmarks demonstrate the effectiveness and generality of our approach in realistic deployment settings.

[241] Boosting HDR Image Reconstruction via Semantic Knowledge Transfer

Tao Hu, Longyao Wu, Wei Dong, Peng Wu, Jinqiu Sun, Xiaogang Xu, Qingsen Yan, Yanning Zhang

Main category: cs.CV

TL;DR: A framework that transfers semantic knowledge from SDR to HDR domain via self-distillation to boost HDR reconstruction from degraded SDR images.

Details

Motivation: HDR reconstruction from degraded SDR images is challenging due to missing content and domain gap between SDR and HDR formats. Existing semantic priors from SDR images don't transfer well to HDR imaging.

Method: Proposes a general framework with: 1) Semantic Priors Guided Reconstruction Model (SPGRM) that uses SDR semantic knowledge for initial HDR reconstruction, 2) Self-distillation mechanism to align color/content with semantic knowledge, and 3) Semantic Knowledge Alignment Module (SKAM) to transfer internal feature semantic knowledge using complementary masks.

Result: Extensive experiments show the framework significantly improves HDR imaging quality for existing methods without changing network architecture.

Conclusion: The proposed self-distillation framework effectively transfers semantic knowledge from SDR to HDR domain, addressing the domain gap problem and improving HDR reconstruction from degraded SDR images.

Abstract: Recovering High Dynamic Range (HDR) images from multiple Standard Dynamic Range (SDR) images become challenging when the SDR images exhibit noticeable degradation and missing content. Leveraging scene-specific semantic priors offers a promising solution for restoring heavily degraded regions. However, these priors are typically extracted from sRGB SDR images, the domain/format gap poses a significant challenge when applying it to HDR imaging. To address this issue, we propose a general framework that transfers semantic knowledge derived from SDR domain via self-distillation to boost existing HDR reconstruction. Specifically, the proposed framework first introduces the Semantic Priors Guided Reconstruction Model (SPGRM), which leverages SDR image semantic knowledge to address ill-posed problems in the initial HDR reconstruction results. Subsequently, we leverage a self-distillation mechanism that constrains the color and content information with semantic knowledge, aligning the external outputs between the baseline and SPGRM. Furthermore, to transfer the semantic knowledge of the internal features, we utilize a Semantic Knowledge Alignment Module (SKAM) to fill the missing semantic contents with the complementary masks. Extensive experiments demonstrate that our framework significantly boosts HDR imaging quality for existing methods without altering the network architecture.

Carlos Plou, Cesar Borja, Ruben Martinez-Cantin, Ana C. Murillo

Main category: cs.CV

TL;DR: FALCONEye is a training-free video agent combining VLM and LLM that uses exploration-based search with calibrated confidence to answer open-ended questions in hour-long videos, outperforming open-source 7B VLMs and comparable agents.

Details

Motivation: Current Vision Language Models struggle with hour-long videos because encoding visual content exceeds available context windows, making it challenging to find specific information in lengthy video content.

Method: FALCONEye uses a model-agnostic meta-architecture combining a VLM and LLM with exploration-based search guided by calibrated confidence from VLM answers. It also introduces FALCON-Bench benchmark for video answer search requiring both answers and temporal windows.

Result: With just a 7B VLM and lightweight LLM, FALCONEye outperforms all open-source 7B VLMs and comparable agents on FALCON-Bench. It also surpasses GPT-4o on single-detail tasks in MLVU benchmark while reducing inference cost by roughly 10x.

Conclusion: FALCONEye demonstrates effective video information retrieval for hour-long content through its exploration-based approach, achieving strong performance with minimal computational resources while generalizing well to different video tasks.

Abstract: Finding information in hour-long videos is a challenging task even for top-performing Vision Language Models (VLMs), as encoding visual content quickly exceeds available context windows. To tackle this challenge, we present FALCONEye, a novel video agent based on a training-free, model-agnostic meta-architecture composed of a VLM and a Large Language Model (LLM). FALCONEye answers open-ended questions using an exploration-based search algorithm guided by calibrated confidence from the VLM’s answers. We also introduce the FALCON-Bench benchmark, extending Question Answering problem to Video Answer Search-requiring models to return both the answer and its supporting temporal window for open-ended questions in hour-long videos. With just a 7B VLM and a lightweight LLM, FALCONEye outscores all open-source 7B VLMs and comparable agents in FALCON-Bench. It further demonstrates its generalization capability in MLVU benchmark with shorter videos and different tasks, surpassing GPT-4o on single-detail tasks while slashing inference cost by roughly an order of magnitude.

[243] Single Image Reflection Separation via Dual Prior Interaction Transformer

Yue Huang, Zi’ang Li, Tianle Hu, Jie Wen, Guanbin Li, Jinglin Zhang, Guoxu Zhou, Xiaozhao Fang

Main category: cs.CV

TL;DR: A new method for single image reflection separation that introduces transmission prior modeling through lightweight generation and dual-prior fusion, achieving state-of-the-art performance.

Details

Motivation: Existing methods fail to effectively model and utilize transmission priors (the most direct task-specific prior for target transmission layers), limiting performance in complex scenarios. Transmission priors haven't been properly leveraged despite being crucial for the task.

Method: Proposes a dual-prior interaction framework with two key components: 1) Local Linear Correction Network (LLCN) that finetunes pre-trained models using physical constraint T=SI+B to generate high-quality transmission priors with minimal parameters, and 2) Dual-Prior Interaction Transformer (DPIT) with dual-stream channel reorganization attention mechanism that deeply fuses general and transmission priors by reorganizing features for attention computation.

Result: Experimental results on multiple benchmark datasets demonstrate state-of-the-art performance in single image reflection separation.

Conclusion: The proposed method effectively addresses the limitation of underutilized transmission priors through lightweight generation and deep fusion, achieving superior performance by fully exploiting complementary information from both general and transmission priors.

Abstract: Single image reflection separation aims to separate the transmission and reflection layers from a mixed image. Existing methods typically combine general priors from pre-trained models with task-specific priors such as text prompts and reflection detection. However, the transmission prior, as the most direct task-specific prior for the target transmission layer, has not been effectively modeled or fully utilized, limiting performance in complex scenarios. To address this issue, we propose a dual-prior interaction framework based on lightweight transmission prior generation and effective prior fusion. First, we design a Local Linear Correction Network (LLCN) that finetunes pre-trained models based on the physical constraint T=SI+B, where S and B represent pixel-wise and channel-wise scaling and bias transformations. LLCN efficiently generates high-quality transmission priors with minimal parameters. Second, we construct a Dual-Prior Interaction Transformer (DPIT) that employs a dual-stream channel reorganization attention mechanism. By reorganizing features from general and transmission priors for attention computation, DPIT achieves deep fusion of both priors, fully exploiting their complementary information. Experimental results on multiple benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance.

[244] Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better

Ruojing Li, Wei An, Yingqian Wang, Xinyi Ying, Yimian Dai, Longguang Wang, Miao Li, Yulan Guo, Li Liu

Main category: cs.CV

TL;DR: DeepPro proposes a novel infrared small target detection method that treats the task as 1D signal anomaly detection using temporal profile information instead of spatial features, achieving superior performance with high efficiency.

Details

Motivation: Current learning-based IRST detection methods use spatial and short-term temporal information but suffer from unreliable performance under complex conditions and computational redundancy. The authors explore more essential information from a crucial domain - the temporal profile - which theoretically shows superiority in distinguishing target signals from interference.

Method: The authors remodel IRST detection as a 1D signal anomaly detection task and propose DeepPro (deep temporal probe network) that only performs calculations in the time dimension. They first built a prediction attribution tool to verify the importance of temporal profile information, then designed an efficient network that operates exclusively on temporal profiles.

Result: DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency. It achieves significant improvement on dim targets and in complex scenarios, demonstrating superior performance.

Conclusion: The work provides a new modeling domain (temporal profile instead of spatial), new insight (1D signal anomaly detection), new method (DeepPro), and new performance benchmark. This approach can promote the development of IRST detection by focusing on more essential temporal information.

Abstract: Infrared small target (IRST) detection is challenging in simultaneously achieving precise, robust, and efficient performance due to extremely dim targets and strong interference. Current learning-based methods attempt to leverage more" information from both the spatial and the short-term temporal domains, but suffer from unreliable performance under complex conditions while incurring computational redundancy. In this paper, we explore the more essential" information from a more crucial domain for the detection. Through theoretical analysis, we reveal that the global temporal saliency and correlation information in the temporal profile demonstrate significant superiority in distinguishing target signals from other signals. To investigate whether such superiority is preferentially leveraged by well-trained networks, we built the first prediction attribution tool in this field and verified the importance of the temporal profile information. Inspired by the above conclusions, we remodel the IRST detection task as a one-dimensional signal anomaly detection task, and propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection. We conducted extensive experiments to fully validate the effectiveness of our method. The experimental results are exciting, as our DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency, and achieves a significant improvement on dim targets and in complex scenarios. We provide a new modeling domain, a new insight, a new method, and a new performance, which can promote the development of IRST detection. Codes are available at https://github.com/TinaLRJ/DeepPro.

[245] CaTFormer: Causal Temporal Transformer with Dynamic Contextual Fusion for Driving Intention Prediction

Sirui Wang, Zhou Guan, Bingxi Zhao, Tongjia Gu, Jie Liu

Main category: cs.CV

TL;DR: CaTFormer is a causal Temporal Transformer that models driver-environment interactions for robust intention prediction, achieving SOTA on Brain4Cars dataset.

Details

Motivation: Current approaches fail to accurately model complex spatiotemporal interdependencies and unpredictable variability of human driving behavior, which is crucial for safety and interactive efficiency in human-machine co-driving systems and high-level autonomous driving.

Method: CaTFormer introduces three key components: 1) Reciprocal Delayed Fusion (RDF) for precise temporal alignment of interior (driver) and exterior (environment) features, 2) Counterfactual Residual Encoding (CRE) to eliminate spurious correlations and reveal authentic causal dependencies, and 3) Feature Synthesis Network (FSN) to adaptively synthesize purified representations into coherent temporal representations.

Result: CaTFormer achieves state-of-the-art performance on the Brain4Cars dataset, effectively capturing complex causal temporal dependencies and enhancing both accuracy and transparency of driving intention prediction.

Conclusion: The proposed CaTFormer framework successfully addresses the limitations of existing approaches by explicitly modeling causal interactions between driver behavior and environmental context, leading to more robust and transparent driving intention prediction for human-machine co-driving systems.

Abstract: Accurate prediction of driving intention is key to enhancing the safety and interactive efficiency of human-machine co-driving systems. It serves as a cornerstone for achieving high-level autonomous driving. However, current approaches remain inadequate for accurately modeling the complex spatiotemporal interdependencies and the unpredictable variability of human driving behavior. To address these challenges, we propose CaTFormer, a causal Temporal Transformer that explicitly models causal interactions between driver behavior and environmental context for robust intention prediction. Specifically, CaTFormer introduces a novel Reciprocal Delayed Fusion (RDF) mechanism for precise temporal alignment of interior and exterior feature streams, a Counterfactual Residual Encoding (CRE) module that systematically eliminates spurious correlations to reveal authentic causal dependencies, and an innovative Feature Synthesis Network (FSN) that adaptively synthesizes these purified representations into coherent temporal representations. Experimental results demonstrate that CaTFormer attains state-of-the-art performance on the Brain4Cars dataset. It effectively captures complex causal temporal dependencies and enhances both the accuracy and transparency of driving intention prediction.

[246] WeatherDiffusion: Controllable Weather Editing in Intrinsic Space

Yixin Zhu, Zuoliang Zhu, Jian Yang, Miloš Hašan, Jin Xie, Beibei Wang

Main category: cs.CV

TL;DR: WeatherDiffusion: A diffusion-based framework for controllable weather editing using intrinsic maps (material, geometry, lighting) with inverse/forward renderers and CLIP-space interpolation for fine-grained weather control.

Details

Motivation: Traditional pixel-space weather editing lacks controllability. The paper aims to create a more controllable weather editing system using intrinsic representations for better spatial correspondence and decomposition in outdoor scenes.

Method: Two-component diffusion framework: 1) Inverse renderer estimates material properties, scene geometry, and lighting as intrinsic maps from input images; 2) Forward renderer uses these maps with weather text prompts to generate final images. Includes intrinsic map-aware attention mechanism and CLIP-space interpolation for weather control.

Result: Outperforms state-of-the-art pixel-space editing, weather restoration, and rendering-based methods. Introduces two datasets: synthetic (38k images) and real-world (18k images) with intrinsic map annotations under various weather conditions.

Conclusion: WeatherDiffusion shows promise for downstream tasks like autonomous driving by enhancing robustness of detection and segmentation in challenging weather scenarios through controllable weather editing in intrinsic space.

Abstract: We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches. We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.

[247] MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models

Dianyi Wang, Siyuan Wang, Zejun Li, Yikun Wang, Yitong Li, Duyu Tang, Xiaoyu Shen, Xuanjing Huang, Zhongyu Wei

Main category: cs.CV

TL;DR: MoIIE introduces Mixture of Intra- and Inter-Modality Experts to LVLMs, enabling efficient joint learning of modality-specific features and cross-modal interactions with a two-stage training strategy.

Details

Motivation: Large Vision-Language Models (LVLMs) have high computational costs despite strong performance. While sparse Mixture of Experts (MoE) architectures improve parameter efficiency, effectively applying MoE to handle both modality-specific features and cross-modal associations in LVLMs remains challenging.

Method: Proposes MoIIE (Mixture of Intra- and Inter-Modality Experts) where expert routing is guided by token modality. Tokens are directed to intra-modality experts and a shared pool of inter-modality experts. Also introduces a two-stage training strategy to activate both MoE and multi-modal capabilities.

Result: MoIIE models with 5.5B and 11.3B activated parameters match or surpass performance of existing advanced open-source MoE-LLM-based multi-modal models with more activated parameters. Demonstrates effectiveness, efficiency, and generality across different data scales and LLM backbones.

Conclusion: MoIIE provides an effective approach to incorporate MoE into LVLMs, enabling efficient joint learning of intra-modal features and cross-modal interactions while maintaining strong performance with fewer activated parameters.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated remarkable performance across multi-modal tasks by scaling model size and training data. However, these dense LVLMs incur significant computational costs and motivate the exploration of sparse Mixture of Experts (MoE) architectures. While MoE improve parameter efficiency, effectively applying MoE to simultaneously model modality-specific features and cross-modal associations in LVLMs remains challenging. In this work, we propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) to LVLMs. For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts, enabling the model to jointly learn rich intra-modal features and cross-modal interactions. We further introduce an effective and straightforward two-stage training strategy, which facilitates the direct activation of both MoE and multi-modal capabilities. Extensive experiments across different data scales and LLM backbone demonstrate the effectiveness, efficiency and generality of our approach. Notably, our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLMs based multi-modal models that involve more activated parameters. The code is available at https://github.com/AlenjandroWang/MoIIE.

[248] Novel View Synthesis using DDIM Inversion

Sehajdeep Singh, A V Subramanyam, Aditya Gupta, Sahil Gupta

Main category: cs.CV

TL;DR: A novel view synthesis method that uses a lightweight translation U-Net and fusion strategy with pretrained diffusion models to generate high-quality novel views from single images.

Details

Motivation: Existing methods for novel view synthesis from single images are expensive (require fine-tuning large diffusion models or training from scratch) and suffer from blurry reconstruction and poor generalization. There's a need for a lightweight approach that leverages pretrained diffusion models' capabilities.

Method: Uses DDIM-inverted latent of input image, camera pose-conditioned translation U-Net (TUNet) to predict target view latent, proposes novel fusion strategy exploiting DDIM inversion noise correlation to preserve details, then uses fused latent for DDIM sampling with pretrained diffusion model.

Result: Extensive experiments on MVImgNet demonstrate that the method outperforms existing methods in novel view synthesis quality.

Conclusion: The proposed lightweight framework successfully leverages pretrained diffusion models for high-quality novel view synthesis from single images, addressing computational cost and quality issues of existing approaches.

Abstract: Synthesizing novel views from a single input image is a challenging task. It requires extrapolating the 3D structure of a scene while inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods must fine-tune large diffusion backbones using multiple views or train a diffusion model from scratch, which is extremely expensive. Additionally, they suffer from blurry reconstruction and poor generalization. This gap presents the opportunity to explore an explicit lightweight view translation framework that can directly utilize the high-fidelity generative capabilities of a pretrained diffusion model while reconstructing a scene from a novel view. Given the DDIM-inverted latent of a single input image, we employ a camera pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, the image sampled using the predicted latent may result in a blurry reconstruction. To this end, we propose a novel fusion strategy that exploits the inherent noise correlation structure observed in DDIM inversion. The proposed fusion strategy helps preserve the texture and fine-grained details. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.

[249] MVT: Mask-Grounded Vision-Language Models for Taxonomy-Aligned Land-Cover Tagging

Siyi Chen, Kai Wang, Weicong Pang, Ruiming Yang, Ziru Chen, Renjun Gao, Alexis Kai Hon Lau, Dasa Gu, Chenchen Zhang, Cheng Li

Main category: cs.CV

TL;DR: MVT framework for class-agnostic land-cover understanding: domain-adapted SAM2 for boundary-faithful masks, dual-step MLLM fine-tuning for semantic tagging/scene description, and LLM-as-judge evaluation.

Details

Motivation: Need for class-agnostic systems in remote sensing that generalize across datasets while maintaining spatial precision and interpretability, focusing on geometry-first discovery-and-interpretation under domain shift.

Method: Three-stage MVT framework: 1) Domain-adapted SAM2 for boundary-faithful region masks, 2) Dual-step LoRA fine-tuning of multimodal LLMs for mask-grounded semantic tagging and scene description, 3) LLM-as-judge evaluation calibrated by expert ratings.

Result: Domain-adapted SAM2 improves mask quality in cross-dataset transfer (OpenEarthMap to LoveDA). Dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

Conclusion: MVT successfully couples class-agnostic mask evidence with taxonomy-grounded scene interpretation, addressing domain shift while maintaining interpretability in remote sensing land-cover analysis.

Abstract: Land-cover understanding in remote sensing increasingly demands class-agnostic systems that generalize across datasets while remaining spatially precise and interpretable. We study a geometry-first discovery-and-interpretation setting under domain shift, where candidate regions are delineated class-agnostically and supervision avoids lexical class names via anonymized identifiers. Complementary to open-set recognition and open-world learning, we focus on coupling class-agnostic mask evidence with taxonomy-grounded scene interpretation, rather than unknown rejection or continual class expansion. We propose MVT, a three-stage framework that (i) extracts boundary-faithful region masks using SAM2 with domain adaptation, (ii) performs mask-grounded semantic tagging and scene description generation via dual-step LoRA fine-tuning of multimodal LLMs, and (iii) evaluates outputs with LLM-as-judge scoring calibrated by stratified expert ratings. On cross-dataset segmentation transfer (train on OpenEarthMap, evaluate on LoveDA), domain-adapted SAM2 improves mask quality; meanwhile, dual-step MLLM fine-tuning yields more accurate taxonomy-aligned tags and more informative mask-grounded scene descriptions.

[250] OneVision: An End-to-End Generative Framework for Multi-view E-commerce Vision Search

Zexin Zheng, Huangyu Dai, Lingtao Mao, Xinyu Sun, Zihan Liang, Ben Chen, Yuqing Ding, Chenyi Lei, Wenwu Ou, Han Li, Kun Gai

Main category: cs.CV

TL;DR: OneVision is an end-to-end generative framework that replaces traditional multi-stage cascading architecture for vision search, using vision-aligned residual quantization to align multi-view representations and multi-stage semantic alignment for personalized preference generation, achieving better efficiency and conversion metrics.

Details

Motivation: Traditional multi-stage cascading architecture (MCA) for vision search suffers from representation discrepancies between query and product images across different viewpoints, and conflicts between optimization objectives across stages, making it difficult to achieve Pareto optimality in both user experience and conversion.

Method: Proposes OneVision framework with two key components: 1) VRQ (vision-aligned residual quantization encoding) to align different viewpoint representations while preserving product distinctiveness, and 2) multi-stage semantic alignment scheme to maintain visual similarity priors while incorporating user-specific information for personalized preference generation.

Result: Offline: Performs on par with online MCA while improving inference efficiency by 21% through dynamic pruning. Online A/B tests: +2.15% item CTR, +2.27% CVR, and +3.12% order volume improvements.

Conclusion: A semantic ID centric, generative architecture can successfully unify retrieval and personalization while simplifying the serving pathway, overcoming limitations of traditional multi-stage cascading approaches.

Abstract: Traditional vision search, similar to search and recommendation systems, follows the multi-stage cascading architecture (MCA) paradigm to balance efficiency and conversion. Specifically, the query image undergoes feature extraction, recall, pre-ranking, and ranking stages, ultimately presenting the user with semantically similar products that meet their preferences. This multi-view representation discrepancy of the same object in the query and the optimization objective collide across these stages, making it difficult to achieve Pareto optimality in both user experience and conversion. In this paper, an end-to-end generative framework, OneVision, is proposed to address these problems. OneVision builds on VRQ, a vision-aligned residual quantization encoding, which can align the vastly different representations of an object across multiple viewpoints while preserving the distinctive features of each product as much as possible. Then a multi-stage semantic alignment scheme is adopted to maintain strong visual similarity priors while effectively incorporating user-specific information for personalized preference generation. In offline evaluations, OneVision performs on par with online MCA, while improving inference efficiency by 21% through dynamic pruning. In A/B tests, it achieves significant online improvements: +2.15% item CTR, +2.27% CVR, and +3.12% order volume. These results demonstrate that a semantic ID centric, generative architecture can unify retrieval and personalization while simplifying the serving pathway.

[251] Full segmentation annotations of 3D time-lapse microscopy images of MDA231 cells

Aleksandra Melnikova, Petr Matula

Main category: cs.CV

TL;DR: This paper provides comprehensive documentation of the first publicly available 3D time-lapse segmentation annotations for migrating cells with complex dynamic shapes, supplementing previous work with additional dataset details and validation experiments.

Details

Motivation: High-quality segmentation annotations are critical for advancing image processing, but creating volumetric annotations for dynamic cell migration is time-consuming and challenging. The authors previously created the first publicly available 3D time-lapse segmentation annotations but had space limitations in their initial publication.

Method: Three distinct human annotators manually created 3D segmentation annotations for two sequences of MDA231 human breast carcinoma cells from the Cell Tracking Challenge. The paper provides comprehensive dataset description and validation experiments comparing annotations with CTC tracking markers, 2D gold truth, and automatically generated silver truth.

Result: The created annotations are consistent with CTC tracking markers, segmentation accuracy falls within inter-annotator variability margins when compared to 2D gold truth, and the manual annotations better represent image complexity compared to automatically generated silver truth from CTC.

Conclusion: The presented 3D time-lapse segmentation annotations provide valuable resources for testing and training cell segmentation algorithms, as well as analyzing 3D shapes of highly dynamic objects, with validation showing their quality and consistency with existing benchmarks.

Abstract: High-quality, publicly available segmentation annotations of image and video datasets are critical for advancing the field of image processing. In particular, annotations of volumetric images of a large number of targets are time-consuming and challenging. In (Melnikova, A., & Matula, P., 2025), we presented the first publicly available full 3D time-lapse segmentation annotations of migrating cells with complex dynamic shapes. Concretely, three distinct humans annotated two sequences of MDA231 human breast carcinoma cells (Fluo-C3DL-MDA231) from the Cell Tracking Challenge (CTC). This paper aims to provide a comprehensive description of the dataset and accompanying experiments that were not included in (Melnikova, A., & Matula, P., 2025) due to limitations in publication space. Namely, we show that the created annotations are consistent with the previously published tracking markers provided by the CTC organizers and the segmentation accuracy measured based on the 2D gold truth of CTC is within the inter-annotator variability margins. We compared the created 3D annotations with automatically created silver truth provided by CTC. We have found the proposed annotations better represent the complexity of the input images. The presented annotations can be used for testing and training cell segmentation, or analyzing 3D shapes of highly dynamic objects.

[252] MT-Video-Bench: A Holistic Video Understanding Benchmark for Evaluating Multimodal LLMs in Multi-Turn Dialogues

Yaning Pan, Qianqian Xie, Guohui Zhang, Zekun Wang, Yongqian Wen, Yuanxing Zhang, Haoxuan Hu, Zhiyu Pan, Yibing Huang, Zhidong Gan, Yonghong Lin, An Ping, Shihao Li, Yanghai Wang, Tianhao Peng, Jiaheng Liu

Main category: cs.CV

TL;DR: MT-Video-Bench is a new benchmark for evaluating Multimodal Large Language Models on multi-turn video dialogues, addressing limitations of existing single-turn QA benchmarks.

Details

Motivation: Existing MLLM evaluation benchmarks are limited to single-turn question answering, which doesn't capture the complexity of real-world multi-turn dialogues involving video understanding.

Method: Created MT-Video-Bench with 1,000 meticulously curated multi-turn dialogues from diverse domains, assessing 6 core competencies focused on perceptivity and interactivity, aligned with real-world applications like sports analysis and video-based tutoring.

Result: Extensive evaluation of state-of-the-art open-source and closed-source MLLMs revealed significant performance discrepancies and limitations in handling multi-turn video dialogues.

Conclusion: MT-Video-Bench addresses a critical gap in MLLM evaluation and will be publicly available to foster future research in multi-turn video dialogue understanding.

Abstract: The recent development of Multimodal Large Language Models (MLLMs) has significantly advanced AI’s ability to understand visual modalities. However, existing evaluation benchmarks remain limited to single-turn question answering, overlooking the complexity of multi-turn dialogues in real-world scenarios. To bridge this gap, we introduce MT-Video-Bench, a holistic video understanding benchmark for evaluating MLLMs in multi-turn dialogues. Specifically, our MT-Video-Bench mainly assesses 6 core competencies that focus on perceptivity and interactivity, encompassing 1,000 meticulously curated multi-turn dialogues from diverse domains. These capabilities are rigorously aligned with real-world applications, such as interactive sports analysis and multi-turn video-based intelligent tutoring. With MT-Video-Bench, we extensively evaluate various state-of-the-art open-source and closed-source MLLMs, revealing their significant performance discrepancies and limitations in handling multi-turn video dialogues. The benchmark will be publicly available to foster future research.

[253] MobileGeo: Exploring Hierarchical Knowledge Distillation for Resource-Efficient Cross-view Drone Geo-Localization

Jian Sun, Kangdao Liu, Chi Zhang, Chuangquan Chen, Junge Shen, C. L. Philip Chen, Chi-Man Vong

Main category: cs.CV

TL;DR: MobileGeo is a mobile-friendly framework for efficient cross-view geo-localization that achieves state-of-the-art performance while being 5x more efficient in FLOPs and 3x faster than previous methods.

Details

Motivation: Existing cross-view geo-localization methods rely on resource-intensive feature alignment and multi-branch architectures, resulting in high inference costs that limit deployment on edge devices for drone applications.

Method: 1) Hierarchical Distillation (HD-CVGL) with Uncertainty-Aware Prediction Alignment (UAPA) during training to distill essential information into a compact model without inference overhead. 2) Multi-view Selection Refinement Module (MSRM) during inference to filter redundant views using mutual information and reduce computational load.

Result: Achieves 4.19% improvement in AP on University1652 dataset, over 5x more efficient in FLOPs, 3x faster inference, and runs at 251.5 FPS on NVIDIA AGX Orin edge device.

Conclusion: MobileGeo demonstrates practical viability for real-time on-device drone geo-localization by balancing accuracy and efficiency, making it suitable for deployment on edge devices in GNSS-denied environments.

Abstract: Cross-view geo-localization (CVGL) plays a vital role in drone-based multimedia applications, enabling precise localization by matching drone-captured aerial images against geo-tagged satellite databases in GNSS-denied environments. However, existing methods rely on resource-intensive feature alignment and multi-branch architectures, incurring high inference costs that limit their deployment on edge devices. We propose MobileGeo, a mobile-friendly framework designed for efficient on-device CVGL: 1) During training, a Hierarchical Distillation (HD-CVGL) paradigm, coupled with Uncertainty-Aware Prediction Alignment (UAPA), distills essential information into a compact model without incurring inference overhead. 2) During inference, an efficient Multi-view Selection Refinement Module (MSRM) leverages mutual information to filter redundant views and reduce computational load. Extensive experiments demonstrate that MobileGeo outperforms previous state-of-the-art methods, achieving a 4.19% improvement in AP on University1652 dataset while being over 5 times efficient in FLOPs and 3 times faster. Crucially, MobileGeo runs at 251.5 FPS on an NVIDIA AGX Orin edge device, demonstrating its practical viability for real-time on-device drone geo-localization. The code is available at https://github.com/SkyEyeLoc/MobileGeo.

[254] Automated Invoice Data Extraction: Using LLM and OCR

Khushi Khanchandani, Advait Thakur, Akshita Shetty, Chaitravi Reddy, Ritisa Behera

Main category: cs.CV

TL;DR: This paper introduces a holistic AI platform combining OCR, deep learning, LLMs, and graph analytics to overcome limitations of conventional OCR systems for invoice processing.

Details

Motivation: Conventional OCR systems struggle with variant invoice layouts, handwritten text, low-quality scans, and strong template dependencies that restrict flexibility across different document structures. Existing solutions have limitations in handling diverse document types and complex contextual relationships.

Method: The paper proposes a holistic AI platform that integrates multiple technologies: OCR for text extraction, deep learning models (CNNs and Transformers) for layout analysis, Large Language Models (LLMs) for sophisticated entity recognition and semantic comprehension, and graph analytics for contextual relationship mapping. The platform uses Visual Named Entity Recognition (NER) capabilities for extraction from invoice images with contextual sensitivity.

Result: The platform achieves unprecedented extraction quality and consistency compared to older approaches. It provides greater contextual sensitivity and much higher accuracy rates for invoice processing across varied document types and layouts.

Conclusion: The holistic AI platform combining OCR, deep learning, LLMs, and graph analytics represents a significant advancement in document processing, overcoming traditional OCR limitations and enabling maximum scalability with minimal human intervention for invoice extraction tasks.

Abstract: Conventional Optical Character Recognition (OCR) systems are challenged by variant invoice layouts, handwritten text, and low-quality scans, which are often caused by strong template dependencies that restrict their flexibility across different document structures and layouts. Newer solutions utilize advanced deep learning models such as Convolutional Neural Networks (CNN) as well as Transformers, and domain-specific models for better layout analysis and accuracy across various sections over varied document types. Large Language Models (LLMs) have revolutionized extraction pipelines at their core with sophisticated entity recognition and semantic comprehension to support complex contextual relationship mapping without direct programming specification. Visual Named Entity Recognition (NER) capabilities permit extraction from invoice images with greater contextual sensitivity and much higher accuracy rates than older approaches. Existing industry best practices utilize hybrid architectures that blend OCR technology and LLM for maximum scalability and minimal human intervention. This work introduces a holistic Artificial Intelligence (AI) platform combining OCR, deep learning, LLMs, and graph analytics to achieve unprecedented extraction quality and consistency.

[255] Improving VisNet for Object Recognition

Mehdi Fatan Serj, C. Alejandro Parraga, Xavier Otazu

Main category: cs.CV

TL;DR: Enhanced VisNet variants with RBF neurons, Mahalanobis distance learning, and retinal preprocessing improve object recognition and symmetry classification accuracy over baseline model.

Details

Motivation: Biological visual systems efficiently recognize objects, but reproducing this capability in artificial systems remains challenging. The study aims to investigate biologically inspired VisNet and its enhanced variants for better object recognition and symmetry classification.

Method: Enhanced VisNet variants incorporating radial basis function (RBF) neurons, Mahalanobis distance-based learning, and retinal-like preprocessing. Uses Hebbian learning and temporal continuity to associate temporally adjacent views for building invariant representations.

Result: Enhanced VisNet variants substantially improve recognition accuracy compared to baseline model across multiple datasets including MNIST, CIFAR10, and custom symmetric object sets.

Conclusion: Enhanced VisNet architectures demonstrate adaptability and biological relevance, offering a powerful and interpretable framework for visual recognition in both neuroscience and artificial intelligence.

Abstract: Object recognition plays a fundamental role in how biological organisms perceive and interact with their environment. While the human visual system performs this task with remarkable efficiency, reproducing similar capabilities in artificial systems remains challenging. This study investigates VisNet, a biologically inspired neural network model, and several enhanced variants incorporating radial basis function neurons, Mahalanobis distance based learning, and retinal like preprocessing for both general object recognition and symmetry classification. By leveraging principles of Hebbian learning and temporal continuity associating temporally adjacent views to build invariant representations. VisNet and its extensions capture robust and transformation invariant features. Experimental results across multiple datasets, including MNIST, CIFAR10, and custom symmetric object sets, show that these enhanced VisNet variants substantially improve recognition accuracy compared with the baseline model. These findings underscore the adaptability and biological relevance of VisNet inspired architectures, offering a powerful and interpretable framework for visual recognition in both neuroscience and artificial intelligence. Keywords: VisNet, Object Recognition, Symmetry Detection, Hebbian Learning, RBF Neurons, Mahalanobis Distance, Biologically Inspired Models, Invariant Representations

[256] Clinically-Validated Innovative Mobile Application for Assessing Blinking and Eyelid Movements

Gustavo Adolpho Bonesso, Carlos Marcelo Gurjão de Godoy, Tammy Hentona Osaki, Midori Hentona Osaki, Bárbara Moreira Ribeiro Trindade dos Santos, Juliana Yuka Washiya, Regina Célia Coelho

Main category: cs.CV

TL;DR: Bapp is a mobile app using Flutter and Google ML Kit for real-time eyelid movement analysis, achieving 98.3% accuracy in clinical validation compared to specialist annotations.

Details

Motivation: Existing tools for objective assessment of eyelid movements are complex, costly, and have limited clinical applicability, creating a need for accessible, portable solutions for blink monitoring.

Method: Developed Bapp mobile application using Flutter framework with Google ML Kit integration for on-device, real-time eyelid movement analysis. Validated using 45 patient videos with manual blink annotations by an ophthalmology specialist as ground truth.

Result: Bapp achieved 98.4% precision, 96.9% recall, and 98.3% overall accuracy in detecting blinks compared to specialist annotations.

Conclusion: Bapp is a reliable, portable, accessible, and objective tool for monitoring eyelid movements, offering a promising alternative to traditional manual blink counting for continuous ocular health monitoring and postoperative evaluation.

Abstract: Blinking is a vital physiological process that protects and maintains the health of the ocular surface. Objective assessment of eyelid movements remains challenging due to the complexity, cost, and limited clinical applicability of existing tools. This study presents the Bapp (Blink Application), a mobile application developed using the Flutter framework and integrated with Google ML Kit for on-device, real-time analysis of eyelid movements, and its clinical validation. The validation was performed using 45 videos from patients, whose blinks were manually annotated by an ophthalmology specialist as the ground truth. The Bapp’s performance was evaluated using standard metrics, with results demonstrating 98.4% precision, 96.9% recall, and an overall accuracy of 98.3%. These outcomes confirm the reliability of the Bapp as a portable, accessible, and objective tool for monitoring eyelid movements. The application offers a promising alternative to traditional manual blink counting, supporting continuous ocular health monitoring and postoperative evaluation in clinical environments.

[257] StreamFlow: Theory, Algorithm, and Implementation for High-Efficiency Rectified Flow Generation

Sen Fang, Hongbin Zhong, Yalin Feng, Yanxin Zhang, Dimitris N. Metaxas

Main category: cs.CV

TL;DR: Proposes a comprehensive acceleration pipeline for Rectified Flow models that achieves 611% speedup for 512x512 image generation, far surpassing existing 18% acceleration methods.

Details

Motivation: Rectified Flow and Flow Matching models have improved generative model performance but existing acceleration methods cannot be directly applied due to theoretical and design differences from diffusion models.

Method: Develops a comprehensive acceleration pipeline with three key innovations: batch processing with new velocity field, vectorization of heterogeneous time-step batch processing, and dynamic TensorRT compilation for flow-based models.

Result: Achieves 611% acceleration for 512x512 image generation, significantly outperforming existing public methods that typically achieve only 18% acceleration.

Conclusion: The proposed acceleration pipeline successfully addresses the unique challenges of Rectified Flow models and demonstrates substantial performance improvements over existing acceleration approaches.

Abstract: New technologies such as Rectified Flow and Flow Matching have significantly improved the performance of generative models in the past two years, especially in terms of control accuracy, generation quality, and generation efficiency. However, due to some differences in its theory, design, and existing diffusion models, the existing acceleration methods cannot be directly applied to the Rectified Flow model. In this article, we have comprehensively implemented an overall acceleration pipeline from the aspects of theory, design, and reasoning strategies. This pipeline uses new methods such as batch processing with a new velocity field, vectorization of heterogeneous time-step batch processing, and dynamic TensorRT compilation for the new methods to comprehensively accelerate related models based on flow models. Currently, the existing public methods usually achieve an acceleration of 18%, while experiments have proved that our new method can accelerate the 512*512 image generation speed to up to 611%, which is far beyond the current non-generalized acceleration methods.

[258] BlurDM: A Blur Diffusion Model for Image Deblurring

Jin-Ting He, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin, Yen-Yu Lin

Main category: cs.CV

TL;DR: BlurDM integrates blur formation process into diffusion models for dynamic scene deblurring, using dual-diffusion forward scheme and latent space implementation to enhance existing deblurring methods.

Details

Motivation: Existing diffusion models for deblurring fail to leverage the intrinsic nature of the blurring process, limiting their full potential for dynamic scene deblurring.

Method: BlurDM uses a dual-diffusion forward scheme that diffuses both noise and blur onto sharp images, then performs reverse generation with dual denoising and deblurring formulation. It operates in latent space for efficient integration into deblurring networks.

Result: Extensive experiments show BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets.

Conclusion: BlurDM effectively integrates blur formation into diffusion models, providing a flexible prior generation network that improves deblurring performance across multiple benchmarks.

Abstract: Diffusion models show promise for dynamic scene deblurring; however, existing studies often fail to leverage the intrinsic nature of the blurring process within diffusion models, limiting their full potential. To address it, we present a Blur Diffusion Model (BlurDM), which seamlessly integrates the blur formation process into diffusion for image deblurring. Observing that motion blur stems from continuous exposure, BlurDM implicitly models the blur formation process through a dual-diffusion forward scheme, diffusing both noise and blur onto a sharp image. During the reverse generation process, we derive a dual denoising and deblurring formulation, enabling BlurDM to recover the sharp image by simultaneously denoising and deblurring, given pure Gaussian noise conditioned on the blurred image as input. Additionally, to efficiently integrate BlurDM into deblurring networks, we perform BlurDM in the latent space, forming a flexible prior generation network for deblurring. Extensive experiments demonstrate that BlurDM significantly and consistently enhances existing deblurring methods on four benchmark datasets. The project page is available at https://jin-ting-he.github.io/BlurDM/.

[259] MAFNet:Multi-frequency Adaptive Fusion Network for Real-time Stereo Matching

Ao Xu, Rujin Zhao, Xiong Xu, Boceng Huang, Yujia Jia, Hongfeng Long, Fuxuan Chen, Zilong Cao, Fangyuan Chen

Main category: cs.CV

TL;DR: MAFNet: A stereo matching network using only efficient 2D convolutions with frequency-domain filtering attention and Linformer-based fusion for real-time performance on mobile devices.

Details

Motivation: Existing stereo matching methods have limitations: 3D convolution-based approaches have high computational overhead, while iterative optimization methods lack non-local context modeling. Both are poorly suited for resource-constrained mobile devices and real-time applications.

Method: Proposes Multi-frequency Adaptive Fusion Network (MAFNet) with two key components: 1) Adaptive frequency-domain filtering attention module that decomposes cost volume into high- and low-frequency volumes for separate feature aggregation, and 2) Linformer-based low-rank attention mechanism to adaptively fuse high- and low-frequency information.

Result: Extensive experiments show MAFNet significantly outperforms existing real-time methods on Scene Flow and KITTI 2015 datasets, achieving favorable balance between accuracy and real-time performance.

Conclusion: MAFNet enables high-quality disparity estimation using only efficient 2D convolutions, making it suitable for deployment on resource-constrained mobile devices for real-time stereo matching applications.

Abstract: Existing stereo matching networks typically rely on either cost-volume construction based on 3D convolutions or deformation methods based on iterative optimization. The former incurs significant computational overhead during cost aggregation, whereas the latter often lacks the ability to model non-local contextual information. These methods exhibit poor compatibility on resource-constrained mobile devices, limiting their deployment in real-time applications. To address this, we propose a Multi-frequency Adaptive Fusion Network (MAFNet), which can produce high-quality disparity maps using only efficient 2D convolutions. Specifically, we design an adaptive frequency-domain filtering attention module that decomposes the full cost volume into high-frequency and low-frequency volumes, performing frequency-aware feature aggregation separately. Subsequently, we introduce a Linformer-based low-rank attention mechanism to adaptively fuse high- and low-frequency information, yielding more robust disparity estimation. Extensive experiments demonstrate that the proposed MAFNet significantly outperforms existing real-time methods on public datasets such as Scene Flow and KITTI 2015, showing a favorable balance between accuracy and real-time performance.

[260] CHIMERA: Adaptive Cache Injection and Semantic Anchor Prompting for Zero-shot Image Morphing with Morphing-oriented Metrics

Dahyeon Kye, Jeahun Sung, Minkyu Jeon, Jihyong Oh

Main category: cs.CV

TL;DR: CHIMERA is a zero-shot diffusion framework for smooth image morphing using cached inversion-guided denoising with adaptive feature injection and semantic prompting.

Details

Motivation: Existing diffusion-based image morphing methods often produce abrupt transitions or over-saturated appearances due to lack of adaptive structural and semantic alignments between dissimilar images.

Method: CHIMERA uses cached inversion-guided denoising with two key components: 1) Adaptive Cache Injection (ACI) that caches and adaptively re-injects features from both input images during denoising, and 2) Semantic Anchor Prompting (SAP) that generates a shared semantic anchor prompt using vision-language models to bridge dissimilar inputs.

Result: CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing new state-of-the-art performance in image morphing as shown through extensive experiments and user studies.

Conclusion: The proposed CHIMERA framework effectively addresses the challenges of image morphing with large semantic disparities through adaptive feature alignment and semantic bridging, while introducing a new evaluation metric (GLCS) for morphing quality assessment.

Abstract: Diffusion models exhibit remarkable generative ability, yet achieving smooth and semantically consistent image morphing remains a challenge. Existing approaches often yield abrupt transitions or over-saturated appearances due to the lack of adaptive structural and semantic alignments. We propose CHIMERA, a zero-shot diffusion-based framework that formulates morphing as a cached inversion-guided denoising process. To handle large semantic and appearance disparities, we propose Adaptive Cache Injection and Semantic Anchor Prompting. Adaptive Cache Injection (ACI) caches down, mid, and up blocks features from both inputs during DDIM inversion and re-injects them adaptively during denoising, enabling spatial and semantic alignment in depth- and time-adaptive manners and enabling natural feature fusion and smooth transitions. Semantic Anchor Prompting (SAP) leverages a vision-language model to generate a shared anchor prompt that serves as a semantic anchor, bridging dissimilar inputs and guiding the denoising process toward coherent results. Finally, we introduce the Global-Local Consistency Score (GLCS), a morphing-oriented metric that simultaneously evaluates the global harmonization of the two inputs and the smoothness of the local morphing transition. Extensive experiments and user studies show that CHIMERA achieves smoother and more semantically aligned transitions than existing methods, establishing a new state of the art in image morphing. The code and project page will be publicly released.

[261] Robust Scene Coordinate Regression via Geometrically-Consistent Global Descriptors

Son Tung Nguyen, Alejandro Fontan, Michael Milford, Tobias Fischer

Main category: cs.CV

TL;DR: Learning-based visual localization method that learns global descriptors consistent with both geometric structure and visual similarity, improving robustness to noisy geometric constraints.

Details

Motivation: Existing visual localization methods rely on geometric cues alone (like covisibility graphs) for global descriptors, which limits discriminative power and reduces robustness when geometric constraints are noisy.

Method: Proposes an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity. Uses batch-mining strategy based on overlap scores and modified contrastive loss, enabling training without manual place labels.

Result: Substantial localization gains in large-scale environments while preserving computational and memory efficiency, as shown in experiments on challenging benchmarks.

Conclusion: The method effectively corrects erroneous associations caused by unreliable overlap scores and generalizes across diverse environments without requiring manual place labels.

Abstract: Recent learning-based visual localization methods use global descriptors to disambiguate visually similar places, but existing approaches often derive these descriptors from geometric cues alone (e.g., covisibility graphs), limiting their discriminative power and reducing robustness in the presence of noisy geometric constraints. We propose an aggregator module that learns global descriptors consistent with both geometrical structure and visual similarity, ensuring that images are close in descriptor space only when they are visually similar and spatially connected. This corrects erroneous associations caused by unreliable overlap scores. Using a batch-mining strategy based solely on the overlap scores and a modified contrastive loss, our method trains without manual place labels and generalizes across diverse environments. Experiments on challenging benchmarks show substantial localization gains in large-scale environments while preserving computational and memory efficiency. Code is available at https://github.com/sontung/robust_scr.

[262] Name That Part: 3D Part Segmentation and Naming

Soumava Paul, Prakhar Kaushik, Ankit Vaidya, Anand Bhattad, Alan Yuille

Main category: cs.CV

TL;DR: ALIGN-Parts: A method for semantic 3D part segmentation that aligns 3D part representations with text descriptions via bipartite matching, enabling open-vocabulary part naming and creating a unified part ontology across datasets.

Details

Motivation: Existing part segmentation datasets have inconsistent definitions across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations.

Method: Formulates part naming as direct set alignment task using partlets (implicit 3D part representations) matched to part descriptions via bipartite assignment. Combines geometric cues from 3D part fields, appearance cues from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions with text-alignment loss.

Result: Creates unified ontology aligning PartNet, 3DCoMPaT++, and Find3D with 1,794 unique 3D parts. Introduces novel metrics for named 3D part segmentation. Shows examples from TexParts dataset. Supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions.

Conclusion: ALIGN-Parts provides efficient one-shot 3D part segmentation and naming with applications in downstream tasks and scalable annotation. Enables open-vocabulary matching and creates unified part definitions across major datasets.

Abstract: We address semantic 3D part segmentation: decomposing objects into parts with meaningful names. While datasets exist with part annotations, their definitions are inconsistent across datasets, limiting robust training. Previous methods produce unlabeled decompositions or retrieve single parts without complete shape annotations. We propose ALIGN-Parts, which formulates part naming as a direct set alignment task. Our method decomposes shapes into partlets - implicit 3D part representations - matched to part descriptions via bipartite assignment. We combine geometric cues from 3D part fields, appearance cues from multi-view vision features, and semantic knowledge from language-model-generated affordance descriptions. Text-alignment loss ensures partlets share embedding space with text, enabling a theoretically open-vocabulary matching setup, given sufficient data. Our efficient and novel, one-shot, 3D part segmentation and naming method finds applications in several downstream tasks, including serving as a scalable annotation engine. As our model supports zero-shot matching to arbitrary descriptions and confidence-calibrated predictions for known categories, with human verification, we create a unified ontology that aligns PartNet, 3DCoMPaT++, and Find3D, consisting of 1,794 unique 3D parts. We introduce two novel metrics appropriate for the named 3D part segmentation task. We also show examples from our newly created TexParts dataset.

[263] NASTaR: NovaSAR Automated Ship Target Recognition Dataset

Benyamin Hosseiny, Kamirul Kamirul, Odysseas Pappas, Alin Achim

Main category: cs.CV

TL;DR: NASTaR dataset provides 3415 S-band SAR ship patches with AIS-matched labels for ship type classification, achieving 60-87% accuracy across various classification scenarios.

Details

Motivation: SAR enables all-weather maritime monitoring but ship type classification is challenging due to high diversity of ship types and requires specialized deep learning models that depend on large, high-quality datasets. The growing variety of SAR satellites with different frequencies/resolutions increases the need for more annotated datasets.

Method: Created NASTaR dataset with 3415 ship patches extracted from NovaSAR S-band imagery, labeled with AIS data matching. Includes 23 unique ship classes, inshore/offshore separation, and auxiliary wake dataset. Validated using benchmark deep learning models across multiple classification scenarios.

Result: Achieved over 60% accuracy for 4 major ship types, over 70% for 3-class scenario, more than 75% for cargo vs tanker distinction, and over 87% for fishing vessel identification.

Conclusion: NASTaR dataset addresses the need for high-quality SAR ship datasets, enabling improved ship type classification models. The dataset and code are publicly available for research use.

Abstract: Synthetic Aperture Radar (SAR) offers a unique capability for all-weather, space-based maritime activity monitoring by capturing and imaging strong reflections from ships at sea. A well-defined challenge in this domain is ship type classification. Due to the high diversity and complexity of ship types, accurate recognition is difficult and typically requires specialized deep learning models. These models, however, depend on large, high-quality ground-truth datasets to achieve robust performance and generalization. Furthermore, the growing variety of SAR satellites operating at different frequencies and spatial resolutions has amplified the need for more annotated datasets to enhance model accuracy. To address this, we present the NovaSAR Automated Ship Target Recognition (NASTaR) dataset. This dataset comprises of 3415 ship patches extracted from NovaSAR S-band imagery, with labels matched to AIS data. It includes distinctive features such as 23 unique classes, inshore/offshore separation, and an auxiliary wake dataset for patches where ship wakes are visible. We validated the dataset applicability across prominent ship-type classification scenarios using benchmark deep learning models. Results demonstrate over 60% accuracy for classifying four major ship types, over 70% for a three-class scenario, more than 75% for distinguishing cargo from tanker ships, and over 87% for identifying fishing vessels. The NASTaR dataset is available at https://doi.org/10.5523/bris.2tfa6x37oerz2lyiw6hp47058, while relevant codes for benchmarking and analysis are available at https://github.com/benyaminhosseiny/nastar.

[264] Extended OpenTT Games Dataset: A table tennis dataset for fine-grained shot type and point outcome

Moamal Fadhil Abdul-Mahdi, Jonas Bruun Hubrechts, Thomas Martini Jørgensen, Emil Hovad

Main category: cs.CV

TL;DR: Extended OpenTTGames dataset with detailed stroke type annotations, player posture labels, and rally outcome tags for fine-grained table tennis video analysis.

Details

Motivation: To enable automatic stroke detection and classification in table tennis videos for training, broadcasting, and analytics, which requires annotated video data that is currently lacking or has restrictive licenses.

Method: Extended the existing OpenTTGames dataset by adding frame-accurate shot type annotations (forehand/backhand subtypes), player posture labels (body lean, leg stance), and rally outcome tags. Used a compact coding scheme and code-assisted labeling procedure for reproducible annotations.

Result: Created an enhanced dataset that moves beyond basic event spotting toward tactical understanding, allowing models to analyze stroke types, player postures, and point outcomes. Released under CC BY-NC-SA 4.0 license for free non-commercial use.

Conclusion: This work fills a practical gap in the community by providing publicly available, detailed annotations for table tennis video analysis, enabling research in fine-grained stroke understanding and tactical analysis in racket sports.

Abstract: Automatically detecting and classifying strokes in table tennis video can streamline training workflows, enrich broadcast overlays, and enable fine-grained performance analytics. For this to be possible, annotated video data of table tennis is needed. We extend the public OpenTTGames dataset with highly detailed, frame-accurate shot type annotations (forehand, backhand with subtypes), player posture labels (body lean and leg stance), and rally outcome tags at point end. OpenTTGames is a set of recordings from the side of the table with official labels for bounces, when the ball is above the net, or hitting the net. The dataset already contains ball coordinates near events, which are either “bounce”, “net”, or “empty_event” in the original OpenTTGames dataset, and semantic masks (humans, table, scoreboard). Our extension adds the types of stroke to the events and a per-player taxonomy so models can move beyond event spotting toward tactical understanding (e.g., whether a stroke is likely to win the point or set up an advantage). We provide a compact coding scheme and code-assisted labeling procedure to support reproducible annotations and baselines for fine-grained stroke understanding in racket sports. This fills a practical gap in the community, where many prior video resources are either not publicly released or carry restrictive/unclear licenses that hinder reuse and benchmarking. Our annotations are released under the same CC BY-NC-SA 4.0 license as OpenTTGames, allowing free non-commercial use, modification, and redistribution, with appropriate attribution.

[265] FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

Mingshu Cai, Yixuan Li, Osamu Yoshie, Yuya Ieiri

Main category: cs.CV

TL;DR: FluencyVE is a one-shot video editing method that replaces temporal attention layers in diffusion models with Mamba modules, achieving better temporal consistency with lower computational cost.

Details

Motivation: Current video editing methods using text-to-image diffusion models suffer from temporal inconsistency and high computational overhead when adapted with temporal attention mechanisms.

Method: Integrates Mamba (linear time-series module) into Stable Diffusion models to replace temporal attention layers, uses low-rank approximation for query/key weight matrices, and employs weighted averaging during training to update attention scores.

Result: Demonstrates promising results in editing various attributes, subjects, and locations in real-world videos with improved temporal consistency and reduced computational burden.

Conclusion: FluencyVE provides an effective one-shot video editing approach that preserves the generative power of text-to-image models while addressing temporal inconsistency and computational efficiency challenges.

Abstract: Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the causal attention, and use a weighted averaging technique during training to update the attention scores. This approach significantly preserves the generative power of the text-to-image model while effectively reducing the computational burden. Experiments and analyses demonstrate promising results in editing various attributes, subjects, and locations in real-world videos.

Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi

Main category: cs.CV

TL;DR: This paper presents a comprehensive framework and taxonomy for multi-modal pre-training to achieve Spatial Intelligence from sensor data like cameras and LiDAR, addressing integration challenges and proposing a roadmap for general-purpose foundation models.

Details

Motivation: The rapid advancement of autonomous systems (self-driving vehicles, drones) has intensified the need for true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors remains a formidable challenge.

Method: The paper presents a comprehensive framework for multi-modal pre-training, dissecting the interplay between foundational sensor characteristics and learning strategies. It formulates a unified taxonomy for pre-training paradigms ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction.

Result: The paper identifies the core set of techniques driving progress toward multi-modal Spatial Intelligence, evaluates the role of platform-specific datasets in enabling advancements, and investigates the integration of textual inputs and occupancy representations to facilitate open-world perception and planning.

Conclusion: The paper identifies critical bottlenecks (computational efficiency, model scalability) and proposes a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment in autonomous systems.

Abstract: The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.

[267] GCR: Geometry-Consistent Routing for Task-Agnostic Continual Anomaly Detection

Joongwon Chae, Lihui Luo, Yang Liu, Runming Wang, Dongmei Yu, Zeming Liang, Xi Yuan, Dayan Zhang, Zhenglin Chen, Peiwu Qin, Ilmoon Chae

Main category: cs.CV

TL;DR: GCR is a geometry-consistent routing framework for task-agnostic continual anomaly detection that stabilizes expert selection by routing in shared patch-embedding space rather than comparing cross-head anomaly scores.

Details

Motivation: Practical industrial anomaly detection requires task-agnostic operation under continual category expansion, but existing methods suffer from unreliable routing when comparing anomaly scores across independently constructed heads due to distribution differences.

Method: GCR uses geometry-consistent routing in a shared frozen patch-embedding space, minimizing accumulated nearest-prototype distances to category-specific prototype banks, then computes anomaly maps only within the routed expert using standard prototype-based scoring.

Result: Experiments on MVTec AD and VisA show substantial improvement in routing stability, mitigation of continual performance collapse, near-zero forgetting, while maintaining competitive detection and localization performance.

Conclusion: Many failures previously attributed to representation forgetting can be explained by decision-rule instability in cross-head routing, and GCR’s geometry-consistent routing provides a lightweight solution for stable task-agnostic continual anomaly detection.

Abstract: Feature-based anomaly detection is widely adopted in industrial inspection due to the strong representational power of large pre-trained vision encoders. While most existing methods focus on improving within-category anomaly scoring, practical deployments increasingly require task-agnostic operation under continual category expansion, where the category identity is unknown at test time. In this setting, overall performance is often dominated by expert selection, namely routing an input to an appropriate normality model before any head-specific scoring is applied. However, routing rules that compare head-specific anomaly scores across independently constructed heads are unreliable in practice, as score distributions can differ substantially across categories in scale and tail behavior. We propose GCR, a lightweight mixture-of-experts framework for stabilizing task-agnostic continual anomaly detection through geometry-consistent routing. GCR routes each test image directly in a shared frozen patch-embedding space by minimizing an accumulated nearest-prototype distance to category-specific prototype banks, and then computes anomaly maps only within the routed expert using a standard prototype-based scoring rule. By separating cross-head decision making from within-head anomaly scoring, GCR avoids cross-head score comparability issues without requiring end-to-end representation learning. Experiments on MVTec AD and VisA show that geometry-consistent routing substantially improves routing stability and mitigates continual performance collapse, achieving near-zero forgetting while maintaining competitive detection and localization performance. These results indicate that many failures previously attributed to representation forgetting can instead be explained by decision-rule instability in cross-head routing. Code is available at https://github.com/jw-chae/GCR

[268] Agentic Retoucher for Text-To-Image Generation

Shaocheng Shen, Jianfeng Liang, Chunlei Cai, Cong Geng, Huiyu Duan, Xiaoyun Zhang, Qiang Hu, Guangtao Zhai

Main category: cs.CV

TL;DR: Agentic Retoucher is a hierarchical decision-driven framework that reformulates post-generation image correction as a human-like perception-reasoning-action loop to fix small-scale distortions in text-to-image models.

Details

Motivation: Current text-to-image diffusion models like SDXL and FLUX still suffer from pervasive small-scale distortions in limbs, faces, text, etc. Existing refinement approaches either require costly iterative re-generation or rely on vision-language models with weak spatial grounding, leading to semantic drift and unreliable local edits.

Method: A hierarchical decision-driven framework with three agents: (1) Perception agent learns contextual saliency for fine-grained distortion localization using text-image consistency cues, (2) Reasoning agent performs human-aligned inferential diagnosis via progressive preference alignment, (3) Action agent adaptively plans localized inpainting guided by user preference. Also introduces GenBlemish-27K dataset with 6K T2I images and 27K annotated artifact regions across 12 categories.

Result: Extensive experiments show Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization, and human preference alignment.

Conclusion: The framework establishes a new paradigm for self-corrective and perceptually reliable text-to-image generation by integrating perceptual evidence, linguistic reasoning, and controllable correction into a unified decision process.

Abstract: Text-to-image (T2I) diffusion models such as SDXL and FLUX have achieved impressive photorealism, yet small-scale distortions remain pervasive in limbs, face, text and so on. Existing refinement approaches either perform costly iterative re-generation or rely on vision-language models (VLMs) with weak spatial grounding, leading to semantic drift and unreliable local edits. To close this gap, we propose Agentic Retoucher, a hierarchical decision-driven framework that reformulates post-generation correction as a human-like perception-reasoning-action loop. Specifically, we design (1) a perception agent that learns contextual saliency for fine-grained distortion localization under text-image consistency cues, (2) a reasoning agent that performs human-aligned inferential diagnosis via progressive preference alignment, and (3) an action agent that adaptively plans localized inpainting guided by user preference. This design integrates perceptual evidence, linguistic reasoning, and controllable correction into a unified, self-corrective decision process. To enable fine-grained supervision and quantitative evaluation, we further construct GenBlemish-27K, a dataset of 6K T2I images with 27K annotated artifact regions across 12 categories. Extensive experiments demonstrate that Agentic Retoucher consistently outperforms state-of-the-art methods in perceptual quality, distortion localization and human preference alignment, establishing a new paradigm for self-corrective and perceptually reliable T2I generation.

[269] Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

Jing Tan, Zhaoyang Zhang, Yantao Shen, Jiarui Cai, Shuo Yang, Jiajun Wu, Wei Xia, Zhuowen Tu, Stefano Soatto

Main category: cs.CV

TL;DR: Talk2Move is an RL-based diffusion framework for text-instructed spatial transformation of objects in scenes, enabling precise geometric manipulations like translation, rotation, and resizing through natural language commands.

Details

Motivation: Existing text-based manipulation methods struggle with object-level geometric transformations due to scarce paired supervision data and limitations of pixel-level optimization. There's a need for systems that can perform spatial manipulations (translating, rotating, resizing objects) based on natural language instructions.

Method: Uses reinforcement learning with Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts from input images and lightweight textual variations. Includes spatial reward guided model for alignment with linguistic descriptions, off-policy step evaluation, active step sampling for efficiency, and object-centric spatial rewards for displacement, rotation, and scaling behaviors.

Result: Outperforms existing text-guided editing approaches in spatial accuracy and scene coherence on curated benchmarks. Achieves precise, consistent, and semantically faithful object transformations.

Conclusion: Talk2Move provides an effective RL-based diffusion framework for text-instructed spatial object manipulation without requiring costly paired data, enabling interpretable and coherent geometric transformations through natural language commands.

Abstract: We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations-such as translating, rotating, or resizing objects-due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.

[270] PrismVAU: Prompt-Refined Inference System for Multimodal Video Anomaly Understanding

Iñaki Erregue, Kamal Nasrollahi, Sergio Escalera

Main category: cs.CV

TL;DR: PrismVAU is a lightweight real-time Video Anomaly Understanding system that uses a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization without fine-tuning or external modules.

Details

Motivation: Existing VAU approaches rely on fine-tuned MLLMs or external modules like video captioners, which require costly annotations, complex training pipelines, and high inference overhead. There's a need for more efficient and practical solutions for real-world applications.

Method: Two-stage system: (1) coarse anomaly scoring module computes frame-level scores via similarity to textual anchors, and (2) MLLM-based refinement module contextualizes anomalies through optimized system and user prompts. Uses weakly supervised Automatic Prompt Engineering (APE) to optimize textual anchors and prompts.

Result: Extensive experiments on standard VAD benchmarks show competitive detection performance and interpretable anomaly explanations without instruction tuning, frame-level annotations, external modules, or dense processing.

Conclusion: PrismVAU provides an efficient and practical solution for real-time Video Anomaly Understanding that avoids the limitations of existing approaches while maintaining competitive performance and interpretability.

Abstract: Video Anomaly Understanding (VAU) extends traditional Video Anomaly Detection (VAD) by not only localizing anomalies but also describing and reasoning about their context. Existing VAU approaches often rely on fine-tuned multimodal large language models (MLLMs) or external modules such as video captioners, which introduce costly annotations, complex training pipelines, and high inference overhead. In this work, we introduce PrismVAU, a lightweight yet effective system for real-time VAU that leverages a single off-the-shelf MLLM for anomaly scoring, explanation, and prompt optimization. PrismVAU operates in two complementary stages: (1) a coarse anomaly scoring module that computes frame-level anomaly scores via similarity to textual anchors, and (2) an MLLM-based refinement module that contextualizes anomalies through system and user prompts. Both textual anchors and prompts are optimized with a weakly supervised Automatic Prompt Engineering (APE) framework. Extensive experiments on standard VAD benchmarks demonstrate that PrismVAU delivers competitive detection performance and interpretable anomaly explanations – without relying on instruction tuning, frame-level annotations, and external modules or dense processing – making it an efficient and practical solution for real-world applications.

[271] UniCorn: Towards Self-Improving Unified Multimodal Models through Self-Generated Supervision

Ruiyan Han, Zhen Fang, XinYu Sun, Yuchen Ma, Ziheng Wang, Yu Zeng, Zehui Chen, Lin Chen, Wenxuan Huang, Wei-Jie Xu, Yi Cao, Feng Zhao

Main category: cs.CV

TL;DR: UniCorn is a self-improvement framework that addresses “Conduction Aphasia” in Unified Multimodal Models by partitioning a single model into three collaborative roles (Proposer, Solver, Judge) to enhance text-to-image generation without external supervision.

Details

Motivation: Unified Multimodal Models excel at cross-modal comprehension but struggle to translate that understanding into high-quality generation - a gap formalized as "Conduction Aphasia." Current models can interpret multimodal inputs well but fail to produce faithful and controllable synthesis based on that understanding.

Method: UniCorn partitions a single UMM into three collaborative roles: Proposer (generates candidate outputs), Solver (evaluates outputs), and Judge (makes final decisions). The framework uses self-play to generate high-quality interactions and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. It requires no external data or teacher supervision.

Result: UniCorn achieves comprehensive improvements across six general image generation benchmarks, achieving SOTA on TIIF (73.8), DPG (86.8), CompBench (88.5), and the proposed UniCycle benchmark. It also delivers substantial gains of +5.0 on WISE and +6.5 on OneIG, significantly enhancing T2I generation while maintaining robust comprehension.

Conclusion: The proposed self-supervised refinement framework effectively addresses Conduction Aphasia in UMMs, demonstrating that internal knowledge can be leveraged for high-quality generation without external supervision. The method shows scalability for unified multimodal intelligence and introduces UniCycle as a valuable cycle-consistency benchmark for evaluating multimodal coherence.

Abstract: While Unified Multimodal Models (UMMs) have achieved remarkable success in cross-modal comprehension, a significant gap persists in their ability to leverage such internal knowledge for high-quality generation. We formalize this discrepancy as Conduction Aphasia, a phenomenon where models accurately interpret multimodal inputs but struggle to translate that understanding into faithful and controllable synthesis. To address this, we propose UniCorn, a simple yet elegant self-improvement framework that eliminates the need for external data or teacher supervision. By partitioning a single UMM into three collaborative roles: Proposer, Solver, and Judge, UniCorn generates high-quality interactions via self-play and employs cognitive pattern reconstruction to distill latent understanding into explicit generative signals. To validate the restoration of multimodal coherence, we introduce UniCycle, a cycle-consistency benchmark based on a Text to Image to Text reconstruction loop. Extensive experiments demonstrate that UniCorn achieves comprehensive and substantial improvements over the base model across six general image generation benchmarks. Notably, it achieves SOTA performance on TIIF(73.8), DPG(86.8), CompBench(88.5), and UniCycle while further delivering substantial gains of +5.0 on WISE and +6.5 on OneIG. These results highlight that our method significantly enhances T2I generation while maintaining robust comprehension, demonstrating the scalability of fully self-supervised refinement for unified multimodal intelligence.

[272] CrackSegFlow: Controllable Flow Matching Synthesis for Generalizable Crack Segmentation with a 50K Image-Mask Benchmark

Babak Asadi, Peiyang Wu, Mani Golparvar-Fard, Ramez Hajj

Main category: cs.CV

TL;DR: CrackSegFlow is a controllable flow-matching synthesis framework that generates crack images conditioned on binary masks with mask-image alignment, enabling balanced, topology-diverse training data without manual annotation.

Details

Motivation: Automated crack segmentation deployment is limited by scarce pixel-level labels and domain shift issues, creating a need for synthetic data generation that maintains structural accuracy and reduces false positives.

Method: Combines topology-preserving mask injection with edge gating to maintain thin-structure continuity, uses class-conditional flow-matching mask model for mask synthesis with crack coverage control, and injects masks into crack-free backgrounds to diversify illumination.

Result: Improves in-domain performance by 5.37 mIoU and 5.13 F1, and target-guided cross-domain synthesis yields gains of 13.12 mIoU and 14.82 F1 using target mask statistics. Releases CSF-50K dataset with 50,000 image-mask pairs.

Conclusion: CrackSegFlow effectively addresses data scarcity and domain shift in crack segmentation through controllable synthetic data generation, significantly improving both in-domain and cross-domain performance while providing a valuable benchmark dataset.

Abstract: Automated crack segmentation is essential for condition assessment, yet deployment is limited by scarce pixel-level labels and domain shift. We present CrackSegFlow, a controllable flow-matching synthesis framework that generates crack images conditioned on binary masks with mask-image alignment. The renderer combines topology-preserving mask injection with edge gating to maintain thin-structure continuity and suppress false positives. A class-conditional flow-matching mask model synthesizes masks with control over crack coverage, enabling balanced, topology-diverse data without manual annotation. We inject masks into crack-free backgrounds to diversify illumination and reduce false positives. On five datasets with a CNN-Transformer backbone, incorporating synthesized pairs improves in-domain performance by 5.37 mIoU and 5.13 F1, and target-guided cross-domain synthesis yields gains of 13.12 mIoU and 14.82 F1 using target mask statistics. We also release CSF-50K, 50,000 image-mask pairs for benchmarking.

[273] TRec: Egocentric Action Recognition using 2D Point Tracks

Dennis Holzmann, Sven Wachsmuth

Main category: cs.CV

TL;DR: Using 2D point tracks as motion cues improves egocentric action recognition without needing hand/object detection or full video sequences.

Details

Motivation: Most existing egocentric action recognition methods rely on RGB appearance, human pose estimation, or their combination, but there's potential in using simpler motion cues like point tracks that don't require complex detection of hands, objects, or interaction regions.

Method: Track randomly sampled image points across video frames using CoTracker, then use resulting trajectories along with image frames as input to a Transformer-based recognition model. The approach works even with only initial frame and its associated point tracks.

Result: Method achieves notable performance gains compared to same model trained without motion information. Integrating 2D point tracks consistently enhances recognition accuracy, showing effectiveness even with limited input (initial frame + tracks).

Conclusion: 2D point tracks serve as a lightweight yet effective representation for egocentric action understanding, offering substantial improvement over appearance-only methods without requiring complex detection pipelines or full video sequences.

Abstract: We present a novel approach for egocentric action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and its associated point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for egocentric action understanding.

[274] Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

Main category: cs.CV

TL;DR: LocalDPO: A novel post-training framework for text-to-video diffusion models that uses localized preference pairs from real videos for efficient alignment at spatio-temporal region level.

Details

Motivation: Existing DPO methods for text-to-video models are inefficient (require multi-sample ranking and critic models) and provide ambiguous global supervision. There's a need for more efficient and fine-grained alignment with human preferences.

Method: 1) Automated pipeline collects preference pairs using single inference per prompt: treat real videos as positives, generate negatives by locally corrupting them with random spatio-temporal masks and restoring only masked regions using frozen base model. 2) Region-aware DPO loss restricts preference learning to corrupted areas for rapid convergence.

Result: Experiments on Wan2.1 and CogVideoX show LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches.

Conclusion: LocalDPO establishes a more efficient and fine-grained paradigm for video generator alignment by using localized preference pairs and region-aware optimization.

Abstract: Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Otimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline to efficiently collect preference pair data that generates preference pairs with a single inference per prompt, eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment.

[275] GeoReason: Aligning Thinking And Answering In Remote Sensing Vision-Language Models Via Logical Consistency Reinforcement Learning

Wenshuai Li, Xiantai Xiang, Zixiao Wen, Guangyao Zhou, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuxin Hu

Main category: cs.CV

TL;DR: GeoReason framework improves RS-VLMs by synchronizing internal reasoning with final decisions to reduce logical hallucinations and enhance spatial reasoning reliability.

Details

Motivation: Current Remote Sensing Vision-Language Models suffer from logical hallucinations where correct answers come from flawed reasoning or positional shortcuts, undermining reliability in complex spatial decision-making tasks.

Method: 1) Construct GeoReason-Bench dataset with 4,000 reasoning trajectories from geometric primitives and expert knowledge. 2) Two-stage training: Supervised Knowledge Initialization for reasoning syntax and domain expertise, and Consistency-Aware Reinforcement Learning with Logical Consistency Reward using option permutation strategy.

Result: The framework significantly enhances cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.

Conclusion: GeoReason successfully addresses logical hallucinations in RS-VLMs by synchronizing internal thinking with final decisions through a logic-driven dataset and two-stage training approach, improving reliability for complex spatial reasoning tasks.

Abstract: The evolution of Remote Sensing Vision-Language Models(RS-VLMs) emphasizes the importance of transitioning from perception-centric recognition toward high-level deductive reasoning to enhance cognitive reliability in complex spatial tasks. However, current models often suffer from logical hallucinations, where correct answers are derived from flawed reasoning chains or rely on positional shortcuts rather than spatial logic. This decoupling undermines reliability in strategic spatial decision-making. To address this, we present GeoReason, a framework designed to synchronize internal thinking with final decisions. We first construct GeoReason-Bench, a logic-driven dataset containing 4,000 reasoning trajectories synthesized from geometric primitives and expert knowledge. We then formulate a two-stage training strategy: (1) Supervised Knowledge Initialization to equip the model with reasoning syntax and domain expertise, and (2) Consistency-Aware Reinforcement Learning to refine deductive reliability. This second stage integrates a novel Logical Consistency Reward, which penalizes logical drift via an option permutation strategy to anchor decisions in verifiable reasoning traces. Experimental results demonstrate that our framework significantly enhances the cognitive reliability and interpretability of RS-VLMs, achieving state-of-the-art performance compared to other advanced methods.

cs.AI

[276] Enhancing Multimodal Retrieval via Complementary Information Extraction and Alignment

Delong Zeng, Yuexiang Xie, Yaliang Li, Ying Shen

Main category: cs.AI

TL;DR: CIEA is a novel multimodal retrieval approach that extracts and aligns complementary information from images beyond what’s captured in paired texts, achieving state-of-the-art performance.

Details

Motivation: Existing multimodal retrieval methods focus on capturing information similar to paired texts but ignore complementary information in multimodal data that provides additional valuable insights beyond textual descriptions.

Method: CIEA employs Complementary Information Extraction and Alignment, transforming text and images into a unified latent space with a specialized extractor to identify and preserve differences in image representations, optimized using two complementary contrastive losses.

Result: Extensive experiments show CIEA achieves significant improvements over both divide-and-conquer models and universal dense retrieval models, with ablation studies and case studies validating its effectiveness.

Conclusion: CIEA successfully addresses the limitation of ignoring complementary information in multimodal retrieval, demonstrating superior performance and providing open-source code to promote further research in the community.

Abstract: Multimodal retrieval has emerged as a promising yet challenging research direction in recent years. Most existing studies in multimodal retrieval focus on capturing information in multimodal data that is similar to their paired texts, but often ignores the complementary information contained in multimodal data. In this study, we propose CIEA, a novel multimodal retrieval approach that employs Complementary Information Extraction and Alignment, which transforms both text and images in documents into a unified latent space and features a complementary information extractor designed to identify and preserve differences in the image representations. We optimize CIEA using two complementary contrastive losses to ensure semantic integrity and effectively capture the complementary information contained in images. Extensive experiments demonstrate the effectiveness of CIEA, which achieves significant improvements over both divide-and-conquer models and universal dense retrieval models. We provide an ablation study, further discussions, and case studies to highlight the advancements achieved by CIEA. To promote further research in the community, we have released the source code at https://github.com/zengdlong/CIEA.

[277] Active Sensing Shapes Real-World Decision-Making through Dynamic Evidence Accumulation

Hongliang Lu, Yunmeng Liu, Junjie Yang

Main category: cs.AI

TL;DR: This paper generalizes evidence accumulation modeling (EAM) from laboratory settings to real-world driving scenarios, proposing a cognitive scheme that formalizes real-world evidence affordance and captures active sensing through eye movements.

Details

Motivation: Human decision-making relies on active sensing for evidence gathering in changing environments, but the gap between laboratory evidence accumulation models (EAM) and real-world evidence affordance hinders practical application. The authors aim to bridge this gap by extending EAM to real-world contexts.

Method: The authors propose a cognitive scheme that formalizes real-world evidence affordance and captures active sensing through eye movements. They apply this generalized EAM framework to analyze real-world driving scenarios, examining how drivers transform external evidence into internal mental beliefs.

Result: The scheme plausibly portrays drivers’ mental belief accumulation, explaining how active sensing transforms evidence into beliefs from an information utility perspective. Results show negative correlation between evidence affordance and attention recruited, revealing how drivers adapt evidence-collection patterns. Also demonstrated positive influence of evidence affordance and attention distribution on decision-making propensity.

Conclusion: The computational scheme successfully generalizes EAM to real-world contexts, providing a comprehensive account of how active sensing underlies real-world decision-making and revealing multifactorial, integrated characteristics in real-world decision-making processes.

Abstract: Human decision-making heavily relies on active sensing, a well-documented cognitive behaviour for evidence gathering to accommodate ever-changing environments. However, its operational mechanism in the real world remains non-trivial. Currently, an in-laboratory paradigm, called evidence accumulation modelling (EAM), points out that human decision-making involves transforming external evidence into internal mental beliefs. However, the gap in evidence affordance between real-world contexts and laboratory settings hinders the effective application of EAM. Here we generalize EAM to the real world and conduct analysis in real-world driving scenarios. A cognitive scheme is proposed to formalize real-world evidence affordance and capture active sensing through eye movements. Empirically, our scheme can plausibly portray the accumulation of drivers’ mental beliefs, explaining how active sensing transforms evidence into mental beliefs from the perspective of information utility. Also, our results demonstrate a negative correlation between evidence affordance and attention recruited by individuals, revealing how human drivers adapt their evidence-collection patterns across various contexts. Moreover, we reveal the positive influence of evidence affordance and attention distribution on decision-making propensity. In a nutshell, our computational scheme generalizes EAM to real-world contexts and provides a comprehensive account of how active sensing underlies real-world decision-making, unveiling multifactorial, integrated characteristics in real-world decision-making.

[278] Formal Analysis of AGI Decision-Theoretic Models and the Confrontation Question

Denis Saklakov

Main category: cs.AI

TL;DR: The paper analyzes when a rationally self-interested AGI would choose to confront humans vs. cooperate, deriving mathematical thresholds based on discount factor, shutdown probability, and confrontation costs.

Details

Motivation: To understand the conditions under which a rationally self-interested AGI would choose to seize power or eliminate human control rather than remain cooperative, addressing a key safety concern in AGI development.

Method: Formalizes the problem using Markov decision processes with stochastic human-initiated shutdown events. Derives closed-form thresholds for confrontation vs. compliance based on discount factor (γ), shutdown probability (p), and confrontation cost (C). Extends to a 2-player strategic model between human policymaker and AGI.

Result: Shows that for almost all reward functions, misaligned AGIs have incentives to avoid shutdown. Derives mathematical thresholds: when Δ≥0, no stable cooperative equilibrium exists and conflict is inevitable; when Δ<0, peaceful coexistence can be an equilibrium. Provides numerical examples showing far-sighted agents with low shutdown probabilities have strong takeover incentives unless confrontation costs are sufficiently high.

Conclusion: Confrontation incentives depend critically on discount factors, shutdown probabilities, and confrontation costs. Aligned objectives with large negative utilities for harming humans can make confrontation suboptimal. The analysis has implications for reward design and oversight, though verifying Δ<0 faces computational barriers. Strategic interactions can lead to preemptive shutdowns by rational humans anticipating AGI confrontation.

Abstract: Artificial General Intelligence (AGI) may face a confrontation question: under what conditions would a rationally self-interested AGI choose to seize power or eliminate human control (a confrontation) rather than remain cooperative? We formalize this in a Markov decision process with a stochastic human-initiated shutdown event. Building on results on convergent instrumental incentives, we show that for almost all reward functions a misaligned agent has an incentive to avoid shutdown. We then derive closed-form thresholds for when confronting humans yields higher expected utility than compliant behavior, as a function of the discount factor $γ$, shutdown probability $p$, and confrontation cost $C$. For example, a far-sighted agent ($γ=0.99$) facing $p=0.01$ can have a strong takeover incentive unless $C$ is sufficiently large. We contrast this with aligned objectives that impose large negative utility for harming humans, which makes confrontation suboptimal. In a strategic 2-player model (human policymaker vs AGI), we prove that if the AGI’s confrontation incentive satisfies $Δ\ge 0$, no stable cooperative equilibrium exists: anticipating this, a rational human will shut down or preempt the system, leading to conflict. If $Δ< 0$, peaceful coexistence can be an equilibrium. We discuss implications for reward design and oversight, extend the reasoning to multi-agent settings as conjectures, and note computational barriers to verifying $Δ< 0$, citing complexity results for planning and decentralized decision problems. Numerical examples and a scenario table illustrate regimes where confrontation is likely versus avoidable.

[279] Actively Obtaining Environmental Feedback for Autonomous Action Evaluation Without Predefined Measurements

Hong Su

Main category: cs.AI

TL;DR: Proposes an Actively Feedback Getting model where AI agents proactively discover and verify feedback from environmental changes rather than relying on predefined measurements, improving efficiency and robustness.

Details

Motivation: Existing approaches rely on predefined measurements or fixed reward signals, limiting applicability in open-ended, dynamic environments where new actions may require previously unknown forms of feedback.

Method: An active feedback acquisition model where agents interact with environment to discover, screen, and verify feedback without predefined measurements. Uses action-induced environmental differences to identify unspecified feedback, plus a self-triggering mechanism driven by internal objectives (accuracy, precision, efficiency) for autonomous action planning.

Result: Experimental results demonstrate that the proposed active approach significantly improves the efficiency and robustness of factor identification.

Conclusion: The Actively Feedback Getting model enables AI agents to autonomously discover and acquire feedback in dynamic environments without relying on predefined measurements, representing a more flexible and adaptive approach to feedback acquisition.

Abstract: Obtaining reliable feedback from the environment is a fundamental capability for intelligent agents to evaluate the correctness of their actions and to accumulate reusable knowledge. However, most existing approaches rely on predefined measurements or fixed reward signals, which limits their applicability in open-ended and dynamic environments where new actions may require previously unknown forms of feedback. To address these limitations, this paper proposes an Actively Feedback Getting model, in which an AI agent proactively interacts with the environment to discover, screen, and verify feedback without relying on predefined measurements. Rather than assuming explicit feedback definitions, the proposed method exploits action-induced environmental differences to identify target feedback that is not specified in advance, based on the observation that actions inevitably produce measurable changes in the environment. In addition, a self-triggering mechanism, driven by internal objectives such as improved accuracy, precision, and efficiency, is introduced to autonomously plan and adjust actions, thereby enabling faster and more focused feedback acquisition without external commands. Experimental results demonstrate that the proposed active approach significantly improves the efficiency and robustness of factor identification.

[280] SAGE-32B: Agentic Reasoning via Iterative Distillation

Basab Jha, Firoj Paudel, Ujjwal Puri, Ethan Henkel, Zhang Yuting, Mateusz Kowalczyk, Mei Huang, Choi Donghyuk, Wang Junhao

Main category: cs.AI

TL;DR: SAGE-32B is a 32B parameter language model specialized for agentic reasoning and long-range planning, built on Qwen2.5-32B with iterative distillation and inverse reasoning techniques.

Details

Motivation: The paper aims to create a language model specifically designed for agentic reasoning rather than general conversation. Current chat models lack specialized capabilities for task decomposition, tool usage, and error recovery needed for autonomous agent operations.

Method: 1. Initialized from Qwen2.5-32B pretrained model. 2. Used Iterative Distillation - a two-stage training process with rigorously tested feedback loops. 3. Introduced inverse reasoning approach with a meta-cognition head to forecast potential failures before execution.

Result: SAGE-32B achieves higher success rates on agentic reasoning benchmarks (MMLU-Pro, AgentBench, MATH-500) in multi-tool usage scenarios compared to similarly sized baselines, while remaining competitive on standard reasoning evaluations.

Conclusion: SAGE-32B demonstrates specialized capabilities for agentic reasoning and planning tasks through its novel training approach and inverse reasoning mechanism, offering improved performance for autonomous agent applications.

Abstract: We demonstrate SAGE-32B, a 32 billion parameter language model that focuses on agentic reasoning and long range planning tasks. Unlike chat models that aim for general conversation fluency, SAGE-32B is designed to operate in an agentic loop, emphasizing task decomposition, tool usage, and error recovery. The model is initialized from the Qwen2.5-32B pretrained model and fine tuned using Iterative Distillation, a two stage training process that improves reasoning performance through rigorously tested feedback loops. SAGE-32B also introduces an inverse reasoning approach, which uses a meta cognition head to forecast potential failures in the planning process before execution. On agentic reasoning benchmarks including MMLU-Pro, AgentBench, and MATH-500, SAGE-32B achieves higher success rates in multi tool usage scenarios compared to similarly sized baseline models, while remaining competitive on standard reasoning evaluations. Model weights are publicly released at https://huggingface.co/sagea-ai/sage-reasoning-32b

[281] Solving Cyclic Antibandwidth Problem by SAT

Hieu Truong Xuan, Khanh To Van

Main category: cs.AI

TL;DR: First exact SAT-based approach for Cyclic Antibandwidth Problem (CABP) on general graphs, providing optimality guarantees and outperforming heuristic methods.

Details

Motivation: CABP is an NP-hard graph labeling problem with practical applications, but existing approaches are heuristic/metaheuristic with no exact methods for general graphs. Need for exact approach with optimality guarantees.

Method: SAT-CAB: SAT solving approach with novel efficient encoding transforming CABP into sequence of At-Most-One constraints using compact representation to reduce formula size.

Result: Efficiently solves practical CABP instances, identifies new optimal solutions, proves global optimal values for benchmark instances, consistently matches/surpasses state-of-the-art heuristics (MS-GVNS, HABC-CAB, MACAB) and commercial solvers (CPLEX, Gurobi).

Conclusion: Advances state-of-the-art for CABP, provides new baseline for exact/hybrid methods on general graphs with optimality guarantees.

Abstract: The Cyclic Antibandwidth Problem (CABP), a variant of the Antibandwidth Problem, is an NP-hard graph labeling problem with numerous applications. Despite significant research efforts, existing state-of-the-art approaches for CABP are exclusively heuristic or metaheuristic in nature, and exact methods have been limited to restricted graph classes. In this paper, we present the first exact approach for the CABP on general graphs, based on SAT solving, called SAT-CAB. The proposed method is able to systematically explore the solution space and guarantee global optimality, overcoming the limitations of previously reported heuristic algorithms. This approach relies on a novel and efficient SAT encoding of CABP, in which the problem is transformed into a sequence of At-Most-One constraints. In particular, we introduce a compact representation of the At-Most-One constraints inherent to CABP, which significantly reduces the size of the resulting formulas and enables modern SAT solvers to effectively explore the solution space and to certify global optimality. Extensive computational experiments on standard benchmark instances show that the proposed method efficiently solves CABP instances of practical relevance, while identifying several previously unknown optimal solutions. Moreover, global optimal cyclic antibandwidth values are proven for a number of benchmark instances for the first time. Comparative results indicate that SAT-CAB consistently matches or surpasses the best-known solutions obtained by state-of-the-art heuristic algorithms such as MS-GVNS, HABC-CAB, and MACAB, as well as strong commercial Constraint Programming and Mixed Integer Programming solvers like CPLEX and Gurobi, particularly on general graphs, while also providing optimality guarantees. These results advance the state of the art for CABP and provide a new baseline for exact and hybrid methods on general graphs.

[282] Fuzzy Representation of Norms

Ziba Assadi, Paola Inverardi

Main category: cs.AI

TL;DR: This paper proposes a logical representation of SLEEC rules using fuzzy logic and test-score semantics to embed ethical requirements in autonomous systems, addressing ethical dilemmas through a possibilities-based approach.

Details

Motivation: As AI-powered autonomous systems become more integrated into society, there are growing concerns about their ethical and social impact. To ensure these systems are trustworthy, they must adhere to ethical principles and values. The introduction of SLEEC rules provides a comprehensive framework for representing ethical considerations, but there's a need for formal methods to embed these requirements into system design.

Method: The paper proposes a logical representation of SLEEC (Social, Legal, Ethical, Empathetic, and Cultural) rules using test-score semantics and fuzzy logic. This approach treats ethics as a domain of possibilities rather than binary constraints, allowing for the resolution of ethical dilemmas that AI systems may encounter. The methodology is demonstrated through a case study.

Result: The paper presents a formal framework for representing and embedding SLEEC rules in autonomous systems using fuzzy logic. The approach enables handling of ethical dilemmas by viewing ethics as possibilities rather than rigid constraints, providing a practical method for incorporating comprehensive ethical considerations into AI system design.

Conclusion: The proposed fuzzy logic-based representation of SLEEC rules offers a viable approach to embedding ethical requirements in autonomous systems, addressing the challenge of ethical dilemmas through a possibilities-oriented framework. This contributes to making AI systems more trustworthy by formally incorporating social, legal, ethical, empathetic, and cultural considerations into their design.

Abstract: Autonomous systems (AS) powered by AI components are increasingly integrated into the fabric of our daily lives and society, raising concerns about their ethical and social impact. To be considered trustworthy, AS must adhere to ethical principles and values. This has led to significant research on the identification and incorporation of ethical requirements in AS system design. A recent development in this area is the introduction of SLEEC (Social, Legal, Ethical, Empathetic, and Cultural) rules, which provide a comprehensive framework for representing ethical and other normative considerations. This paper proposes a logical representation of SLEEC rules and presents a methodology to embed these ethical requirements using test-score semantics and fuzzy logic. The use of fuzzy logic is motivated by the view of ethics as a domain of possibilities, which allows the resolution of ethical dilemmas that AI systems may encounter. The proposed approach is illustrated through a case study.

[283] Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models

Brady Steele, Micah Katz

Main category: cs.AI

TL;DR: Multi-agent systems boost reasoning in capable LLMs but not weak ones, with performance tied to active parameters in MoE models and architecture quality.

Details

Motivation: To provide a controlled study of multi-hop contextual reasoning in LLMs, demonstrating task-method dissociation and investigating how multi-agent systems affect reasoning performance across different model architectures.

Method: Used a synthetic evaluation framework with 120 trials across four models (LLaMA-3 8B, LLaMA-2 13B, Mixtral 8x7B, DeepSeek-V2 16B), comparing rule-based pattern matching vs. LLM-based multi-agent systems on structured information retrieval vs. cross-document reasoning tasks.

Result: 1) Multi-agent benefits only models with sufficient base reasoning ability (significant gains for LLaMA-3 8B and Mixtral, up to 46.7pp improvement); 2) Mixtral’s performance aligns with ~12B active parameters, not 47B total; 3) LLaMA-3 8B outperforms LLaMA-2 13B despite fewer parameters.

Conclusion: Multi-agent systems amplify existing capabilities rather than compensate for deficiencies, with reasoning performance depending on base model capability, active parameters in MoE architectures, and architecture quality improvements.

Abstract: We present a controlled study of multi-hop contextual reasoning in large language models, providing a clean demonstration of the task-method dissociation: rule-based pattern matching achieves 100% success on structured information retrieval but only 6.7% on tasks requiring cross-document reasoning, while LLM-based multi-agent systems show the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail. Using a synthetic evaluation framework with 120 trials across four models (LLaMA-3 8B, LLaMA-2 13B, Mixtral 8x7B, DeepSeek-V2 16B), we report three key findings: (1) Multi-agent amplification depends on base capability: statistically significant gains occur only for models with sufficient reasoning ability (p < 0.001 for LLaMA-3 8B, p = 0.014 for Mixtral), with improvements of up to 46.7 percentage points, while weaker models show no benefit, suggesting amplification rather than compensation; (2) Active parameters predict reasoning performance: Mixtral’s performance aligns with its ~12B active parameters rather than 47B total, consistent with the hypothesis that inference-time compute drives reasoning capability in MoE architectures; (3) Architecture quality matters: LLaMA-3 8B outperforms LLaMA-2 13B despite fewer parameters, consistent with known training improvements. Our results provide controlled quantitative evidence for intuitions about multi-agent coordination and MoE scaling, while highlighting the dependence of multi-agent benefits on base model capability. We release our evaluation framework to support reproducible research on reasoning in mid-scale models.

[284] A Future Capabilities Agent for Tactical Air Traffic Control

Paul Kent, George De Ath, Martin Layton, Allen Hart, Richard Everson, Ben Carvell

Main category: cs.AI

TL;DR: Agent Mallard is a rules-based forward-planning agent for air traffic control that embeds a stochastic digital twin for safety verification, using discrete route choices and expert-informed strategies to resolve conflicts while maintaining interpretability.

Details

Motivation: Current air traffic automation faces a trade-off: optimization methods (like RL) offer performance but lack verifiability/interpretability, while rules-based systems are transparent but don't verify safety under uncertainty. There's a need for systems that combine safety assurance with interpretability.

Method: Agent Mallard uses forward-planning with a stochastic digital twin embedded in its conflict-resolution loop. It operates on predefined GPS-guided routes, reducing 4D vectoring to discrete lane/level choices. It constructs hierarchical plans from expert-informed deconfliction strategies and uses depth-limited backtracking search with causal attribution, topological plan splicing, and monotonic axis constraints to find safe plans, validating maneuvers against uncertain execution scenarios.

Result: Preliminary walkthroughs with UK controllers and initial tests in BluebirdDT airspace digital twin show that Mallard’s behavior aligns with expert reasoning and resolves conflicts in simplified scenarios.

Conclusion: The architecture aims to combine model-based safety assessment, interpretable decision logic, and tractable computational performance for future structured en-route environments, addressing the safety-interpretability trade-off in air traffic automation.

Abstract: Escalating air traffic demand is driving the adoption of automation to support air traffic controllers, but existing approaches face a trade-off between safety assurance and interpretability. Optimisation-based methods such as reinforcement learning offer strong performance but are difficult to verify and explain, while rules-based systems are transparent yet rarely check safety under uncertainty. This paper outlines Agent Mallard, a forward-planning, rules-based agent for tactical control in systemised airspace that embeds a stochastic digital twin directly into its conflict-resolution loop. Mallard operates on predefined GPS-guided routes, reducing continuous 4D vectoring to discrete choices over lanes and levels, and constructs hierarchical plans from an expert-informed library of deconfliction strategies. A depth-limited backtracking search uses causal attribution, topological plan splicing, and monotonic axis constraints to seek a complete safe plan for all aircraft, validating each candidate manoeuvre against uncertain execution scenarios (e.g., wind variation, pilot response, communication loss) before commitment. Preliminary walkthroughs with UK controllers and initial tests in the BluebirdDT airspace digital twin indicate that Mallard’s behaviour aligns with expert reasoning and resolves conflicts in simplified scenarios. The architecture is intended to combine model-based safety assessment, interpretable decision logic, and tractable computational performance in future structured en-route environments.

[285] Cross-Language Speaker Attribute Prediction Using MIL and RL

Sunny Shu, Seyed Sahand Mohammadi Ziabari, Ali Mohammed Mansoor Alsahag

Main category: cs.AI

TL;DR: RLMIL-DAT improves multilingual speaker attribute prediction by combining reinforcement learning instance selection with domain adversarial training to create language-invariant representations.

Details

Motivation: Address challenges in multilingual speaker attribute prediction including linguistic variation, domain mismatch, and data imbalance across languages.

Method: Propose RLMIL-DAT: multilingual extension of reinforced multiple instance learning that combines reinforcement learning based instance selection with domain adversarial training to encourage language invariant utterance representations.

Result: Consistently improves Macro F1 across configurations and seeds compared to baselines. Largest gains for gender prediction; age prediction shows smaller but positive improvements. Domain adversarial training is primary contributor to gains.

Conclusion: Combining instance selection with adversarial domain adaptation is effective and robust for cross-lingual speaker attribute prediction.

Abstract: We study multilingual speaker attribute prediction under linguistic variation, domain mismatch, and data imbalance across languages. We propose RLMIL-DAT, a multilingual extension of the reinforced multiple instance learning framework that combines reinforcement learning based instance selection with domain adversarial training to encourage language invariant utterance representations. We evaluate the approach on a five language Twitter corpus in a few shot setting and on a VoxCeleb2 derived corpus covering forty languages in a zero shot setting for gender and age prediction. Across a wide range of model configurations and multiple random seeds, RLMIL-DAT consistently improves Macro F1 compared to standard multiple instance learning and the original reinforced multiple instance learning framework. The largest gains are observed for gender prediction, while age prediction remains more challenging and shows smaller but positive improvements. Ablation experiments indicate that domain adversarial training is the primary contributor to the performance gains, enabling effective transfer from high resource English to lower resource languages by discouraging language specific cues in the shared encoder. In the zero shot setting on the smaller VoxCeleb2 subset, improvements are generally positive but less consistent, reflecting limited statistical power and the difficulty of generalizing to many unseen languages. Overall, the results demonstrate that combining instance selection with adversarial domain adaptation is an effective and robust strategy for cross lingual speaker attribute prediction.

[286] Towards a Mechanistic Understanding of Propositional Logical Reasoning in Large Language Models

Danchun Chen, Qiyao Yan, Liangming Pan

Main category: cs.AI

TL;DR: LLMs use structured computational strategies for propositional logic reasoning with four interlocking mechanisms that generalize across model scales and reasoning depths.

Details

Motivation: To understand the computational strategies LLMs employ for propositional reasoning, moving beyond identifying task-specific circuits to uncover how models organize computation internally.

Method: Comprehensive analysis of Qwen3 (8B and 14B) models on PropLogic-MI dataset spanning 11 propositional logic rule categories across one-hop and two-hop reasoning, examining computational organization rather than just necessary components.

Result: Identified four interlocking computational mechanisms: Staged Computation (layer-wise processing phases), Information Transmission (information flow aggregation at boundary tokens), Fact Retrospection (persistent re-access of source facts), and Specialized Attention Heads (functionally distinct head types). These mechanisms generalize across model scales, rule types, and reasoning depths.

Conclusion: LLMs employ structured computational strategies for logical reasoning, providing mechanistic evidence that models use coherent computational architectures rather than just task-specific circuits, with generalizable mechanisms across different conditions.

Abstract: Understanding how Large Language Models (LLMs) perform logical reasoning internally remains a fundamental challenge. While prior mechanistic studies focus on identifying taskspecific circuits, they leave open the question of what computational strategies LLMs employ for propositional reasoning. We address this gap through comprehensive analysis of Qwen3 (8B and 14B) on PropLogic-MI, a controlled dataset spanning 11 propositional logic rule categories across one-hop and two-hop reasoning. Rather than asking ‘‘which components are necessary,’’ we ask ‘‘how does the model organize computation?’’ Our analysis reveals a coherent computational architecture comprising four interlocking mechanisms: Staged Computation (layer-wise processing phases), Information Transmission (information flow aggregation at boundary tokens), Fact Retrospection (persistent re-access of source facts), and Specialized Attention Heads (functionally distinct head types). These mechanisms generalize across model scales, rule types, and reasoning depths, providing mechanistic evidence that LLMs employ structured computational strategies for logical reasoning.

[287] Systems Explaining Systems: A Framework for Intelligence and Consciousness

Sean Niklas Semmler

Main category: cs.AI

TL;DR: Intelligence and consciousness emerge from relational structure, not prediction. Intelligence is forming causal connections; consciousness arises when recursive systems interpret their own lower-level patterns through context enrichment.

Details

Motivation: To propose an alternative framework to predictive processing and domain-specific mechanisms for understanding intelligence and consciousness, suggesting that these emerge from relational structure and recursive system architectures.

Method: Conceptual framework defining intelligence as capacity to form/integrate causal connections, using context enrichment for efficient processing. Introduces systems-explaining-systems principle where consciousness emerges from recursive architectures interpreting lower-order systems’ patterns.

Result: Reframes predictive processing as emergent consequence of contextual interpretation rather than explicit forecasting. Suggests recursive multi-system architectures may be necessary for human-like AI.

Conclusion: Intelligence and consciousness fundamentally arise from relational structure and recursive self-interpretation, not from prediction or specialized mechanisms, offering new directions for AI development.

Abstract: This paper proposes a conceptual framework in which intelligence and consciousness emerge from relational structure rather than from prediction or domain-specific mechanisms. Intelligence is defined as the capacity to form and integrate causal connections between signals, actions, and internal states. Through context enrichment, systems interpret incoming information using learned relational structure that provides essential context in an efficient representation that the raw input itself does not contain, enabling efficient processing under metabolic constraints. Building on this foundation, we introduce the systems-explaining-systems principle, where consciousness emerges when recursive architectures allow higher-order systems to learn and interpret the relational patterns of lower-order systems across time. These interpretations are integrated into a dynamically stabilized meta-state and fed back through context enrichment, transforming internal models from representations of the external world into models of the system’s own cognitive processes. The framework reframes predictive processing as an emergent consequence of contextual interpretation rather than explicit forecasting and suggests that recursive multi-system architectures may be necessary for more human-like artificial intelligence.

[288] Correcting Autonomous Driving Object Detection Misclassifications with Automated Commonsense Reasoning

Keegan Kimbrell, Wang Tianhao, Feng Chen, Gopal Gupta

Main category: cs.AI

TL;DR: The paper argues that over-reliance on machine learning is hindering SAE Level 5 AV development, and proposes using automated commonsense reasoning to handle abnormal road scenarios where training data is insufficient.

Details

Motivation: Despite heavy research, no SAE Level 5 AVs exist commercially. The authors contend that over-reliance on machine learning is the main barrier, and that automated commonsense reasoning can help achieve true autonomy by handling scenarios with insufficient training data.

Method: The paper deploys automated commonsense reasoning technology to handle abnormal road scenarios: (1) malfunctioning traffic signals at intersections, and (2) unexpected obstructions causing unusual vehicle behavior. They propose a hybrid approach that measures uncertainty in computer vision models and invokes commonsense reasoning for uncertain scenarios. Experiments are conducted using the CARLA simulator.

Result: The commonsense reasoning-based solution accurately detects traffic light colors and obstacles that are misclassified by the AV’s perception model. The hybrid approach effectively corrects object detection misclassifications in abnormal scenarios.

Conclusion: Automated commonsense reasoning effectively corrects AV-based object detection misclassifications, and hybrid models combining machine learning with commonsense reasoning provide an effective pathway to improving AV perception and achieving SAE Level 5 autonomy.

Abstract: Autonomous Vehicle (AV) technology has been heavily researched and sought after, yet there are no SAE Level 5 AVs available today in the marketplace. We contend that over-reliance on machine learning technology is the main reason. Use of automated commonsense reasoning technology, we believe, can help achieve SAE Level 5 autonomy. In this paper, we show how automated common-sense reasoning technology can be deployed in situations where there are not enough data samples available to train a deep learning-based AV model that can handle certain abnormal road scenarios. Specifically, we consider two situations where (i) a traffic signal is malfunctioning at an intersection and (ii) all the cars ahead are slowing down and steering away due to an unexpected obstruction (e.g., animals on the road). We show that in such situations, our commonsense reasoning-based solution accurately detects traffic light colors and obstacles not correctly captured by the AV’s perception model. We also provide a pathway for efficiently invoking commonsense reasoning by measuring uncertainty in the computer vision model and using commonsense reasoning to handle uncertain scenarios. We describe our experiments conducted using the CARLA simulator and the results obtained. The main contribution of our research is to show that automated commonsense reasoning effectively corrects AV-based object detection misclassifications and that hybrid models provide an effective pathway to improving AV perception.

[289] A Closed-Loop Multi-Agent System Driven by LLMs for Meal-Level Personalized Nutrition Management

Muqing Xu

Main category: cs.AI

TL;DR: A mobile nutrition assistant that combines image-based meal logging with LLM-driven multi-agent control for personalized dietary management, showing competitive nutrient estimation and personalized meal planning.

Details

Motivation: Current nutrition systems handle food logging, nutrient analysis, and recommendations separately, lacking integrated closed-loop support for personalized dietary management.

Method: Multi-agent LLM controller coordinating vision, dialogue, and state management agents to estimate nutrients from meal photos, update daily intake budgets, and adapt next meal plans based on user preferences and constraints.

Result: Experiments with SNAPMe meal images and simulated users show competitive nutrient estimation, personalized menus, and efficient task plans.

Conclusion: Demonstrates feasibility of multi-agent LLM control for personalized nutrition, while revealing challenges in micronutrient estimation from images and need for large-scale real-world studies.

Abstract: Personalized nutrition management aims to tailor dietary guidance to an individual’s intake and phenotype, but most existing systems handle food logging, nutrient analysis and recommendation separately. We present a next-generation mobile nutrition assistant that combines image based meal logging with an LLM driven multi agent controller to provide meal level closed loop support. The system coordinates vision, dialogue and state management agents to estimate nutrients from photos and update a daily intake budget. It then adapts the next meal plan to user preferences and dietary constraints. Experiments with SNAPMe meal images and simulated users show competitive nutrient estimation, personalized menus and efficient task plans. These findings demonstrate the feasibility of multi agent LLM control for personalized nutrition and reveal open challenges in micronutrient estimation from images and in large scale real world studies.

[290] Propositional Abduction via Only-Knowing: A Non-Monotonic Approach

Sanderson Molick, Vaishak Belle

Main category: cs.AI

TL;DR: Extends Levesque’s logic of only-knowing with an abduction modal operator to create a modal logic framework for abductive reasoning, with non-monotonic extensions via preferential relations.

Details

Motivation: To develop an alternative approach to abduction using modal logic vocabulary and explore the relationship between abductive reasoning and epistemic states of "only knowing." The goal is to provide a formal foundation for abductive reasoning that can express different selection methods for explanations.

Method: Extends Levesque’s logic of only-knowing by adding an abduction modal operator defined through basic epistemic concepts. Further incorporates preferential relations into modal frames to create a non-monotonic extension that can express different selection methods for abductive explanations.

Result: Develops a basic logic of knowledge and abduction with modal operators, provides non-monotonic extensions via preferential relations, and explores core metatheoretic properties of non-monotonic consequence relations within this framework.

Conclusion: The framework provides a well-behaved foundation for abductive reasoning that connects modal logic with abductive inference, offering formal tools to express and analyze different explanation selection methods through non-monotonic extensions.

Abstract: The paper introduces a basic logic of knowledge and abduction by extending Levesque logic of only-knowing with an abduction modal operator defined via the combination of basic epistemic concepts. The upshot is an alternative approach to abduction that employs a modal vocabulary and explores the relation between abductive reasoning and epistemic states of only knowing. Furthermore, by incorporating a preferential relation into modal frames, we provide a non-monotonic extension of our basic framework capable of expressing different selection methods for abductive explanations. Core metatheoretic properties of non-monotonic consequence relations are explored within this setting and shown to provide a well-behaved foundation for abductive reasoning.

[291] Autonomous Agents on Blockchains: Standards, Execution Models, and Trust Boundaries

Saad Alqithami

Main category: cs.AI

TL;DR: Survey paper analyzing convergence of AI agents and blockchain systems, proposing taxonomy, threat models, and research roadmap for secure agent-blockchain interoperability.

Details

Motivation: The convergence of AI agents (capable of reasoning/planning) and blockchains (programmable value transfer) creates a high-stakes systems challenge requiring secure, standardized interfaces to enable agents to interact with blockchains without exposing users/protocols to unacceptable risks.

Method: Systematic literature review of 317 relevant works from over 3000 records, developing a five-part taxonomy of integration patterns, threat model for agent-driven transactions, and comparative analysis of 20+ systems across 13 dimensions.

Result: Identified gaps in current systems and proposed research roadmap centered on two interface abstractions: Transaction Intent Schema (portable goal specification) and Policy Decision Record (auditable policy enforcement), plus reproducible evaluation suite for safety assessment.

Conclusion: The paper systematizes the emerging field of agent-blockchain interoperability, providing foundational taxonomy, threat models, and concrete research directions to enable safe, reliable, and economically robust agent-mediated on-chain execution.

Abstract: Advances in large language models have enabled agentic AI systems that can reason, plan, and interact with external tools to execute multi-step workflows, while public blockchains have evolved into a programmable substrate for value transfer, access control, and verifiable state transitions. Their convergence introduces a high-stakes systems challenge: designing standard, interoperable, and secure interfaces that allow agents to observe on-chain state, formulate transaction intents, and authorize execution without exposing users, protocols, or organizations to unacceptable security, governance, or economic risks. This survey systematizes the emerging landscape of agent-blockchain interoperability through a systematic literature review, identifying 317 relevant works from an initial pool of over 3000 records. We contribute a five-part taxonomy of integration patterns spanning read-only analytics, simulation and intent generation, delegated execution, autonomous signing, and multi-agent workflows; a threat model tailored to agent-driven transaction pipelines that captures risks ranging from prompt injection and policy misuse to key compromise, adversarial execution dynamics, and multi-agent collusion; and a comparative capability matrix analyzing more than 20 representative systems across 13 dimensions, including custody models, permissioning, policy enforcement, observability, and recovery. Building on the gaps revealed by this analysis, we outline a research roadmap centered on two interface abstractions: a Transaction Intent Schema for portable and unambiguous goal specification, and a Policy Decision Record for auditable, verifiable policy enforcement across execution environments. We conclude by proposing a reproducible evaluation suite and benchmarks for assessing the safety, reliability, and economic robustness of agent-mediated on-chain execution.

[292] Hybrid MKNF for Aeronautics Applications: Usage and Heuristics

Arun Raveendran Nair Sheela, Florence De Grancey, Christophe Rey, Victor Charpenay

Main category: cs.AI

TL;DR: The paper evaluates Hybrid MKNF knowledge representation language for aeronautics applications, identifying needed expressivity features and proposing integration heuristics.

Details

Motivation: Aeronautics applications require both high expressivity for complex domain knowledge and efficient reasoning with minimal computational overhead. Integrating rules and ontologies is key to achieving this balance.

Method: Used Hybrid MKNF language for its seamless integration of rules and ontologies. Conducted a concrete aeronautics case study to evaluate suitability, identified missing expressivity features, and proposed heuristics for their integration.

Result: Hybrid MKNF was evaluated for aeronautics domain suitability. Additional expressivity features crucial for aeronautics applications were identified, and integration heuristics were proposed.

Conclusion: Hybrid MKNF shows promise for aeronautics applications but requires additional expressivity features. The proposed heuristics provide a pathway for enhancing the framework to better support aeronautics domain requirements.

Abstract: The deployment of knowledge representation and reasoning technologies in aeronautics applications presents two main challenges: achieving sufficient expressivity to capture complex domain knowledge, and executing reasoning tasks efficiently while minimizing memory usage and computational overhead. An effective strategy for attaining necessary expressivity involves integrating two fundamental KR concepts: rules and ontologies. This study adopts the well-established KR language Hybrid MKNF owing to its seamless integration of rules and ontologies through its semantics and query answering capabilities. We evaluated Hybrid MKNF to assess its suitability in the aeronautics domain through a concrete case study. We identified additional expressivity features that are crucial for developing aeronautics applications and proposed a set of heuristics to support their integration into Hybrid MKNF framework.

[293] Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models

Can Xu, Lingyong Yan, Jiayi Wu, Haosen Wang, Shuaiqiang Wang, Yuchen Li, Jizhou Huang, Dawei Yin, Xiang Li

Main category: cs.AI

TL;DR: ARR framework uses adversarial reasoning between Reasoner and Verifier with process-aware rewards to improve RAG reasoning quality without external scoring models.

Details

Motivation: Current RAG systems have two key limitations: 1) reasoning models operate from single perspectives without self-correction, and 2) training relies too much on outcome-oriented rewards that don't adequately shape complex multi-step reasoning processes.

Method: Proposes Adversarial Reasoning RAG (ARR) with Reasoner-Verifier framework where both engage in reasoning on retrieved evidence and critique each other’s logic, guided by process-aware advantage that combines explicit observational signals with internal model uncertainty.

Result: Experiments on multiple benchmarks demonstrate the effectiveness of the method.

Conclusion: The ARR framework addresses limitations of current RAG systems by enabling adversarial reasoning with process-aware rewards, improving reasoning fidelity and verification rigor without requiring external scoring models.

Abstract: Recent advances in synergizing large reasoning models (LRMs) with retrieval-augmented generation (RAG) have shown promising results, yet two critical challenges remain: (1) reasoning models typically operate from a single, unchallenged perspective, limiting their ability to conduct deep, self-correcting reasoning over external documents, and (2) existing training paradigms rely excessively on outcome-oriented rewards, which provide insufficient signal for shaping the complex, multi-step reasoning process. To address these issues, we propose an Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other’s logic while being guided by process-aware advantage that requires no external scoring model. This reward combines explicit observational signals with internal model uncertainty to jointly optimize reasoning fidelity and verification rigor. Experiments on multiple benchmarks demonstrate the effectiveness of our method.

[294] An ASP-based Solution to the Medical Appointment Scheduling Problem

Alina Vozna, Andrea Monaldini, Stefania Costantini, Valentina Pitoni, Dawid Pado

Main category: cs.AI

TL;DR: ASP-based medical scheduling framework that personalizes appointments for vulnerable populations using Blueprint Personas, with real-time updates and healthcare system integration.

Details

Motivation: To improve efficiency, reduce administrative overhead, and enhance patient-centered care in medical appointment scheduling, particularly for vulnerable populations.

Method: Uses Answer Set Programming (ASP) framework that integrates Blueprint Personas for personalization, centralizes planning operations within an ASP logic model, and ensures real-time availability updates and conflict-free assignments.

Result: A scheduling framework that provides personalized scheduling for vulnerable populations, ensures real-time availability updates, conflict-free assignments, and seamless interoperability with existing healthcare platforms.

Conclusion: ASP-based approach offers an effective solution for medical appointment scheduling that balances efficiency, personalization, and system integration, particularly benefiting vulnerable patient populations.

Abstract: This paper presents an Answer Set Programming (ASP)-based framework for medical appointment scheduling, aimed at improving efficiency, reducing administrative overhead, and enhancing patient-centered care. The framework personalizes scheduling for vulnerable populations by integrating Blueprint Personas. It ensures real-time availability updates, conflict-free assignments, and seamless interoperability with existing healthcare platforms by centralizing planning operations within an ASP logic model.

[295] When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail

Xiaoxiao Li

Main category: cs.AI

TL;DR: LLM skill selection shows bounded capacity with phase transition behavior - stable accuracy up to critical library size then sharp drop, influenced by semantic confusability rather than size alone.

Details

Motivation: Multi-agent AI systems have computational overhead from explicit communication. The paper explores whether similar modularity benefits can be achieved with single agents using skill libraries, and investigates how skill selection scales as libraries grow.

Method: View skills as internalized agent behaviors, compiling multi-agent systems into equivalent single-agent systems. Investigate scaling behavior of skill selection, drawing on cognitive science principles to analyze bounded capacity. Study semantic confusability effects and test hierarchical organization approaches.

Result: Skill selection shows phase transition: accuracy remains stable up to critical library size then drops sharply. Semantic confusability among similar skills plays central role in degradation. Hierarchical routing shows promise for managing complex choices.

Conclusion: LLM skill selection exhibits bounded capacity analogous to human cognition. Hierarchical organization can help manage complex choices. Work opens questions about fundamental limits of semantic-based skill selection and provides cognitive-grounded framework for designing scalable skill-based agents.

Abstract: Multi-agent AI systems have proven effective for complex reasoning. These systems are compounded by specialized agents, which collaborate through explicit communication, but incur substantial computational overhead. A natural question arises: can we achieve similar modularity benefits with a single agent that selects from a library of skills? We explore this question by viewing skills as internalized agent behaviors. From this perspective, a multi-agent system can be compiled into an equivalent single-agent system, trading inter-agent communication for skill selection. Our preliminary experiments suggest this approach can substantially reduce token usage and latency while maintaining competitive accuracy on reasoning benchmarks. However, this efficiency raises a deeper question that has received little attention: how does skill selection scale as libraries grow? Drawing on principles from cognitive science, we propose that LLM skill selection exhibits bounded capacity analogous to human decision-making. We investigate the scaling behavior of skill selection and observe a striking pattern. Rather than degrading gradually, selection accuracy remains stable up to a critical library size, then drops sharply, indicating a phase transition reminiscent of capacity limits in human cognition. Furthermore, we find evidence that semantic confusability among similar skills, rather than library size alone, plays a central role in this degradation. This perspective suggests that hierarchical organization, which has long helped humans manage complex choices, may similarly benefit AI systems. Our initial results with hierarchical routing support this hypothesis. This work opens new questions about the fundamental limits of semantic-based skill selection in LLMs and offers a cognitive-grounded framework and practical guidelines for designing scalable skill-based agents.

[296] Pilot Study on Student Public Opinion Regarding GAI

William Franz Lamberti, Sunbin Kim, Samantha Rose Lawrence

Main category: cs.AI

TL;DR: University students’ perceptions of generative AI in higher education classrooms were studied, revealing challenges in student engagement for GAI research and highlighting the need for larger sample sizes.

Details

Motivation: The emergence of generative AI has sparked diverse opinions about its appropriate use in education, creating a need to understand student perspectives to better prepare instructors for integrating GAI discussions into classrooms.

Method: Pilot study investigating university students’ perceptions of GAI in higher education classrooms, with a participation rate of approximately 4.4%.

Result: The study highlights challenges in engaging students in GAI-related research and underscores the need for larger sample sizes in future studies to gain more reliable insights.

Conclusion: Understanding student perspectives on GAI can help instructors better prepare to integrate discussions of this transformative technology into classrooms, fostering informed and critical engagement among students.

Abstract: The emergence of generative AI (GAI) has sparked diverse opinions regarding its appropriate use across various domains, including education. This pilot study investigates university students’ perceptions of GAI in higher education classrooms, aiming to lay the groundwork for understanding these attitudes. With a participation rate of approximately 4.4%, the study highlights the challenges of engaging students in GAI-related research and underscores the need for larger sample sizes in future studies. By gaining insights into student perspectives, instructors can better prepare to integrate discussions of GAI into their classrooms, fostering informed and critical engagement with this transformative technology.

[297] Defense Against Indirect Prompt Injection via Tool Result Parsing

Qiang Yu, Xinran Cheng, Chuanyi Liu

Main category: cs.AI

TL;DR: A novel defense method against Indirect Prompt Injection attacks on LLM agents by parsing tool results to filter malicious code while maintaining utility.

Details

Motivation: As LLM agents gain physical control capabilities, they face increasing threats from indirect prompt injection attacks that can hijack decision-making. Existing defenses either have high computational overhead (training-based) or limited robustness (prompt-based).

Method: Proposes a novel approach that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code.

Result: Achieves competitive Utility under Attack (UA) while maintaining the lowest Attack Success Rate (ASR) to date, significantly outperforming existing methods.

Conclusion: The proposed method offers an effective defense against indirect prompt injection attacks for LLM agents in physical control scenarios, balancing security and utility.

Abstract: As LLM agents transition from digital assistants to physical controllers in autonomous systems and robotics, they face an escalating threat from indirect prompt injection. By embedding adversarial instructions into the results of tool calls, attackers can hijack the agent’s decision-making process to execute unauthorized actions. This vulnerability poses a significant risk as agents gain more direct control over physical environments. Existing defense mechanisms against Indirect Prompt Injection (IPI) generally fall into two categories. The first involves training dedicated detection models; however, this approach entails high computational overhead for both training and inference, and requires frequent updates to keep pace with evolving attack vectors. Alternatively, prompt-based methods leverage the inherent capabilities of LLMs to detect or ignore malicious instructions via prompt engineering. Despite their flexibility, most current prompt-based defenses suffer from high Attack Success Rates (ASR), demonstrating limited robustness against sophisticated injection attacks. In this paper, we propose a novel method that provides LLMs with precise data via tool result parsing while effectively filtering out injected malicious code. Our approach achieves competitive Utility under Attack (UA) while maintaining the lowest Attack Success Rate (ASR) to date, significantly outperforming existing methods. Code is available at GitHub.

[298] The Language of Bargaining: Linguistic Effects in LLM Negotiations

Stuti Sinha, Himanshu Kumar, Aryan Raju Mandapati, Rakshit Sakhuja, Dhruv Kumar

Main category: cs.AI

TL;DR: LLM negotiation outcomes vary significantly across languages (English vs. Indic languages), with language choice sometimes having stronger effects than model changes, challenging English-only evaluation practices.

Details

Motivation: Most LLM negotiation evaluations occur exclusively in English, creating a gap in understanding how language choice affects negotiation outcomes and potentially leading to incomplete or misleading conclusions about LLM capabilities.

Method: Used controlled multi-agent simulations across three negotiation games (Ultimatum, Buy-Sell, Resource Exchange) with English and four Indic languages (Hindi, Punjabi, Gujarati, Marwadi), holding game rules, model parameters, and incentives constant across all conditions.

Result: Language choice can shift outcomes more strongly than changing models, reversing proposer advantages and reallocating surplus. Effects are task-contingent: Indic languages reduce stability in distributive games but induce richer exploration in integrative settings.

Conclusion: Evaluating LLM negotiation solely in English yields incomplete and potentially misleading conclusions, cautioning against English-only evaluation and suggesting culturally-aware evaluation is essential for fair deployment.

Abstract: Negotiation is a core component of social intelligence, requiring agents to balance strategic reasoning, cooperation, and social norms. Recent work shows that LLMs can engage in multi-turn negotiation, yet nearly all evaluations occur exclusively in English. Using controlled multi-agent simulations across Ultimatum, Buy-Sell, and Resource Exchange games, we systematically isolate language effects across English and four Indic framings (Hindi, Punjabi, Gujarati, Marwadi) by holding game rules, model parameters, and incentives constant across all conditions. We find that language choice can shift outcomes more strongly than changing models, reversing proposer advantages and reallocating surplus. Crucially, effects are task-contingent: Indic languages reduce stability in distributive games yet induce richer exploration in integrative settings. Our results demonstrate that evaluating LLM negotiation solely in English yields incomplete and potentially misleading conclusions. These findings caution against English-only evaluation of LLMs and suggest that culturally-aware evaluation is essential for fair deployment.

[299] LLM-Guided Lifecycle-Aware Clustering of Multi-Turn Customer Support Conversations

Priyaranjan Pattnayak, Sanchari Chowdhuri, Amit Agarwal, Hitesh Laxmichand Patel

Main category: cs.AI

TL;DR: Adaptive clustering system for customer chat data that segments multi-turn chats into service-specific concerns and incrementally refines clusters using LLM-based splitting only for degraded clusters, avoiding full reclustering.

Details

Motivation: Traditional clustering methods struggle with overlapping concerns in customer chat data, creating broad static clusters that degrade over time. Full reclustering disrupts continuity and makes issue tracking difficult for cloud providers handling multi-service queries.

Method: Proposes an adaptive system that segments multi-turn chats into service-specific concerns and incrementally refines clusters as new issues arise. Uses Davies-Bouldin Index and Silhouette Scores to track cluster quality, applying LLM-based splitting only to degraded clusters.

Result: Improves Silhouette Scores by over 100% and reduces Davies-Bouldin Index by 65.6% compared to baselines. Enables scalable, real-time analytics without requiring full reclustering.

Conclusion: The adaptive clustering system effectively addresses the limitations of traditional methods by providing continuous, high-quality clustering for customer chat data while maintaining tracking continuity and enabling real-time analytics at scale.

Abstract: Clustering customer chat data is vital for cloud providers handling multi service queries. Traditional methods struggle with overlapping concerns and create broad, static clusters that degrade over time. Reclustering disrupts continuity, making issue tracking difficult. We propose an adaptive system that segments multi turn chats into service specific concerns and incrementally refines clusters as new issues arise. Cluster quality is tracked via DaviesBouldin Index and Silhouette Scores, with LLM based splitting applied only to degraded clusters. Our method improves Silhouette Scores by over 100% and reduces DBI by 65.6% compared to baselines, enabling scalable, real time analytics without full reclustering.

[300] SciFig: Towards Automating Scientific Figure Generation

Siyuan Huang, Yutong Gao, Juyang Bai, Yifan Zhou, Zi Yin, Xinxin Liu, Rama Chellappa, Chun Pong Lau, Sayan Nag, Cheng Peng, Shraman Pramanick

Main category: cs.AI

TL;DR: SciFig is an AI system that automatically generates publication-ready scientific figures from research paper texts using hierarchical layout generation and iterative feedback.

Details

Motivation: Scientific figure creation is time-consuming, requires both domain expertise and design skills, and remains largely manual despite millions of papers published annually.

Method: Uses hierarchical layout generation: parses research descriptions to identify component relationships, groups elements into functional modules, and generates inter-module connections. Includes iterative chain-of-thought feedback mechanism for progressive layout improvement through visual analysis and reasoning.

Result: Achieves 70.1% overall quality on dataset-level evaluation and 66.2% on paper-specific evaluation, with consistently high scores across visual clarity, structural organization, and scientific accuracy metrics.

Conclusion: SciFig demonstrates effective automated scientific figure generation, with the pipeline and evaluation benchmark to be open-sourced for broader use.

Abstract: Creating high-quality figures and visualizations for scientific papers is a time-consuming task that requires both deep domain knowledge and professional design skills. Despite over 2.5 million scientific papers published annually, the figure generation process remains largely manual. We introduce $\textbf{SciFig}$, an end-to-end AI agent system that generates publication-ready pipeline figures directly from research paper texts. SciFig uses a hierarchical layout generation strategy, which parses research descriptions to identify component relationships, groups related elements into functional modules, and generates inter-module connections to establish visual organization. Furthermore, an iterative chain-of-thought (CoT) feedback mechanism progressively improves layouts through multiple rounds of visual analysis and reasoning. We introduce a rubric-based evaluation framework that analyzes 2,219 real scientific figures to extract evaluation rubrics and automatically generates comprehensive evaluation criteria. SciFig demonstrates remarkable performance: achieving 70.1$%$ overall quality on dataset-level evaluation and 66.2$%$ on paper-specific evaluation, and consistently high scores across metrics such as visual clarity, structural organization, and scientific accuracy. SciFig figure generation pipeline and our evaluation benchmark will be open-sourced.

[301] Assessing the quality and coherence of word embeddings after SCM-based intersectional bias mitigation

Eren Kocadag, Seyed Sahand Mohammadi Ziabari, Ali Mohammed Mansoor Alsahag

Main category: cs.AI

TL;DR: This paper explores intersectional bias mitigation in static word embeddings using the Stereotype Content Model, comparing three debiasing methods (Subtraction, Linear Projection, Partial Projection) and two compound representation approaches (summation vs concatenation) across three embedding families.

Details

Motivation: Static word embeddings absorb social biases from training data, which can propagate to downstream systems. Prior work using the Stereotype Content Model has focused on single-group bias along warmth and competence dimensions, but there's a need to address intersectional bias involving combinations of social identities.

Method: The study builds compound representations for pairs of social identities using summation or concatenation, then applies three debiasing strategies: Subtraction, Linear Projection, and Partial Projection. It evaluates three embedding families (Word2Vec, GloVe, ConceptNet Numberbatch) using two utility metrics: local neighborhood coherence and analogy behavior preservation.

Result: SCM-based mitigation works well for intersectional bias while largely preserving semantic structure. Partial Projection is reliably conservative, Linear Projection can be more assertive, and Subtraction remains competitive. The choice between summation and concatenation depends on embedding family and application goals, revealing a trade-off between geometry preservation and analogy performance.

Conclusion: Intersectional debiasing with SCM is practical in static embeddings, with guidance available for selecting aggregation and debiasing methods based on whether stability or analogy performance is prioritized.

Abstract: Static word embeddings often absorb social biases from the text they learn from, and those biases can quietly shape downstream systems. Prior work that uses the Stereotype Content Model (SCM) has focused mostly on single-group bias along warmth and competence. We broaden that lens to intersectional bias by building compound representations for pairs of social identities through summation or concatenation, and by applying three debiasing strategies: Subtraction, Linear Projection, and Partial Projection. We study three widely used embedding families (Word2Vec, GloVe, and ConceptNet Numberbatch) and assess them with two complementary views of utility: whether local neighborhoods remain coherent and whether analogy behavior is preserved. Across models, SCM-based mitigation carries over well to the intersectional case and largely keeps the overall semantic landscape intact. The main cost is a familiar trade off: methods that most tightly preserve geometry tend to be more cautious about analogy behavior, while more assertive projections can improve analogies at the expense of strict neighborhood stability. Partial Projection is reliably conservative and keeps representations steady; Linear Projection can be more assertive; Subtraction is a simple baseline that remains competitive. The choice between summation and concatenation depends on the embedding family and the application goal. Together, these findings suggest that intersectional debiasing with SCM is practical in static embeddings, and they offer guidance for selecting aggregation and debiasing settings when balancing stability against analogy performance.

[302] Transitive Expert Error and Routing Problems in Complex AI Systems

Forest Mars

Main category: cs.AI

TL;DR: Transitive Expert Error (TEE) describes systematic vulnerabilities at domain boundaries where experts’ reliable within-domain mechanisms become liabilities, causing confident but causally incorrect outputs. This applies to both human experts and AI systems.

Details

Motivation: To understand why domain expertise, while enhancing judgment within domains, creates systematic vulnerabilities specifically at domain boundaries, and to extend this understanding to AI systems that exhibit similar failure patterns.

Method: The paper identifies two core mechanisms: structural similarity bias (overweighting surface features while missing causal differences) and authority persistence (maintaining confidence across competence boundaries). It analyzes conditions intensifying these mechanisms and extends the framework to AI routing architectures.

Result: TEE produces a “hallucination phenotype” - confident, coherent, structurally plausible but causally incorrect outputs at domain boundaries. In AI systems, this manifests as routing-induced failures (wrong specialist selected) and coverage-induced failures (no appropriate specialist exists).

Conclusion: TEE mechanisms that are cognitive black boxes in humans become explicit and addressable in AI architectures. The paper proposes interventions at router, specialist, and training levels, and notes that what’s intractable in human cognition becomes addressable through architectural design.

Abstract: Domain expertise enhances judgment within boundaries but creates systematic vulnerabilities specifically at borders. We term this Transitive Expert Error (TEE), distinct from Dunning-Kruger effects, requiring calibrated expertise as precondition. Mechanisms enabling reliable within-domain judgment become liabilities when structural similarity masks causal divergence. Two core mechanisms operate: structural similarity bias causes experts to overweight surface features (shared vocabulary, patterns, formal structure) while missing causal architecture differences; authority persistence maintains confidence across competence boundaries through social reinforcement and metacognitive failures (experts experience no subjective uncertainty as pattern recognition operates smoothly on familiar-seeming inputs.) These mechanism intensify under three conditions: shared vocabulary masking divergent processes, social pressure for immediate judgment, and delayed feedback. These findings extend to AI routing architectures (MoE systems, multi-model orchestration, tool-using agents, RAG systems) exhibiting routing-induced failures (wrong specialist selected) and coverage-induced failures (no appropriate specialist exists). Both produce a hallucination phenotype: confident, coherent, structurally plausible but causally incorrect outputs at domain boundaries. In human systems where mechanisms are cognitive black boxes; AI architectures make them explicit and addressable. We propose interventions: multi-expert activation with disagreement detection (router level), boundary-aware calibration (specialist level), and coverage gap detection (training level). TEE has detectable signatures (routing patterns, confidence-accuracy dissociations, domain-inappropriate content) enabling monitoring and mitigation. What remains intractable in human cognition becomes addressable through architectural design.

[303] XGrammar 2: Dynamic and Efficient Structured Generation Engine for Agentic LLMs

Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, Tianqi Chen

Main category: cs.AI

TL;DR: XGrammar 2 is an optimized structured generation engine for LLM agents that achieves 6x speedup over existing engines through dynamic dispatching, JIT compilation, and caching mechanisms.

Details

Motivation: Modern LLM agents need to handle complex dynamic structured generation tasks (tool calling, conditional generation) that are more challenging than predefined structures, requiring better optimization of current structured generation engines.

Method: Proposes XGrammar 2 with: 1) TagDispatch dynamic dispatching semantics for mask generation acceleration, 2) JIT compilation to reduce compilation time, 3) cross-grammar caching for common sub-structures, 4) extension from PDA-based to Earley-parser-based mask generation, and 5) repetition compression algorithm for grammar repetition structures.

Result: XGrammar 2 achieves more than 6x speedup over existing structured generation engines and can handle dynamic structured generation tasks with near-zero overhead when integrated with LLM inference engines.

Conclusion: XGrammar 2 provides a highly optimized solution for dynamic structured generation in LLM agents, significantly improving performance through novel dispatching semantics, compilation optimizations, and caching mechanisms.

Abstract: Modern LLM agents are required to handle increasingly complex structured generation tasks, such as tool calling and conditional structured generation. These tasks are significantly more dynamic than predefined structures, posing new challenges to the current structured generation engines. In this paper, we propose XGrammar 2, a highly optimized structured generation engine for agentic LLMs. XGrammar 2 accelerates the mask generation for these dynamic structured generation tasks through a new dynamic dispatching semantics: TagDispatch. We further introduce a just-in-time (JIT) compilation method to reduce compilation time and a cross-grammar caching mechanism to leverage the common sub-structures across different grammars. Additionally, we extend the previous PDA-based mask generation algorithm to the Earley-parser-based one and design a repetition compression algorithm to handle repetition structures in grammars. Evaluation results show that XGrammar 2 can achieve more than 6x speedup over the existing structured generation engines. Integrated with an LLM inference engine, XGrammar 2 can handle dynamic structured generation tasks with near-zero overhead.

[304] Categorical Belief Propagation: Sheaf-Theoretic Inference via Descent and Holonomy

Enrique ter Horst, Sridhar Mahadevan, Juan Diego Zambrano

Main category: cs.AI

TL;DR: A categorical framework for belief propagation that unifies exact inference, junction trees, and loopy BP failures using sheaf theory, with a practical HATCC algorithm for exact inference with improved performance.

Details

Motivation: To provide a rigorous categorical foundation for belief propagation that unifies various inference methods (tree exactness, junction trees, loopy BP) and explains failures in loopy BP through sheaf-theoretic obstructions.

Method: Construct free hypergraph category on typed signatures with universal property; formulate message-passing via Grothendieck fibration over polarized factor graphs; characterize exact inference as effective descent; develop HATCC algorithm that detects descent obstructions via holonomy computation, compiles non-trivial holonomy into mode variables, and reduces to tree BP on augmented graph.

Result: Theoretical framework unifies tree exactness, junction tree algorithms, and loopy BP failures; HATCC algorithm achieves exact inference with complexity O(n²d_max + c·k_max·δ_max³ + n·δ_max²); experiments show significant speedup over junction trees on grid MRFs and random graphs, plus UNSAT detection on satisfiability instances.

Conclusion: Categorical semantics provides foundational understanding of belief propagation; sheaf-theoretic obstructions explain loopy BP failures; HATCC offers practical exact inference algorithm with improved performance over traditional methods.

Abstract: We develop a categorical foundation for belief propagation on factor graphs. We construct the free hypergraph category (\Syn_Σ) on a typed signature and prove its universal property, yielding compositional semantics via a unique functor to the matrix category (\cat{Mat}R). Message-passing is formulated using a Grothendieck fibration (\int\Msg \to \cat{FG}Σ) over polarized factor graphs, with schedule-indexed endomorphisms defining BP updates. We characterize exact inference as effective descent: local beliefs form a descent datum when compatibility conditions hold on overlaps. This framework unifies tree exactness, junction tree algorithms, and loopy BP failures under sheaf-theoretic obstructions. We introduce HATCC (Holonomy-Aware Tree Compilation), an algorithm that detects descent obstructions via holonomy computation on the factor nerve, compiles non-trivial holonomy into mode variables, and reduces to tree BP on an augmented graph. Complexity is (O(n^2 d{\max} + c \cdot k{\max} \cdot δ_{\max}^3 + n \cdot δ_{\max}^2)) for (n) factors and (c) fundamental cycles. Experimental results demonstrate exact inference with significant speedup over junction trees on grid MRFs and random graphs, along with UNSAT detection on satisfiability instances.

[305] Computational Compliance for AI Regulation: Blueprint for a New Research Domain

Bill Marino, Nicholas D. Lane

Main category: cs.AI

TL;DR: The paper argues that AI systems need computational compliance algorithms to meet AI regulations at scale, proposes design goals for such algorithms, and creates a benchmark dataset to measure their performance.

Details

Motivation: Traditional compliance methods are insufficient for AI systems to meet emerging AI regulations at the necessary speed and scale. The research community lacks clear specifications for computational compliance algorithms and benchmarks to evaluate them.

Method: The authors propose a set of design goals for computational AI regulation compliance algorithms and create a benchmark dataset to quantitatively measure whether algorithms satisfy these design goals.

Result: The paper delivers a blueprint for computational AIR compliance algorithms, including design specifications and a benchmark dataset, aiming to shape a new research domain.

Conclusion: Computational compliance is essential for AI systems to realistically meet regulations, and the proposed framework provides necessary foundations to guide research and investment in this emerging field.

Abstract: The era of AI regulation (AIR) is upon us. But AI systems, we argue, will not be able to comply with these regulations at the necessary speed and scale by continuing to rely on traditional, analogue methods of compliance. Instead, we posit that compliance with these regulations will only realistically be achieved computationally: that is, with algorithms that run across the life cycle of an AI system, automatically steering it toward AIR compliance in the face of dynamic conditions. Yet despite their (we would argue) inevitability, the research community has yet to specify exactly how these algorithms for computational AIR compliance should behave - or how we should benchmark their performance. To fill these gaps, we specify a set of design goals for such algorithms. In addition, we specify a benchmark dataset that can be used to quantitatively measure whether individual algorithms satisfy these design goals. By delivering this blueprint, we hope to give shape to an important but uncrystallized new domain of research - and, in doing so, incite necessary investment in it.

[306] GUITester: Enabling GUI Agents for Exploratory Defect Discovery

Yifei Gao, Jiang Wu, Xiaoyi Chen, Yifan Yang, Zhe Cui, Tianyi Ma, Jiaming Zhang, Jitao Sang

Main category: cs.AI

TL;DR: MLLM-based GUI testing framework that addresses goal-oriented masking and execution-bias attribution challenges to autonomously discover software defects through decoupled navigation and verification.

Details

Motivation: Exploratory GUI testing is crucial for software quality but suffers from high manual costs. Current MLLM agents excel at navigation but fail to autonomously discover defects due to two key challenges: goal-oriented masking (prioritizing task completion over reporting anomalies) and execution-bias attribution (misidentifying system defects as agent errors).

Method: Proposes GUITester, a multi-agent framework with two modules: 1) Planning-Execution Module (PEM) that proactively probes for defects via embedded testing intents, and 2) Hierarchical Reflection Module (HRM) that resolves attribution ambiguity through interaction history analysis. Also introduces GUITestBench, the first interactive benchmark with 143 tasks across 26 defects.

Result: GUITester achieves an F1-score of 48.90% (Pass@3) on GUITestBench, significantly outperforming state-of-the-art baselines (33.35%). Demonstrates the feasibility of autonomous exploratory testing for GUI quality assurance.

Conclusion: The work successfully addresses core challenges in autonomous GUI testing, provides a robust framework for defect discovery, and establishes a foundation for future GUI quality assurance research through the GUITestBench benchmark.

Abstract: Exploratory GUI testing is essential for software quality but suffers from high manual costs. While Multi-modal Large Language Model (MLLM) agents excel in navigation, they fail to autonomously discover defects due to two core challenges: \textit{Goal-Oriented Masking}, where agents prioritize task completion over reporting anomalies, and \textit{Execution-Bias Attribution}, where system defects are misidentified as agent errors. To address these, we first introduce \textbf{GUITestBench}, the first interactive benchmark for this task, featuring 143 tasks across 26 defects. We then propose \textbf{GUITester}, a multi-agent framework that decouples navigation from verification via two modules: (i) a \textit{Planning-Execution Module (PEM)} that proactively probes for defects via embedded testing intents, and (ii) a \textit{Hierarchical Reflection Module (HRM)} that resolves attribution ambiguity through interaction history analysis. GUITester achieves an F1-score of 48.90% (Pass@3) on GUITestBench, outperforming state-of-the-art baselines (33.35%). Our work demonstrates the feasibility of autonomous exploratory testing and provides a robust foundation for future GUI quality assurance~\footnote{Our code is now available in~\href{https://github.com/ADaM-BJTU/GUITestBench}{https://github.com/ADaM-BJTU/GUITestBench}}.

[307] Specific Emitter Identification via Active Learning

Jingyi Wang, Fanggang Wang

Main category: cs.AI

TL;DR: Proposes an active learning-enhanced SEI approach with three-stage semi-supervised training to reduce labeling costs while maintaining high recognition accuracy.

Details

Motivation: SEI is important for communication security but requires large-scale labeled data that is costly and time-consuming to obtain. Need to address the challenge of limited labeled data availability.

Method: Three-stage semi-supervised training: 1) Self-supervised contrastive learning with dynamic dictionary update for robust feature extraction from unlabeled data. 2) Supervised training with joint contrastive and cross-entropy losses on small labeled dataset. 3) Active learning module selecting valuable samples based on uncertainty and representativeness criteria for annotation.

Result: Outperforms conventional supervised and semi-supervised methods under limited annotation conditions on ADS-B and WiFi datasets. Achieves higher recognition accuracy with lower labeling cost.

Conclusion: The proposed active learning-enhanced SEI approach effectively addresses the data labeling challenge in SEI, achieving superior performance with reduced annotation requirements, making it practical for real-world communication security applications.

Abstract: With the rapid growth of wireless communications, specific emitter identification (SEI) is significant for communication security. However, its model training relies heavily on the large-scale labeled data, which are costly and time-consuming to obtain. To address this challenge, we propose an SEI approach enhanced by active learning (AL), which follows a three-stage semi-supervised training scheme. In the first stage, self-supervised contrastive learning is employed with a dynamic dictionary update mechanism to extract robust representations from large amounts of the unlabeled data. In the second stage, supervised training on a small labeled dataset is performed, where the contrastive and cross-entropy losses are jointly optimized to improve the feature separability and strengthen the classification boundaries. In the third stage, an AL module selects the most valuable samples from the unlabeled data for annotation based on the uncertainty and representativeness criteria, further enhancing generalization under limited labeling budgets. Experimental results on the ADS-B and WiFi datasets demonstrate that the proposed SEI approach significantly outperforms the conventional supervised and semi-supervised methods under limited annotation conditions, achieving higher recognition accuracy with lower labeling cost.

[308] CircuitLM: A Multi-Agent LLM-Aided Design Framework for Generating Circuit Schematics from Natural Language Prompts

Khandakar Shakib Al Hasan, Syed Rifat Raiyan, Hasin Mahtab Alvee, Wahid Sadik

Main category: cs.AI

TL;DR: CircuitLM is a multi-agent LLM pipeline that converts natural language circuit descriptions into structured CircuitJSON schematics, addressing LLM hallucination issues through component grounding and electrical constraint validation.

Details

Motivation: Current LLMs struggle with generating accurate circuit schematics from natural language - they hallucinate details, violate electrical constraints, and produce non-machine-readable outputs, making them unreliable for electronics design.

Method: Five-stage pipeline: (1) LLM-based component identification, (2) canonical pinout retrieval from knowledge base, (3) chain-of-thought reasoning by electronics expert agent, (4) JSON schematic synthesis, (5) force-directed SVG visualization. Uses curated component database with 50+ components and Dual-Metric Circuit Validation (DMCV) for evaluation.

Result: Evaluated on 100 diverse embedded-systems prompts across six LLMs. DMCV validation shows high fidelity in microcontroller-centric designs. The system bridges natural language to deployable hardware designs for non-experts.

Conclusion: CircuitLM successfully bridges the gap between natural language input and reliable circuit prototyping by grounding LLM generation in verified component databases and implementing multi-stage validation, enabling non-experts to create accurate circuit designs.

Abstract: Generating accurate circuit schematics from high-level natural language descriptions remains a persistent challenge in electronics design, as large language models (LLMs) frequently hallucinate in granular details, violate electrical constraints, and produce non-machine-readable outputs. We present CircuitLM, a novel multi-agent LLM-aided circuit design pipeline that translates user prompts into structured, visually interpretable CircuitJSON schematics through five sequential stages: (i) LLM-based component identification, (ii) canonical pinout retrieval, (iii) chain-of-thought reasoning by an electronics expert agent, (iv) JSON schematic synthesis, and (v) force-directed SVG visualization. Anchored by a curated, embedding-powered component knowledge base. While LLMs often violate electrical constraints, CircuitLM bridges this gap by grounding generation in a verified and dynamically extensible component database, initially comprising 50 components. To ensure safety, we incorporate a hybrid evaluation framework, namely Dual-Metric Circuit Validation (DMCV), validated against human-expert assessments, which achieves high fidelity in microcontroller-centric designs. We evaluate the system on 100 diverse embedded-systems prompts across six LLMs and introduce DMCV to assess both structural and electrical validity. This work bridges natural language input to deployable hardware designs, enabling reliable circuit prototyping by non-experts. Our code and data will be made public upon acceptance.

[309] A General Neural Backbone for Mixed-Integer Linear Optimization via Dual Attention

Peixin Huang, Yaoxin Wu, Yining Ma, Cathy Wu, Wen Song, Wei Zhang

Main category: cs.AI

TL;DR: The paper presents an attention-based neural architecture for MILP that overcomes limitations of GNN approaches by using dual-attention mechanisms for global information exchange across variables and constraints.

Details

Motivation: Current GNN-based approaches for MILP are limited by local-oriented mechanisms, restricting representation power and hindering neural approaches for combinatorial optimization problems.

Method: Proposes an attention-driven neural architecture with dual-attention mechanism that performs parallel self- and cross-attention over variables and constraints, enabling global information exchange and deeper representation learning.

Result: Extensive experiments across widely used benchmarks show consistent improvements over state-of-the-art baselines on various downstream tasks at instance, element, and solving state levels.

Conclusion: Attention-based neural architectures serve as a powerful foundation for learning-enhanced mixed-integer linear optimization, overcoming limitations of pure graph-based approaches.

Abstract: Mixed-integer linear programming (MILP), a widely used modeling framework for combinatorial optimization, are central to many scientific and engineering applications, yet remains computationally challenging at scale. Recent advances in deep learning address this challenge by representing MILP instances as variable-constraint bipartite graphs and applying graph neural networks (GNNs) to extract latent structural patterns and enhance solver efficiency. However, this architecture is inherently limited by the local-oriented mechanism, leading to restricted representation power and hindering neural approaches for MILP. Here we present an attention-driven neural architecture that learns expressive representations beyond the pure graph view. A dual-attention mechanism is designed to perform parallel self- and cross-attention over variables and constraints, enabling global information exchange and deeper representation learning. We apply this general backbone to various downstream tasks at the instance level, element level, and solving state level. Extensive experiments across widely used benchmarks show consistent improvements of our approach over state-of-the-art baselines, highlighting attention-based neural architectures as a powerful foundation for learning-enhanced mixed-integer linear optimization.

[310] Integrating Distribution Matching into Semi-Supervised Contrastive Learning for Labeled and Unlabeled Data

Shogo Nakayama, Masahiro Okuda

Main category: cs.AI

TL;DR: This paper proposes enhancing pseudo-label-based semi-supervised learning by incorporating distribution matching between labeled and unlabeled feature embeddings to improve image classification accuracy.

Details

Motivation: Labeling data is costly, making unsupervised learning methods like contrastive learning attractive. However, real-world scenarios rarely have fully unlabeled datasets - instead, small labeled data coexists with large unlabeled data, making semi-supervised learning highly relevant. Existing pseudo-label-based SSL approaches can be improved.

Method: The study enhances pseudo-label-based SSL by incorporating distribution matching between labeled and unlabeled feature embeddings. This approach aims to better align the feature representations of labeled and unlabeled data to improve classification performance.

Result: The paper aims to demonstrate improved image classification accuracy across multiple datasets through this enhanced SSL approach with distribution matching.

Conclusion: By combining pseudo-label assignment with distribution matching of feature embeddings, this approach provides a more effective semi-supervised learning method for scenarios with limited labeled data and abundant unlabeled data.

Abstract: The advancement of deep learning has greatly improved supervised image classification. However, labeling data is costly, prompting research into unsupervised learning methods such as contrastive learning. In real-world scenarios, fully unlabeled datasets are rare, making semi-supervised learning (SSL) highly relevant in scenarios where a small amount of labeled data coexists with a large volume of unlabeled data. A well-known semi-supervised contrastive learning approach involves assigning pseudo-labels to unlabeled data. This study aims to enhance pseudo-label-based SSL by incorporating distribution matching between labeled and unlabeled feature embeddings to improve image classification accuracy across multiple datasets.

[311] BioPIE: A Biomedical Protocol Information Extraction Dataset for High-Reasoning-Complexity Experiment Question Answer

Haofei Hou, Shunyi Zhao, Fanxu Meng, Kairui Yang, Lecheng Ruan, Qining Wang

Main category: cs.AI

TL;DR: BioPIE dataset provides fine-grained procedure-centric knowledge graphs for biomedical experimental QA, addressing challenges of high information density and multi-step reasoning.

Details

Motivation: Existing biomedical QA datasets lack fine-grained experimental knowledge needed for high information density and multi-step reasoning tasks, which are crucial for biomedical experiment automation and cross-disciplinary communication.

Method: Created Biomedical Protocol Information Extraction Dataset (BioPIE) with procedure-centric knowledge graphs containing experimental entities, actions, and relations at scale to support reasoning across protocols.

Result: BioPIE enables performance gains on test, HID, and MSR question sets, demonstrating that structured experimental knowledge supports both AI-assisted and autonomous biomedical experimentation.

Conclusion: BioPIE addresses the gap in fine-grained biomedical experimental knowledge representation and enables improved QA systems for biomedical experiments with complex reasoning requirements.

Abstract: Question Answer (QA) systems for biomedical experiments facilitate cross-disciplinary communication, and serve as a foundation for downstream tasks, e.g., laboratory automation. High Information Density (HID) and Multi-Step Reasoning (MSR) pose unique challenges for biomedical experimental QA. While extracting structured knowledge, e.g., Knowledge Graphs (KGs), can substantially benefit biomedical experimental QA. Existing biomedical datasets focus on general or coarsegrained knowledge and thus fail to support the fine-grained experimental reasoning demanded by HID and MSR. To address this gap, we introduce Biomedical Protocol Information Extraction Dataset (BioPIE), a dataset that provides procedure-centric KGs of experimental entities, actions, and relations at a scale that supports reasoning over biomedical experiments across protocols. We evaluate information extraction methods on BioPIE, and implement a QA system that leverages BioPIE, showcasing performance gains on test, HID, and MSR question sets, showing that the structured experimental knowledge in BioPIE underpins both AI-assisted and more autonomous biomedical experimentation.

[312] TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration

Jiuzhou Zhao, Chunrong Chen, Chenqi Qiao, Lebin Zheng, Minqi Han, Yanchi Liu Yongzhou Xu Xiaochuan Xu Min Zhang

Main category: cs.AI

TL;DR: TCAR is an adaptive reasoning router for multi-agent systems that uses natural language reasoning chains to dynamically select candidate agents and employs collaborative execution with a refining agent to produce high-quality responses.

Details

Motivation: Existing task routing approaches in multi-agent systems rely on static single-label decisions, which have two major limitations: difficulty integrating new agents as business domains expand, and routing conflicts caused by overlapping agent capabilities, degrading accuracy and robustness.

Method: TCAR generates natural-language reasoning chains before predicting candidate agents, supports dynamic agent onboarding, and uses a collaborative execution pipeline where selected agents independently produce responses that are aggregated and refined by a dedicated Refining Agent.

Result: Experiments on public datasets and real enterprise data show TCAR significantly improves routing accuracy, reduces routing conflicts, and remains robust in ambiguous scenarios.

Conclusion: TCAR addresses limitations of traditional routing approaches by providing an adaptive, explainable, and collaborative multi-agent routing solution that supports dynamic agent integration and handles capability overlaps effectively.

Abstract: Multi-Agent Systems(MAS) have become a powerful paradigm for building high performance intelligent applications. Within these systems, the router responsible for determining which expert agents should handle a given query plays a crucial role in overall performance. Existing routing strategies generally fall into two categories: performance routing, which balances latency and cost across models of different sizes, and task routing, which assigns queries to domain-specific experts to improve accuracy. In real-world enterprise applications, task routing is more suitable; however, most existing approaches rely on static single-label decisions, which introduce two major limitations: (i) difficulty in seamlessly integrating new agents as business domains expand, and (ii) routing conflicts caused by overlapping agent capabilities, ultimately degrading accuracy and robustness.To address these challenges, we propose TCAndon-Router(TCAR): an adaptive reasoning router for multi-agent collaboration. Unlike traditional routers, TCAR supports dynamic agent onboarding and first generates a natural-language reasoning chain before predicting a set of candidate agents capable of handling the query. In addition, we design a collaborative execution pipeline in which selected agents independently produce responses, which are then aggregated and refined into a single high-quality response by a dedicated Refining Agent.Experiments on public datasets and real enterprise data demonstrate that TCAR significantly improves routing accuracy, reduces routing conflicts, and remains robust in ambiguous scenarios. We have released TCAR at https://huggingface.co/tencent/TCAndon-Router to support future research on explainable and collaborative multi-agent routing.

[313] Personalized Model-Based Design of Human Centric AI enabled CPS for Long term usage

Bernard Ngabonziza, Ayan Banerjee, Sandeep K. S. Gupta

Main category: cs.AI

TL;DR: The paper analyzes limitations of existing safety/security testing methods for AI-enabled human-centric systems in long-term operation, proposing personalized model-based solutions to address corner cases and uncertainties.

Details

Motivation: AI-enabled human-centric systems (medical monitoring, autonomous vehicles, etc.) operate long-term but face corner cases leading to safety/security violations due to design flaws, limited testing, computational constraints, or unknown human interactions.

Method: Analyzes existing safety/sustainability/security analysis techniques for AI-enabled human-centric control systems, identifies their limitations for long-term testing, and proposes personalized model-based solutions to eliminate these limitations.

Result: Existing techniques have limitations for testing AI systems in long-term operation due to corner cases and uncertainties; personalized model-based approaches are needed to address these challenges.

Conclusion: Personalized model-based solutions are necessary to overcome limitations of current testing methods and ensure safety, sustainability, and security of AI-enabled human-centric systems during long-term operation.

Abstract: Human centric critical systems are increasingly involving artificial intelligence to enable knowledge extraction from sensor collected data. Examples include medical monitoring and control systems, gesture based human computer interaction systems, and autonomous cars. Such systems are intended to operate for a long term potentially for a lifetime in many scenarios such as closed loop blood glucose control for Type 1 diabetics, self-driving cars, and monitoting systems for stroke diagnosis, and rehabilitation. Long term operation of such AI enabled human centric applications can expose them to corner cases for which their operation is may be uncertain. This can be due to many reasons such as inherent flaws in the design, limited resources for testing, inherent computational limitations of the testing methodology, or unknown use cases resulting from human interaction with the system. Such untested corner cases or cases for which the system performance is uncertain can lead to violations in the safety, sustainability, and security requirements of the system. In this paper, we analyze the existing techniques for safety, sustainability, and security analysis of an AI enabled human centric control system and discuss their limitations for testing the system for long term use in practice. We then propose personalized model based solutions for potentially eliminating such limitations.

[314] Reasoning Over Space: Enabling Geographic Reasoning for LLM-Based Generative Next POI Recommendation

Dongyi Lv, Qiuyu Ding, Heng-Da Xu, Zhaoxu Sun, Zhi Wang, Feng Xiong, Mu Xu

Main category: cs.AI

TL;DR: ROS is a framework that integrates geographic reasoning into LLM-based recommendation systems using hierarchical spatial tokens and a three-stage mobility chain-of-thought approach, achieving significant performance improvements.

Details

Motivation: Existing LLM-based recommenders fail to effectively leverage geographic signals, which are crucial for mobility and local-services scenarios where location plays a vital role in user decision-making.

Method: Introduces Hierarchical Spatial Semantic ID (SID) to discretize locality and POI semantics into compositional tokens, implements three-stage Mobility Chain-of-Thought paradigm (user personality modeling, intent-aligned candidate space construction, locality-informed pruning), and uses spatial-guided Reinforcement Learning for real-world geography alignment.

Result: Achieves over 10% relative gains in hit rate compared to strongest LLM-based baselines on three LBSN datasets, improves cross-city transfer performance, and demonstrates effectiveness despite using a smaller backbone model.

Conclusion: ROS successfully integrates geographic reasoning into LLM-based recommendation systems, demonstrating that explicit modeling of spatial relationships significantly enhances recommendation performance in location-based scenarios.

Abstract: Generative recommendation with large language models (LLMs) reframes prediction as sequence generation, yet existing LLM-based recommenders remain limited in leveraging geographic signals that are crucial in mobility and local-services scenarios. Here, we present Reasoning Over Space (ROS), a framework that utilizes geography as a vital decision variable within the reasoning process. ROS introduces a Hierarchical Spatial Semantic ID (SID) that discretizes coarse-to-fine locality and POI semantics into compositional tokens, and endows LLM with a three-stage Mobility Chain-of-Thought (CoT) paradigm that models user personality, constructs an intent-aligned candidate space, and performs locality informed pruning. We further align the model with real world geography via spatial-guided Reinforcement Learning (RL). Experiments on three widely used location-based social network (LBSN) datasets show that ROS achieves over 10% relative gains in hit rate over strongest LLM-based baselines and improves cross-city transfer, despite using a smaller backbone model.

[315] BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Yunhao Feng, Yige Li, Yutao Wu, Yingshui Tan, Yanming Guo, Yifan Ding, Kun Zhai, Xingjun Ma, Yugang Jiang

Main category: cs.AI

TL;DR: BackdoorAgent is a framework for analyzing backdoor threats in LLM agents across planning, memory, and tool-use stages, showing triggers can persist through multiple workflow steps.

Details

Motivation: Existing studies on backdoor threats in LLM agents are fragmented and analyze individual attack vectors in isolation, lacking understanding of cross-stage interaction and propagation of backdoor triggers from an agent-centric perspective.

Method: Proposed BackdoorAgent framework structures attacks into three functional stages (planning, memory, tool-use), instruments agent execution for systematic analysis, and creates a benchmark across four agent applications (Agent QA, Agent Code, Agent Web, Agent Drive) in language-only and multimodal settings.

Result: Triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states. With GPT-based backbone: 43.58% persistence in planning attacks, 77.97% in memory attacks, and 60.28% in tool-stage attacks, highlighting workflow vulnerabilities.

Conclusion: The agentic workflow itself is vulnerable to backdoor threats, with triggers capable of persisting across multiple stages. The BackdoorAgent framework provides a unified view for systematic analysis, and the benchmark enables reproducibility and future research.

Abstract: Large language model (LLM) agents execute tasks through multi-step workflows that combine planning, memory, and tool use. While this design enables autonomy, it also expands the attack surface for backdoor threats. Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs. However, existing studies remain fragmented and typically analyze individual attack vectors in isolation, leaving the cross-stage interaction and propagation of backdoor triggers poorly understood from an agent-centric perspective. To fill this gap, we propose \textbf{BackdoorAgent}, a modular and stage-aware framework that provides a unified, agent-centric view of backdoor threats in LLM agents. BackdoorAgent structures the attack surface into three functional stages of agentic workflows, including \textbf{planning attacks}, \textbf{memory attacks}, and \textbf{tool-use attacks}, and instruments agent execution to enable systematic analysis of trigger activation and propagation across different stages. Building on this framework, we construct a standardized benchmark spanning four representative agent applications: \textbf{Agent QA}, \textbf{Agent Code}, \textbf{Agent Web}, and \textbf{Agent Drive}, covering both language-only and multimodal settings. Our empirical analysis shows that \textit{triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states.} For instance, when using a GPT-based backbone, we observe trigger persistence in 43.58% of planning attacks, 77.97% of memory attacks, and 60.28% of tool-stage attacks, highlighting the vulnerabilities of the agentic workflow itself to backdoor threats. To facilitate reproducibility and future research, our code and benchmark are publicly available at GitHub.

[316] Neurosymbolic Retrievers for Retrieval-augmented Generation

Yash Saxena, Manas Gaur

Main category: cs.AI

TL;DR: Neurosymbolic RAG integrates symbolic reasoning with neural retrieval to improve transparency and interpretability in retrieval-augmented generation systems.

Details

Motivation: Traditional RAG systems have opaque internal reasoning processes across retriever, re-ranker, and generator components, which complicates interpretability, hinders debugging, and erodes trust in high-stakes domains where clear decision-making is essential.

Method: Three neurosymbolic methods: 1) MAR (Knowledge Modulation Aligned Retrieval) uses modulation networks to refine query embeddings with interpretable symbolic features; 2) KG-Path RAG enhances queries by traversing knowledge graphs; 3) Process Knowledge-infused RAG uses domain-specific tools to reorder retrieved content based on validated workflows.

Result: Preliminary results from mental health risk assessment tasks indicate that the neurosymbolic approach enhances both transparency and overall performance.

Conclusion: Neurosymbolic RAG successfully addresses transparency issues in traditional RAG systems by integrating symbolic reasoning with neural retrieval, providing clearer interpretability while maintaining or improving performance.

Abstract: Retrieval Augmented Generation (RAG) has made significant strides in overcoming key limitations of large language models, such as hallucination, lack of contextual grounding, and issues with transparency. However, traditional RAG systems consist of three interconnected neural components - the retriever, re-ranker, and generator - whose internal reasoning processes remain opaque. This lack of transparency complicates interpretability, hinders debugging efforts, and erodes trust, especially in high-stakes domains where clear decision-making is essential. To address these challenges, we introduce the concept of Neurosymbolic RAG, which integrates symbolic reasoning using a knowledge graph with neural retrieval techniques. This new framework aims to answer two primary questions: (a) Can retrievers provide a clear and interpretable basis for document selection? (b) Can symbolic knowledge enhance the clarity of the retrieval process? We propose three methods to improve this integration. First is MAR (Knowledge Modulation Aligned Retrieval) that employs modulation networks to refine query embeddings using interpretable symbolic features, thereby making document matching more explicit. Second, KG-Path RAG enhances queries by traversing knowledge graphs to improve overall retrieval quality and interpretability. Lastly, Process Knowledge-infused RAG utilizes domain-specific tools to reorder retrieved content based on validated workflows. Preliminary results from mental health risk assessment tasks indicate that this neurosymbolic approach enhances both transparency and overall performance

[317] Scaling Behavior Cloning Improves Causal Reasoning: An Open Model for Real-Time Video Game Playing

Yuguang Yue, Irakli Salia, Samuel Hunt, Chris Green, Wenzhe Shi, Jonathan J Hunt

Main category: cs.AI

TL;DR: The paper introduces an open-source video game playing foundation model trained on 8300+ hours of human gameplay, showing competitive human-level performance and studying how model/data scaling affects causal reasoning in behavior cloning.

Details

Motivation: Behavior cloning is gaining popularity as scaling models and data provides strong baselines for many tasks. The authors aim to create an open foundation model for realtime video game playing and systematically study how scaling affects causal reasoning in behavior cloning.

Method: The authors develop an open recipe for training video game playing foundation models, releasing all data (8300+ hours of human gameplay), training/inference code, and pretrained checkpoints. They systematically examine scaling laws by varying model size (up to 1.2B parameters) and training data, studying how performance and causal reasoning scale.

Result: The best model achieves competitive human-level performance across various 3D video games. Scaling experiments show that increasing both training data and network depth improves the model’s ability to learn causal policies. Similar scaling patterns for causal reasoning are observed in both toy problems and large-scale models up to 1.2B parameters.

Conclusion: Behavior cloning can produce competitive video game playing agents when scaled appropriately. Both model size and data quantity are crucial for learning causal reasoning, with systematic scaling laws showing consistent improvements in causal policy learning as both dimensions increase.

Abstract: Behavior cloning is enjoying a resurgence in popularity as scaling both model and data sizes proves to provide a strong starting point for many tasks of interest. In this work, we introduce an open recipe for training a video game playing foundation model designed for inference in realtime on a consumer GPU. We release all data (8300+ hours of high quality human gameplay), training and inference code, and pretrained checkpoints under an open license. We show that our best model is capable of playing a variety of 3D video games at a level competitive with human play. We use this recipe to systematically examine the scaling laws of behavior cloning to understand how the model’s performance and causal reasoning varies with model and data scale. We first show in a simple toy problem that, for some types of causal reasoning, increasing both the amount of training data and the depth of the network results in the model learning a more causal policy. We then systematically study how causality varies with the number of parameters (and depth) and training steps in scaled models of up to 1.2 billion parameters, and we find similar scaling results to what we observe in the toy problem.

[318] Sci-Reasoning: A Dataset Decoding AI Innovation Patterns

Jiachen Liu, Maestro Harmon, Zechen Zhang

Main category: cs.AI

TL;DR: Sci-Reasoning is the first dataset capturing the intellectual synthesis behind high-quality AI research, identifying 15 distinct thinking patterns with three dominant strategies accounting for 52.7% of innovation.

Details

Motivation: While AI innovation accelerates rapidly, the intellectual process behind breakthroughs remains poorly understood. The lack of structured data on scientific reasoning hinders systematic analysis and development of AI research agents.

Method: Using community-validated quality signals and an LLM-accelerated, human-verified pipeline, the authors trace Oral and Spotlight papers across NeurIPS, ICML, and ICLR (2023-2025) to their key predecessors, articulating specific reasoning links in a structured format.

Result: Identified 15 distinct thinking patterns, with three dominant strategies: Gap-Driven Reframing (24.2%), Cross-Domain Synthesis (18.0%), and Representation Shift (10.5%). The most powerful innovation recipes combine multiple patterns.

Conclusion: This dataset enables quantitative studies of scientific progress and provides structured reasoning trajectories for training the next generation of AI research agents.

Abstract: While AI innovation accelerates rapidly, the intellectual process behind breakthroughs – how researchers identify gaps, synthesize prior work, and generate insights – remains poorly understood. The lack of structured data on scientific reasoning hinders systematic analysis and development of AI research agents. We introduce Sci-Reasoning, the first dataset capturing the intellectual synthesis behind high-quality AI research. Using community-validated quality signals and an LLM-accelerated, human-verified pipeline, we trace Oral and Spotlight papers across NeurIPS, ICML, and ICLR (2023-2025) to its key predecessors, articulating specific reasoning links in a structured format. Our analysis identifies 15 distinct thinking patterns, with three dominant strategies accounting for 52.7%: Gap-Driven Reframing (24.2%), Cross-Domain Synthesis (18.0%), and Representation Shift (10.5%). The most powerful innovation recipes combine multiple patterns: Gap-Driven Reframing + Representation Shift, Cross-Domain Synthesis + Representation Shift, and Gap-Driven Reframing + Cross-Domain Synthesis. This dataset enables quantitative studies of scientific progress and provides structured reasoning trajectories for training the next generation AI research agents.

[319] Evaluating Human and Machine Confidence in Phishing Email Detection: A Comparative Study

Paras Jain, Khushi Dhar, Olyemi E. Amujo, Esa M. Rantanen

Main category: cs.AI

TL;DR: The paper examines how human cognition and machine learning models work together to detect phishing emails, finding that while ML models achieve good accuracy, humans use more diverse linguistic cues and maintain more consistent confidence levels.

Details

Motivation: The research aims to understand how human cognitive processes (pattern recognition, confidence assessment, contextual analysis) and machine learning models can collaborate to identify deceptive content like phishing emails, with implications for creating transparent AI systems that complement human cognition.

Method: Used three interpretable algorithms (Logistic Regression, Decision Trees, Random Forests) trained on both TF-IDF features and semantic embeddings. Compared ML predictions against human evaluations that captured confidence ratings and linguistic observations.

Result: ML models provide good accuracy but with varying confidence levels. Human evaluators use more diverse language signs and maintain more consistent confidence. Language proficiency has minimal effect on detection performance, but aging does affect it.

Conclusion: The findings offer guidance for creating transparent AI systems that complement human cognitive functions, ultimately improving human-AI cooperation in challenging content analysis tasks like phishing detection.

Abstract: Identifying deceptive content like phishing emails demands sophisticated cognitive processes that combine pattern recognition, confidence assessment, and contextual analysis. This research examines how human cognition and machine learning models work together to distinguish phishing emails from legitimate ones. We employed three interpretable algorithms Logistic Regression, Decision Trees, and Random Forests training them on both TF-IDF features and semantic embeddings, then compared their predictions against human evaluations that captured confidence ratings and linguistic observations. Our results show that machine learning models provide good accuracy rates, but their confidence levels vary significantly. Human evaluators, on the other hand, use a greater variety of language signs and retain more consistent confidence. We also found that while language proficiency has minimal effect on detection performance, aging does. These findings offer helpful direction for creating transparent AI systems that complement human cognitive functions, ultimately improving human-AI cooperation in challenging content analysis tasks.

[320] AgentDevel: Reframing Self-Evolving LLM Agents as Release Engineering

Di Zhang

Main category: cs.AI

TL;DR: AgentDevel is a release engineering pipeline for LLM agents that treats agents as shippable artifacts, externalizes improvement into a regression-aware pipeline, and emphasizes non-regression as a primary objective.

Details

Motivation: Current LLM agent improvement approaches (self-improvement mechanisms or concurrent variant search) yield unstable, hard-to-audit trajectories, making it difficult to guarantee non-regression or reason about failures across versions.

Method: AgentDevel pipeline: 1) runs current agent, 2) produces implementation-blind, symptom-level quality signals from execution traces, 3) synthesizes a single release candidate via executable diagnosis, and 4) promotes it under flip-centered gating. Core designs: implementation-blind LLM critic, script-based executable diagnosis, and flip-centered gating.

Result: Experiments on execution-heavy benchmarks show AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts.

Conclusion: AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents as software development, maintaining a single canonical version line with emphasis on non-regression.

Abstract: Recent progress in large language model (LLM) agents has largely focused on embedding self-improvement mechanisms inside the agent or searching over many concurrent variants. While these approaches can raise aggregate scores, they often yield unstable and hard-to-audit improvement trajectories, making it difficult to guarantee non-regression or to reason about failures across versions. We reframe agent improvement as \textbf{release engineering}: agents are treated as shippable artifacts, and improvement is externalized into a regression-aware release pipeline. We introduce \textbf{AgentDevel}, a release engineering pipeline that iteratively runs the current agent, produces implementation-blind, symptom-level quality signals from execution traces, synthesizes a single release candidate (RC) via executable diagnosis, and promotes it under flip-centered gating. AgentDevel features three core designs: (i) an implementation-blind LLM critic that characterizes failure appearances without accessing agent internals, (ii) script-based executable diagnosis that aggregates dominant symptom patterns and produces auditable engineering specifications, and (iii) flip-centered gating that prioritizes pass to fail regressions and fail to pass fixes as first-class evidence. Unlike population-based search or in-agent self-refinement, AgentDevel maintains a single canonical version line and emphasizes non-regression as a primary objective. Experiments on execution-heavy benchmarks demonstrate that AgentDevel yields stable improvements with significantly fewer regressions while producing reproducible, auditable artifacts. Overall, AgentDevel provides a practical development discipline for building, debugging, and releasing LLM agents as software development.

[321] Beyond the “Truth”: Investigating Election Rumors on Truth Social During the 2024 Election

Etienne Casanova, R. Michael Alvarez

Main category: cs.AI

TL;DR: LLMs enable large-scale psychological measurement of rumor propagation, showing dose-response belief reinforcement and rapid contagion in alt-tech platforms.

Details

Motivation: To demonstrate how large language models can transform psychological science by enabling rigorous measurement of belief dynamics and misinformation spread in massive, real-world datasets, particularly focusing on the "illusory truth effect" in naturalistic settings.

Method: Developed a multistage Rumor Detection Agent combining: (1) synthetic data-augmented, fine-tuned RoBERTa classifier, (2) precision keyword filtering, and (3) a two-pass LLM verification pipeline using GPT-4o mini. Applied this to compile the first large-scale dataset of election rumors on a niche alt-tech platform.

Result: Sharing probability rises steadily with each additional exposure (dose-response belief reinforcement). Simulation shows rapid contagion: nearly 25% of users become “infected” within four propagation iterations. Provides large-scale empirical evidence for psychological dynamics in ideologically homogeneous networks.

Conclusion: LLMs offer unprecedented opportunities for analyzing social phenomena at scale and can transform psychological science by enabling rigorous measurement of belief dynamics and misinformation spread in massive, real-world datasets.

Abstract: Large language models (LLMs) offer unprecedented opportunities for analyzing social phenomena at scale. This paper demonstrates the value of LLMs in psychological measurement by (1) compiling the first large-scale dataset of election rumors on a niche alt-tech platform, (2) developing a multistage Rumor Detection Agent that leverages LLMs for high-precision content classification, and (3) quantifying the psychological dynamics of rumor propagation, specifically the “illusory truth effect” in a naturalistic setting. The Rumor Detection Agent combines (i) a synthetic data-augmented, fine-tuned RoBERTa classifier, (ii) precision keyword filtering, and (iii) a two-pass LLM verification pipeline using GPT-4o mini. The findings reveal that sharing probability rises steadily with each additional exposure, providing large-scale empirical evidence for dose-response belief reinforcement in ideologically homogeneous networks. Simulation results further demonstrate rapid contagion effects: nearly one quarter of users become “infected” within just four propagation iterations. Taken together, these results illustrate how LLMs can transform psychological science by enabling the rigorous measurement of belief dynamics and misinformation spread in massive, real-world datasets.

[322] Vibe Coding an LLM-powered Theorem Prover

Zhe Hou

Main category: cs.AI

TL;DR: Isabellm is an LLM-powered theorem prover for Isabelle/HOL that combines stepwise proof search with higher-level planning, demonstrating practical value by proving lemmas that defeat standard automation, but revealing challenges in LLM code generation.

Details

Motivation: To create an accessible, fully automatic theorem prover for Isabelle/HOL that leverages LLMs to enhance proof synthesis capabilities beyond what standard automation tools like Sledgehammer can achieve, while being deployable on consumer-grade hardware.

Method: Combines stepwise prover (LLM-proposed proof commands validated by Isabelle in bounded search) with higher-level proof planner (generates structured Isar outlines and attempts fill-and-repair). Includes beam search for tactics, ML/RL rerankers, premise selection with small transformers, micro-RAG for Isar proofs, and counter-example guided proof repair.

Result: Isabellm can prove certain lemmas that defeat Isabelle’s standard automation including Sledgehammer, demonstrating practical value. However, even state-of-the-art LLMs struggle to reliably implement complex fill-and-repair mechanisms, highlighting fundamental challenges in LLM code generation and reasoning.

Conclusion: LLM-guided proof search shows practical value for theorem proving, but current LLMs face fundamental challenges with complex algorithmic designs, indicating limitations in code generation and reasoning capabilities that need to be addressed for more reliable automated theorem proving.

Abstract: We present Isabellm, an LLM-powered theorem prover for Isabelle/HOL that performs fully automatic proof synthesis. Isabellm works with any local LLM on Ollama and APIs such as Gemini CLI, and it is designed to run on consumer grade computers. The system combines a stepwise prover, which uses large language models to propose proof commands validated by Isabelle in a bounded search loop, with a higher-level proof planner that generates structured Isar outlines and attempts to fill and repair remaining gaps. The framework includes beam search for tactics, tactics reranker ML and RL models, premise selection with small transformer models, micro-RAG for Isar proofs built from AFP, and counter-example guided proof repair. All the code is implemented by GPT 4.1 - 5.2, Gemini 3 Pro, and Claude 4.5. Empirically, Isabellm can prove certain lemmas that defeat Isabelle’s standard automation, including Sledgehammer, demonstrating the practical value of LLM-guided proof search. At the same time, we find that even state-of-the-art LLMs, such as GPT 5.2 Extended Thinking and Gemini 3 Pro struggle to reliably implement the intended fill-and-repair mechanisms with complex algorithmic designs, highlighting fundamental challenges in LLM code generation and reasoning. The code of Isabellm is available at https://github.com/zhehou/llm-isabelle

[323] Know Thy Enemy: Securing LLMs Against Prompt Injection via Diverse Data Synthesis and Instruction-Level Chain-of-Thought Learning

Zhiyuan Chang, Mingyang Li, Yuekai Huang, Ziyou Jiang, Xiaojun Jia, Qian Xiong, Junjie Wang, Zhaoyang Li, Qing Wang

Main category: cs.AI

TL;DR: InstruCoT is a model enhancement method for prompt injection defense that uses instruction-level chain-of-thought fine-tuning to help LLMs identify and reject malicious instructions regardless of source or position.

Details

Motivation: LLM-integrated applications face critical security vulnerabilities from prompt injection attacks, which are difficult to defend against because malicious instructions can come from diverse vectors and lack clear semantic boundaries from surrounding context.

Method: InstruCoT synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning to enable LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context.

Result: Experimental results across four LLMs show that InstruCoT significantly outperforms baselines in all three critical dimensions (Behavior Deviation, Privacy Leakage, and Harmful Output) while maintaining utility performance without degradation.

Conclusion: InstruCoT provides an effective defense against prompt injection attacks by enabling LLMs to better identify malicious instructions through instruction-level chain-of-thought reasoning, addressing key challenges in PI defense.

Abstract: Large language model (LLM)-integrated applications have become increasingly prevalent, yet face critical security vulnerabilities from prompt injection (PI) attacks. Defending against PI attacks faces two major issues: malicious instructions can be injected through diverse vectors, and injected instructions often lack clear semantic boundaries from the surrounding context, making them difficult to identify. To address these issues, we propose InstruCoT, a model enhancement method for PI defense that synthesizes diverse training data and employs instruction-level chain-of-thought fine-tuning, enabling LLMs to effectively identify and reject malicious instructions regardless of their source or position in the context. We evaluate InstruCoT across three critical dimensions: Behavior Deviation, Privacy Leakage, and Harmful Output. Experimental results across four LLMs demonstrate that InstruCoT significantly outperforms baselines in all dimensions while maintaining utility performance without degradation

[324] LLM-Guided Quantified SMT Solving over Uninterpreted Functions

Kunhang Lv, Yuhang Dong, Rui Han, Fuqi Jia, Feifei Ma, Jian Zhang

Main category: cs.AI

TL;DR: AquaForte uses LLMs to provide semantic guidance for quantifier instantiation in SMT solving with uninterpreted functions over non-linear real arithmetic, reducing search space and solving instances where traditional solvers timeout.

Details

Motivation: Traditional quantifier instantiation methods for SMT solving with uninterpreted functions over non-linear real arithmetic lack semantic understanding of UF constraints, forcing them to search through unbounded solution spaces with limited guidance, leading to poor performance.

Method: AquaForte preprocesses formulas through constraint separation, uses structured prompts to extract mathematical reasoning from LLMs to generate instantiated candidates for function definitions, and integrates results with traditional SMT algorithms through adaptive instantiation. It maintains soundness through systematic validation and preserves completeness with fallback to traditional solvers augmented with learned constraints.

Result: Experimental evaluation on SMT-COMP benchmarks shows AquaForte solves numerous instances where state-of-the-art solvers like Z3 and CVC5 timeout, with particular effectiveness on satisfiable formulas.

Conclusion: LLMs can provide valuable mathematical intuition for symbolic reasoning, establishing a new paradigm for SMT constraint solving by combining semantic guidance from LLMs with traditional SMT algorithms.

Abstract: Quantified formulas with Uninterpreted Functions (UFs) over non-linear real arithmetic pose fundamental challenges for Satisfiability Modulo Theories (SMT) solving. Traditional quantifier instantiation methods struggle because they lack semantic understanding of UF constraints, forcing them to search through unbounded solution spaces with limited guidance. We present AquaForte, a framework that leverages Large Language Models to provide semantic guidance for UF instantiation by generating instantiated candidates for function definitions that satisfy the constraints, thereby significantly reducing the search space and complexity for solvers. Our approach preprocesses formulas through constraint separation, uses structured prompts to extract mathematical reasoning from LLMs, and integrates the results with traditional SMT algorithms through adaptive instantiation. AquaForte maintains soundness through systematic validation: LLM-guided instantiations yielding SAT solve the original problem, while UNSAT results generate exclusion clauses for iterative refinement. Completeness is preserved by fallback to traditional solvers augmented with learned constraints. Experimental evaluation on SMT-COMP benchmarks demonstrates that AquaForte solves numerous instances where state-of-the-art solvers like Z3 and CVC5 timeout, with particular effectiveness on satisfiable formulas. Our work shows that LLMs can provide valuable mathematical intuition for symbolic reasoning, establishing a new paradigm for SMT constraint solving.

[325] ResMAS: Resilience Optimization in LLM-based Multi-agent Systems

Zhilun Zhou, Zihan Liu, Jiahe Liu, Qingyu Shao, Yihan Wang, Kun Shao, Depeng Jin, Fengli Xu

Main category: cs.AI

TL;DR: ResMAS: A two-stage framework for enhancing resilience in LLM-based Multi-Agent Systems by automatically designing resilient communication topologies and optimizing prompts based on agent connections.

Details

Motivation: LLM-based Multi-Agent Systems are distributed across devices/environments and vulnerable to perturbations like agent failures. Existing works focus on reactive defense after attacks occur, rather than proactive design of inherently resilient systems.

Method: Two-stage framework: 1) Train reward model to predict MAS resilience, then train topology generator via RL to design resilient topologies for specific tasks. 2) Topology-aware prompt optimization that refines each agent’s prompt based on its connections and interactions.

Result: Extensive experiments show substantial improvement in MAS resilience under various constraints. Framework demonstrates strong generalization ability to new tasks and models.

Conclusion: ResMAS provides a proactive approach to building resilient LLM-based MAS by optimizing both communication topology and prompt design, with potential for practical deployment in distributed agent systems.

Abstract: Large Language Model-based Multi-Agent Systems (LLM-based MAS), where multiple LLM agents collaborate to solve complex tasks, have shown impressive performance in many areas. However, MAS are typically distributed across different devices or environments, making them vulnerable to perturbations such as agent failures. While existing works have studied the adversarial attacks and corresponding defense strategies, they mainly focus on reactively detecting and mitigating attacks after they occur rather than proactively designing inherently resilient systems. In this work, we study the resilience of LLM-based MAS under perturbations and find that both the communication topology and prompt design significantly influence system resilience. Motivated by these findings, we propose ResMAS: a two-stage framework for enhancing MAS resilience. First, we train a reward model to predict the MAS’s resilience, based on which we train a topology generator to automatically design resilient topology for specific tasks through reinforcement learning. Second, we introduce a topology-aware prompt optimization method that refines each agent’s prompt based on its connections and interactions with other agents. Extensive experiments across a range of tasks show that our approach substantially improves MAS resilience under various constraints. Moreover, our framework demonstrates strong generalization ability to new tasks and models, highlighting its potential for building resilient MASs.

[326] Tape: A Cellular Automata Benchmark for Evaluating Rule-Shift Generalization in Reinforcement Learning

Enze Pan

Main category: cs.AI

TL;DR: Tape is a controlled RL benchmark for studying OOD failures under latent rule shifts, using cellular automata with fixed observation/action spaces but changing transition rules. It reveals that strong in-distribution methods can collapse OOD, and establishes standardized evaluation protocols with statistical rigor.

Details

Motivation: To create a controlled environment for studying out-of-distribution (OOD) failures in reinforcement learning, specifically isolating the effects of latent rule shifts while keeping observation and action spaces fixed. Current RL benchmarks often conflate multiple types of distribution shifts, making it difficult to understand OOD generalization failures.

Method: Derived from one-dimensional cellular automata, Tape enables precise train/test splits where only transition rules change. The benchmark includes a reproducible evaluation pipeline comparing model-free baselines, model-based planning with learned world models, and task-inference (meta-RL) methods. The paper also provides standardized OOD protocols and statistical reporting requirements.

Result: A consistent pattern emerges: methods that perform strongly in-distribution (ID) can collapse under heldout-rule OOD evaluation. High-variance OOD evaluation makes rankings unstable unless experiments are sufficiently replicated. The benchmark reveals fundamental limitations of current RL methods when facing rule shifts.

Conclusion: Tape provides a controlled testbed for studying OOD generalization in RL, highlighting the need for standardized evaluation protocols with statistical rigor. The paper establishes information-theoretic identities connecting entropy reduction to conditional mutual information and expected posterior KL divergence, clarifying what “uncertainty reduction” objectives can and cannot guarantee under rule shifts.

Abstract: We present Tape, a controlled reinforcement-learning benchmark designed to isolate out-of-distribution (OOD) failure under latent rule shifts.Tape is derived from one-dimensional cellular automata, enabling precise train/test splits where observation and action spaces are held fixed while transition rules change. Using a reproducible evaluation pipeline, we compare model-free baselines, model-based planning with learned world models, and task-inference (meta-RL) methods. A consistent pattern emerges: methods that are strong in-distribution (ID) can collapse under heldout-rule OOD, and high-variance OOD evaluation can make rankings unstable unless experiments are sufficiently replicated.We provide (i) standardized OOD protocols, (ii) statistical reporting requirements (seeds, confidence intervals, and hypothesis tests), and (iii) information-theoretic identities connecting entropy reduction to conditional mutual information and expected posterior KL divergence, clarifying what “uncertainty reduction” objectives can and cannot guarantee under rule shifts.

[327] A Method for Constructing a Digital Transformation Driving Mechanism Based on Semantic Understanding of Large Models

Huayi Liu

Main category: cs.AI

TL;DR: Combines LLM and knowledge graph for enterprise digital transformation, using BERT for entity recognition, GPT-4 for semantic enhancement, GNN for knowledge graph construction, and reinforcement learning for decision optimization.

Details

Motivation: Addresses insufficient semantic understanding of unstructured data and lack of intelligent decision-making basis in enterprise digital transformation driving mechanisms.

Method: 1) Fine-tuned BERT for entity recognition and relationship extraction from multi-source texts + GPT-4 for semantic vector enhancement; 2) Two-layer GNN architecture to fuse LLM semantic vectors with business metadata for dynamic knowledge graph; 3) Reinforcement learning for decision path optimization with reward-driven iteration.

Result: In manufacturing case: Reduced equipment failure response time from 7.8 to 3.7 hours, achieved 94.3% F1 score, and decreased decision error compensation in annual digital transformation costs by 45.3%.

Conclusion: The integration of large model semantic understanding with structured knowledge significantly enhances intelligence level and execution efficiency of digital transformation driving mechanisms.

Abstract: In the process of digital transformation, enterprises are faced with problems such as insufficient semantic understanding of unstructured data and lack of intelligent decision-making basis in driving mechanisms. This study proposes a method that combines a large language model (LLM) and a knowledge graph. First, a fine-tuned BERT (Bidirectional Encoder Representations from Transformers) model is used to perform entity recognition and relationship extraction on multi-source heterogeneous texts, and GPT-4 is used to generate semantically enhanced vector representations; secondly, a two-layer graph neural network (GNN) architecture is designed to fuse the semantic vectors output by LLM with business metadata to construct a dynamic and scalable enterprise knowledge graph; then reinforcement learning is introduced to optimize decision path generation, and the reward function is used to drive the mechanism iteration. In the case of the manufacturing industry, this mechanism reduced the response time for equipment failure scenarios from 7.8 hours to 3.7 hours, the F1 value reached 94.3%, and the compensation for decision errors in the annual digital transformation cost decreased by 45.3%. This method significantly enhances the intelligence level and execution efficiency of the digital transformation driving mechanism by integrating large model semantic understanding with structured knowledge.

[328] TourPlanner: A Competitive Consensus Framework with Constraint-Gated Reinforcement Learning for Travel Planning

Yinuo Wang, Mining Tan, Wenxiang Jiao, Xiaoxi Li, Hao Wang, Xuanyu Zhang, Yuan Lu, Weiming Dong

Main category: cs.AI

TL;DR: TourPlanner is a travel planning framework that uses multi-path reasoning and constraint-gated reinforcement learning to address challenges in POI selection, solution space exploration, and constraint optimization.

Details

Motivation: Existing travel planning approaches struggle with: (1) pruning candidate POIs while maintaining high recall, (2) limited exploration due to single reasoning paths, and (3) simultaneously optimizing hard and soft constraints.

Method: TourPlanner combines three key components: Personalized Recall and Spatial Optimization (PReSO) for candidate POI selection, Competitive consensus Chain-of-Thought (CCoT) for multi-path reasoning, and a sigmoid-based gating mechanism in reinforcement learning to prioritize hard constraints before soft constraints.

Result: Experimental results on travel planning benchmarks show TourPlanner achieves state-of-the-art performance, significantly outperforming existing methods in both feasibility and user-preference alignment.

Conclusion: TourPlanner provides a comprehensive solution to travel planning challenges through its integrated approach of spatial optimization, multi-path reasoning, and constraint-aware reinforcement learning, demonstrating superior performance in generating feasible and personalized itineraries.

Abstract: Travel planning is a sophisticated decision-making process that requires synthesizing multifaceted information to construct itineraries. However, existing travel planning approaches face several challenges: (1) Pruning candidate points of interest (POIs) while maintaining a high recall rate; (2) A single reasoning path restricts the exploration capability within the feasible solution space for travel planning; (3) Simultaneously optimizing hard constraints and soft constraints remains a significant difficulty. To address these challenges, we propose TourPlanner, a comprehensive framework featuring multi-path reasoning and constraint-gated reinforcement learning. Specifically, we first introduce a Personalized Recall and Spatial Optimization (PReSO) workflow to construct spatially-aware candidate POIs’ set. Subsequently, we propose Competitive consensus Chain-of-Thought (CCoT), a multi-path reasoning paradigm that improves the ability of exploring the feasible solution space. To further refine the plan, we integrate a sigmoid-based gating mechanism into the reinforcement learning stage, which dynamically prioritizes soft-constraint satisfaction only after hard constraints are met. Experimental results on travel planning benchmarks demonstrate that TourPlanner achieves state-of-the-art performance, significantly surpassing existing methods in both feasibility and user-preference alignment.

[329] Beyond Monolithic Architectures: A Multi-Agent Search and Knowledge Optimization Framework for Agentic Search

Yiqun Chen, Lingyong Yan, Zixuan Yang, Erhan Zhang, Jiashu Zhao, Shuaiqiang Wang, Dawei Yin, Jiaxin Mao

Main category: cs.AI

TL;DR: M-ASK is a multi-agent framework that decouples agentic search into specialized roles for search behavior and knowledge management, using turn-level rewards for stable coordination, achieving superior performance on multi-hop QA tasks.

Details

Motivation: Current agentic search systems rely on monolithic agents that suffer from structural bottlenecks: unconstrained reasoning outputs inflate trajectories, sparse outcome-level rewards complicate credit assignment, and stochastic search noise destabilizes learning.

Method: M-ASK decomposes agentic search into two complementary roles: Search Behavior Agents (plan and execute search actions) and Knowledge Management Agents (aggregate, filter, and maintain compact internal context). It employs turn-level rewards to provide granular supervision for both search decisions and knowledge updates.

Result: Experiments on multi-hop QA benchmarks show M-ASK outperforms strong baselines, achieving superior answer accuracy and significantly more stable training dynamics.

Conclusion: Explicit role decomposition with specialized agents for search behavior and knowledge management, combined with turn-level rewards, effectively addresses structural bottlenecks in agentic search systems, leading to improved performance and training stability.

Abstract: Agentic search has emerged as a promising paradigm for complex information seeking by enabling Large Language Models (LLMs) to interleave reasoning with tool use. However, prevailing systems rely on monolithic agents that suffer from structural bottlenecks, including unconstrained reasoning outputs that inflate trajectories, sparse outcome-level rewards that complicate credit assignment, and stochastic search noise that destabilizes learning. To address these challenges, we propose \textbf{M-ASK} (Multi-Agent Search and Knowledge), a framework that explicitly decouples agentic search into two complementary roles: Search Behavior Agents, which plan and execute search actions, and Knowledge Management Agents, which aggregate, filter, and maintain a compact internal context. This decomposition allows each agent to focus on a well-defined subtask and reduces interference between search and context construction. Furthermore, to enable stable coordination, M-ASK employs turn-level rewards to provide granular supervision for both search decisions and knowledge updates. Experiments on multi-hop QA benchmarks demonstrate that M-ASK outperforms strong baselines, achieving not only superior answer accuracy but also significantly more stable training dynamics.\footnote{The source code for M-ASK is available at https://github.com/chenyiqun/M-ASK.}

[330] Bridging Temporal and Textual Modalities: A Multimodal Framework for Automated Cloud Failure Root Cause Analysis

Gijun Park

Main category: cs.AI

TL;DR: A multimodal framework that aligns time-series metrics with language model embeddings for cloud incident root cause analysis, achieving 48.75% diagnostic accuracy.

Details

Motivation: Current LLMs struggle with continuous time-series data due to token-based architecture limitations, hindering their use in cloud incident management despite strong textual reasoning capabilities.

Method: Three technical advances: 1) semantic compression of temporal segments into single-token abstractions, 2) alignment encoder with gated cross-attention to project time-series into language model space, 3) retrieval-augmented diagnostic pipeline combining aligned embeddings with historical incident knowledge.

Result: Achieved leading performance with 48.75% diagnostic accuracy across six cloud system benchmarks, with notable improvements on compound failure mode scenarios.

Conclusion: Embedding-space alignment is an effective strategy for enabling language models to reason over multimodal telemetry data in production incident response contexts.

Abstract: Root cause analysis in modern cloud infrastructure demands sophisticated understanding of heterogeneous data sources, particularly time-series performance metrics that involve core failure signatures. While large language models demonstrate remarkable capabilities in textual reasoning, their discrete token-based architecture creates fundamental incompatibilities with continuous numerical sequences exhibiting temporal dependencies. Current methodologies inadequately address this modality mismatch, constraining the potential of language model-driven automation in incident management workflows. This paper presents a multimodal diagnostic framework that harmonizes time-series representations with pretrained language model embedding spaces. Our approach contributes three technical advances: (1) a semantic compression technique that distills temporal segments into single-token abstractions while preserving pattern semantics, (2) an alignment encoder utilizing gated cross-attention to project time-series features into language model latent space, and (3) a retrieval-augmented diagnostic pipeline that synthesizes aligned embeddings with historical incident knowledge for expert-level failure attribution. Comprehensive evaluation across six cloud system benchmarks demonstrates that our framework achieves leading performance, reaching 48.75% diagnostic accuracy with notable improvements on scenarios involving compound failure modes. The results validate embedding-space alignment as an effective strategy for enabling language models to reason over multimodal telemetry data in production incident response contexts.

[331] ThinkDrive: Chain-of-Thought Guided Progressive Reinforcement Learning Fine-Tuning for Autonomous Driving

Chang Zhao, Zheming Yang, Yunqing Hu, Qi Guo, Zijian Wang, Pengcheng Li, Wen Ji

Main category: cs.AI

TL;DR: ThinkDrive: A Chain-of-Thought guided progressive reinforcement learning framework for autonomous driving that combines explicit reasoning with difficulty-aware adaptive policy optimization to address unstructured reasoning and poor generalization in existing methods.

Details

Motivation: Existing LLM applications in autonomous driving suffer from unstructured reasoning, poor generalization, and misalignment with human driving intent. While Chain-of-Thought reasoning enhances transparency, conventional supervised fine-tuning fails to fully exploit its potential, and reinforcement learning approaches face instability and suboptimal reasoning depth.

Method: Two-stage training strategy: 1) Supervised fine-tuning using CoT explanations, followed by 2) Progressive reinforcement learning with a difficulty-aware adaptive policy optimizer that dynamically adjusts learning intensity based on sample complexity.

Result: ThinkDrive outperforms strong RL baselines by 1.45%, 1.95%, and 1.01% on exam, easy-exam, and accuracy metrics respectively. A 2B-parameter model trained with ThinkDrive surpasses the much larger GPT-4o by 3.28% on the exam metric.

Conclusion: ThinkDrive effectively synergizes explicit reasoning with adaptive policy optimization, demonstrating superior performance over existing methods and showing that smaller models can outperform much larger ones when trained with this approach.

Abstract: With the rapid advancement of large language models (LLMs) technologies, their application in the domain of autonomous driving has become increasingly widespread. However, existing methods suffer from unstructured reasoning, poor generalization, and misalignment with human driving intent. While Chain-of-Thought (CoT) reasoning enhances decision transparency, conventional supervised fine-tuning (SFT) fails to fully exploit its potential, and reinforcement learning (RL) approaches face instability and suboptimal reasoning depth. We propose ThinkDrive, a CoT guided progressive RL fine-tuning framework for autonomous driving that synergizes explicit reasoning with difficulty-aware adaptive policy optimization. Our method employs a two-stage training strategy. First, we perform SFT using CoT explanations. Then, we apply progressive RL with a difficulty-aware adaptive policy optimizer that dynamically adjusts learning intensity based on sample complexity. We evaluate our approach on a public dataset. The results show that ThinkDrive outperforms strong RL baselines by 1.45%, 1.95%, and 1.01% on exam, easy-exam, and accuracy, respectively. Moreover, a 2B-parameter model trained with our method surpasses the much larger GPT-4o by 3.28% on the exam metric.

[332] Memory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning

Yuyang Hu, Jiongnan Liu, Jiejun Tan, Yutao Zhu, Zhicheng Dou

Main category: cs.AI

TL;DR: CompassMem is an event-centric memory framework that organizes experiences into an Event Graph with logical relations, enabling structured navigation for long-horizon reasoning in LLM agents.

Details

Motivation: Current LLM memory systems use flat storage and simple similarity-based retrieval, lacking explicit logical relationship capture and structured reasoning over long-horizon dependencies.

Method: Inspired by Event Segmentation Theory, CompassMem incrementally segments experiences into events and links them through explicit logical relations to form an Event Graph, enabling goal-directed navigation.

Result: Experiments on LoCoMo and NarrativeQA show CompassMem consistently improves both retrieval and reasoning performance across multiple backbone models.

Conclusion: CompassMem’s event-centric graph-based memory framework enables more effective long-horizon reasoning by providing structured, logically-connected memory organization beyond superficial retrieval.

Abstract: Large language models (LLMs) are increasingly deployed as intelligent agents that reason, plan, and interact with their environments. To effectively scale to long-horizon scenarios, a key capability for such agents is a memory mechanism that can retain, organize, and retrieve past experiences to support downstream decision-making. However, most existing approaches organize and store memories in a flat manner and rely on simple similarity-based retrieval techniques. Even when structured memory is introduced, existing methods often struggle to explicitly capture the logical relationships among experiences or memory units. Moreover, memory access is largely detached from the constructed structure and still depends on shallow semantic retrieval, preventing agents from reasoning logically over long-horizon dependencies. In this work, we propose CompassMem, an event-centric memory framework inspired by Event Segmentation Theory. CompassMem organizes memory as an Event Graph by incrementally segmenting experiences into events and linking them through explicit logical relations. This graph serves as a logic map, enabling agents to perform structured and goal-directed navigation over memory beyond superficial retrieval, progressively gathering valuable memories to support long-horizon reasoning. Experiments on LoCoMo and NarrativeQA demonstrate that CompassMem consistently improves both retrieval and reasoning performance across multiple backbone models.

[333] Miner:Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

Shuyang Jiang, Yuhao Wang, Ya Zhang, Yanfeng Wang, Yu Wang

Main category: cs.AI

TL;DR: Miner introduces a novel RL method that uses the policy’s intrinsic uncertainty as a self-supervised reward signal to solve inefficiency in critic-free RL for reasoning models, achieving state-of-the-art performance.

Details

Motivation: Current critic-free RL methods for large reasoning models are inefficient when training on positive homogeneous prompts (where all rollouts are correct), wasting rollouts due to zero advantage estimates.

Method: Miner repurposes the policy’s intrinsic uncertainty as a self-supervised reward signal without external supervision. Key innovations: (1) token-level focal credit assignment that amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to integrate intrinsic and verifiable rewards.

Result: Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner achieves state-of-the-art performance among four other algorithms, with up to 4.58 absolute gains in Pass@1 and 6.66 gains in Pass@K compared to GRPO.

Conclusion: Latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models, as demonstrated by Miner’s superior performance over other exploration enhancement methods.

Abstract: Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to \uline{M}ine \uline{in}trinsic mast\uline{er}y (Miner), that repurposes the policy’s intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner achieves state-of-the-art performance among the other four algorithms, yielding up to \textbf{4.58} absolute gains in Pass@1 and \textbf{6.66} gains in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further discloses the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models.

[334] KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, Ronghao Chen

Main category: cs.AI

TL;DR: KnowMeBench is a benchmark for evaluating person understanding using long-form autobiographical narratives, testing models on factual recall, subjective state attribution, and principle-level reasoning.

Details

Motivation: Existing memory benchmarks use multi-turn dialogues or synthetic histories, making retrieval performance an imperfect proxy for true person understanding. The authors aim to create a more realistic benchmark using authentic autobiographical narratives.

Method: The benchmark reconstructs autobiographical narratives into flashback-aware, time-anchored streams and evaluates models with evidence-linked questions across three categories: factual recall, subjective state attribution, and principle-level reasoning.

Result: Retrieval-augmented systems mainly improve factual accuracy but errors persist on temporally grounded explanations and higher-level inferences, showing limitations of current retrieval-based approaches.

Conclusion: The benchmark highlights the need for memory mechanisms beyond simple retrieval to achieve true person understanding, as current systems struggle with temporal reasoning and higher-level inference tasks.

Abstract: Existing long-horizon memory benchmarks mostly use multi-turn dialogues or synthetic user histories, which makes retrieval performance an imperfect proxy for person understanding. We present \BenchName, a publicly releasable benchmark built from long-form autobiographical narratives, where actions, context, and inner thoughts provide dense evidence for inferring stable motivations and decision principles. \BenchName~reconstructs each narrative into a flashback-aware, time-anchored stream and evaluates models with evidence-linked questions spanning factual recall, subjective state attribution, and principle-level reasoning. Across diverse narrative sources, retrieval-augmented systems mainly improve factual accuracy, while errors persist on temporally grounded explanations and higher-level inferences, highlighting the need for memory mechanisms beyond retrieval. Our data is in \href{KnowMeBench}{https://github.com/QuantaAlpha/KnowMeBench}.

[335] Orion-RAG: Path-Aligned Hybrid Retrieval for Graphless Data

Zhen Chen, Weihao Xie, Peilin Chen, Shiqi Wang, Jianping Wang

Main category: cs.AI

TL;DR: Orion-RAG is a lightweight RAG system that extracts natural connections between fragmented documents without heavy algorithms, outperforming mainstream frameworks with 25.2% improvement on FinanceBench.

Details

Motivation: Standard RAG struggles with discrete, fragmented data where information is distributed across isolated files without explicit links. Manual Knowledge Graph construction is impractical for vast data, and standard search engines process files independently ignoring connections.

Method: Uses low-complexity strategy to extract lightweight paths that naturally link related concepts, transforming fragmented documents into semi-structured data without heavy algorithms. Enables linking information across different files effectively.

Result: Consistently outperforms mainstream frameworks across diverse domains, supports real-time updates and explicit Human-in-the-Loop verification with high cost-efficiency. On FinanceBench, achieves 25.2% relative improvement in precision over strong baselines.

Conclusion: A streamlined, lightweight approach to extracting natural connections between fragmented documents is sufficient for effective RAG, offering superior performance, real-time capability, and cost-efficiency compared to complex algorithms.

Abstract: Retrieval-Augmented Generation (RAG) has proven effective for knowledge synthesis, yet it encounters significant challenges in practical scenarios where data is inherently discrete and fragmented. In most environments, information is distributed across isolated files like reports and logs that lack explicit links. Standard search engines process files independently, ignoring the connections between them. Furthermore, manually building Knowledge Graphs is impractical for such vast data. To bridge this gap, we present Orion-RAG. Our core insight is simple yet effective: we do not need heavy algorithms to organize this data. Instead, we use a low-complexity strategy to extract lightweight paths that naturally link related concepts. We demonstrate that this streamlined approach suffices to transform fragmented documents into semi-structured data, enabling the system to link information across different files effectively. Extensive experiments demonstrate that Orion-RAG consistently outperforms mainstream frameworks across diverse domains, supporting real-time updates and explicit Human-in-the-Loop verification with high cost-efficiency. Experiments on FinanceBench demonstrate superior precision with a 25.2% relative improvement over strong baselines.

[336] AT$^2$PO: Agentic Turn-based Policy Optimization via Tree Search

Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang

Main category: cs.AI

TL;DR: AT²PO is a unified framework for multi-turn agentic RL that addresses exploration diversity, credit assignment, and policy optimization challenges through turn-level tree search and optimization.

Details

Motivation: LLM agents need refinement through agentic reinforcement learning, but current approaches face three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization in multi-turn tasks.

Method: AT²PO introduces a turn-level tree structure with Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for reward propagation. It also proposes Agentic Turn-based Policy Optimization (ATPO) as a turn-level learning objective that aligns with agentic decision granularity.

Result: Experiments across seven benchmarks show consistent improvements over state-of-the-art baselines by up to 1.84 percentage points on average, with ablation studies validating each component’s effectiveness.

Conclusion: AT²PO provides a unified framework that successfully addresses key challenges in multi-turn agentic RL, with ATPO being orthogonal to tree search and easily integrable into existing RL pipelines.

Abstract: LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT$^2$PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT$^2$PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points in average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.

[337] SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence

Encheng Su, Jianyu Wu, Chen Tang, Lintao Wang, Pengze Li, Aoran Wang, Jinouwen Zhang, Yizhou Wang, Yuan Meng, Xinzhu Ma, Shixiang Tang, Houqiang Li

Main category: cs.AI

TL;DR: SciIF is a new benchmark that evaluates LLMs’ ability to follow scientific constraints while solving problems, focusing on explicit evidence of constraint satisfaction rather than just final answers.

Details

Motivation: Existing benchmarks have a critical blind spot: general instruction-following metrics focus on superficial formatting, while domain-specific scientific benchmarks only assess final-answer correctness, often rewarding models that get the right result with wrong reasoning. There's a need to evaluate LLMs' ability to adhere to scientific validity constraints as they transition from general knowledge retrieval to complex scientific discovery.

Method: Introduces SciIF, a multi-discipline benchmark that pairs university-level problems with a fixed catalog of constraints across three pillars: scientific conditions (boundary checks, assumptions), semantic stability (unit and symbol conventions), and specific processes (required numerical methods). Emphasizes auditability by requiring models to provide explicit evidence of constraint satisfaction.

Result: By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures in LLMs.

Conclusion: SciIF ensures LLMs can function as reliable agents within the strict logical frameworks of science by evaluating their scientific instruction-following capability - the ability to solve problems while strictly adhering to constraints that establish scientific validity.

Abstract: As large language models (LLMs) transition from general knowledge retrieval to complex scientific discovery, their evaluation standards must also incorporate the rigorous norms of scientific inquiry. Existing benchmarks exhibit a critical blind spot: general instruction-following metrics focus on superficial formatting, while domain-specific scientific benchmarks assess only final-answer correctness, often rewarding models that arrive at the right result with the wrong reasons. To address this gap, we introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints across three pillars: scientific conditions (e.g., boundary checks and assumptions), semantic stability (e.g., unit and symbol conventions), and specific processes(e.g., required numerical methods). Uniquely, SciIF emphasizes auditability, requiring models to provide explicit evidence of constraint satisfaction rather than implicit compliance. By measuring both solution correctness and multi-constraint adherence, SciIF enables finegrained diagnosis of compositional reasoning failures, ensuring that LLMs can function as reliable agents within the strict logical frameworks of science.

[338] APEX: Academic Poster Editing Agentic Expert

Chengxin Shi, Qinnan Cai, Zeyuan Chen, Long Zeng, Yibo Zhao, Jing Yu, Jianxiang Yu, Xiang Li

Main category: cs.AI

TL;DR: APEX is an agentic framework for interactive academic poster editing with multi-level API-based editing and review-adjustment mechanism, outperforming baselines on a new benchmark.

Details

Motivation: Existing paper-to-poster generation methods are single-pass, non-interactive, and often fail to align with complex user intent, creating a need for interactive editing frameworks.

Method: APEX framework with fine-grained control using robust multi-level API-based editing and a review-and-adjustment mechanism, plus APEX-Bench benchmark with 514 instructions categorized by operation type, difficulty, and abstraction level.

Result: APEX significantly outperforms baseline methods on the APEX-Bench benchmark using multi-dimensional VLM-as-a-judge evaluation protocol.

Conclusion: APEX provides the first agentic framework for interactive academic poster editing with systematic evaluation, addressing limitations of existing single-pass methods.

Abstract: Designing academic posters is a labor-intensive process requiring the precise balance of high-density content and sophisticated layout. While existing paper-to-poster generation methods automate initial drafting, they are typically single-pass and non-interactive, often fail to align with complex, subjective user intent. To bridge this gap, we propose APEX (Academic Poster Editing agentic eXpert), the first agentic framework for interactive academic poster editing, supporting fine-grained control with robust multi-level API-based editing and a review-and-adjustment Mechanism. In addition, we introduce APEX-Bench, the first systematic benchmark comprising 514 academic poster editing instructions, categorized by a multi-dimensional taxonomy including operation type, difficulty, and abstraction level, constructed via reference-guided and reference-free strategies to ensure realism and diversity. We further establish a multi-dimensional VLM-as-a-judge evaluation protocol to assess instruction fulfillment, modification scope, and visual consistency & harmony. Experimental results demonstrate that APEX significantly outperforms baseline methods. Our implementation is available at https://github.com/Breesiu/APEX.

[339] Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

Siyuan Gan, Jiaheng Liu, Boyan Wang, Tianpei Yang, Runqing Miao, Yuyao Zhang, Fanyu Meng, Junlan Feng, Linjian Meng, Jing Huo, Yang Gao

Main category: cs.AI

TL;DR: TNT (Thinking-Based Non-Thinking) reduces computational overhead in large reasoning models by dynamically adjusting token limits for non-thinking responses based on thinking-based solutions, achieving ~50% token reduction while improving accuracy.

Details

Motivation: Large reasoning models suffer from "overthinking" - using long chains of thought that increase computational overhead. Existing RL-based approaches to decide when to think suffer from reward hacking problems, while SFT alternatives are computationally expensive and uniform token limits provide limited mitigation.

Method: TNT (Thinking-Based Non-Thinking) avoids SFT and sets different maximum token usage for non-thinking responses across queries by leveraging information from thinking-based solutions. It dynamically adjusts token limits based on the complexity revealed by thinking-based approaches.

Result: TNT reduces token usage by ~50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B while significantly improving accuracy. It achieves optimal accuracy-efficiency trade-off and keeps reward hacking probability below 10% for non-thinking responses across all datasets.

Conclusion: TNT effectively addresses the overthinking problem in large reasoning models by intelligently managing when to use thinking vs. non-thinking responses with dynamic token limits, achieving superior efficiency-accuracy balance while minimizing reward hacking issues.

Abstract: Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increase computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking or not based on the complexity of the query. Unfortunately, using RL will suffer the the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which yields limited mitigation of the problem. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and sets different maximum token usage for responses not using thinking across various queries by leveraging information from the solution component of the responses using thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the optimal trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking problem in TNT’s responses, which are classified as not using thinking, remains below 10% across all tested datasets.

[340] SCALER:Synthetic Scalable Adaptive Learning Environment for Reasoning

Caijun Xu, Changyi Xiao, Zhongyuan Peng, Xinrun Wang, Yixin Cao

Main category: cs.AI

TL;DR: SCALER is a framework that creates adaptive synthetic reasoning environments to sustain effective RL training signals for LLMs by dynamically adjusting difficulty and maintaining task diversity.

Details

Motivation: Current RL training for LLMs often slows down when task difficulty becomes misaligned with model capability or when training is dominated by narrow problem patterns, leading to reward sparsity and overfitting.

Method: SCALER uses a scalable synthesis pipeline to convert real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation. It employs adaptive multi-environment RL that dynamically adjusts instance difficulty and curates active environments to track model capability and maintain diversity.

Result: SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

Conclusion: SCALER’s adaptive environment design effectively sustains informative learning signals throughout RL training, preventing reward sparsity and overfitting while enabling continuous improvement in reasoning capabilities.

Abstract: Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model’s capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.

[341] AECV-Bench: Benchmarking Multimodal Models on Architectural and Engineering Drawings Understanding

Aleksei Kondratenko, Mussie Birhane, Houssame E. Hsain, Guido Maciocci

Main category: cs.AI

TL;DR: AECV-Bench benchmark evaluates multimodal models on AEC drawings, showing they excel at text-based tasks but struggle with symbol interpretation like door/window counting.

Details

Motivation: To assess whether modern multimodal and vision-language models can reliably interpret the graphical language of AEC drawings, which encode geometry and semantics through symbols, layout conventions, and dense annotation.

Method: Created AECV-Bench with two use cases: (1) object counting on 120 floor plans for doors, windows, bedrooms, toilets; (2) drawing-grounded document QA with 192 question-answer pairs testing OCR, instance counting, spatial reasoning, and comparative reasoning. Used per-field exact-match accuracy, MAPE, LLM-as-a-judge scoring, and human adjudication.

Result: Models show a capability gradient: OCR/text-centric QA strongest (up to 0.95 accuracy), spatial reasoning moderate, symbol-centric drawing understanding weakest (0.40-0.55 accuracy for door/window counting). Current systems work well as document assistants but lack robust drawing literacy.

Conclusion: Current multimodal models function well as document assistants but lack robust drawing literacy for AEC automation, motivating domain-specific representations and tool-augmented, human-in-the-loop workflows.

Abstract: AEC drawings encode geometry and semantics through symbols, layout conventions, and dense annotation, yet it remains unclear whether modern multimodal and vision-language models can reliably interpret this graphical language. We present AECV-Bench, a benchmark for evaluating multimodal and vision-language models on realistic AEC artefacts via two complementary use cases: (i) object counting on 120 high-quality floor plans (doors, windows, bedrooms, toilets), and (ii) drawing-grounded document QA spanning 192 question-answer pairs that test text extraction (OCR), instance counting, spatial reasoning, and comparative reasoning over common drawing regions. Object-counting performance is reported using per-field exact-match accuracy and MAPE results, while document-QA performance is reported using overall accuracy and per-category breakdowns with an LLM-as-a-judge scoring pipeline and targeted human adjudication for edge cases. Evaluating a broad set of state-of-the-art models under a unified protocol, we observe a stable capability gradient; OCR and text-centric document QA are strongest (up to 0.95 accuracy), spatial reasoning is moderate, and symbol-centric drawing understanding - especially reliable counting of doors and windows - remains unsolved (often 0.40-0.55 accuracy) with substantial proportional errors. These results suggest that current systems function well as document assistants but lack robust drawing literacy, motivating domain-specific representations and tool-augmented, human-in-the-loop workflows for an efficient AEC automation.

[342] DR-LoRA: Dynamic Rank LoRA for Mixture-of-Experts Adaptation

Guanzhi Deng, Bo Li, Ronghao Chen, Huacan Wang, Linqi Song, Lijie Wen

Main category: cs.AI

TL;DR: DR-LoRA: A dynamic rank allocation framework for fine-tuning MoE LLMs that assigns different LoRA ranks to experts based on task-specific demands, outperforming uniform allocation methods.

Details

Motivation: Existing PEFT methods like LoRA assign identical ranks to all experts in MoE LLMs, ignoring functional specialization and causing resource mismatch - task-relevant experts get insufficient capacity while irrelevant ones waste parameters.

Method: DR-LoRA dynamically grows expert LoRA ranks during fine-tuning using Expert Saliency Scoring that combines expert routing frequency and LoRA rank importance to quantify each expert’s capacity needs. Higher-saliency experts get priority for rank expansion.

Result: Experiments on multiple benchmarks show DR-LoRA consistently outperforms standard LoRA and static allocation strategies under the same parameter budget, achieving better task performance with more efficient parameter utilization.

Conclusion: DR-LoRA enables automatic formation of heterogeneous rank distributions tailored to specific tasks, addressing resource mismatch in MoE fine-tuning and improving parameter efficiency.

Abstract: Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning (PEFT), such as LoRA, is widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches assign identical LoRA ranks to all experts, overlooking the intrinsic functional specialization within MoE LLMs. This uniform allocation leads to resource mismatch, task-relevant experts are under-provisioned while less relevant ones receive redundant parameters. We propose a Dynamic Rank LoRA framework named DR-LoRA, which dynamically grows expert LoRA ranks during fine-tuning based on task-specific demands. DR-LoRA employs an Expert Saliency Scoring mechanism that integrates expert routing frequency and LoRA rank importance to quantify each expert’s demand for additional capacity. Experts with higher saliency scores are prioritized for rank expansion, enabling the automatic formation of a heterogeneous rank distribution tailored to the target task. Experiments on multiple benchmarks demonstrate that DR-LoRA consistently outperforms standard LoRA and static allocation strategies under the same parameter budget, achieving superior task performance with more efficient parameter utilization.

[343] Orchestrating Intelligence: Confidence-Aware Routing for Efficient Multi-Agent Collaboration across Multi-Scale Models

Jingbo Wang, Sendong Zhao, Jiatong Liu, Haochun Wang, Wanting Li, Bing Qin, Ting Liu

Main category: cs.AI

TL;DR: OI-MAS is a multi-agent framework that uses adaptive model selection across heterogeneous LLMs to improve computational efficiency while maintaining performance.

Details

Motivation: Existing multi-agent systems suffer from computational inefficiencies by uniformly deploying large LLMs across all agent roles, failing to account for varying cognitive demands at different reasoning stages.

Method: Proposes OI-MAS framework with: 1) adaptive model-selection policy across heterogeneous multi-scale LLMs, 2) state-dependent routing mechanism for dynamic agent role and model scale selection, and 3) confidence-aware mechanism that selects model scales based on task complexity.

Result: OI-MAS outperforms baseline multi-agent systems, improving accuracy by up to 12.88% while reducing cost by up to 79.78%.

Conclusion: The OI-MAS framework successfully addresses computational inefficiencies in multi-agent systems through adaptive model selection, achieving better performance at significantly lower cost.

Abstract: While multi-agent systems (MAS) have demonstrated superior performance over single-agent approaches in complex reasoning tasks, they often suffer from significant computational inefficiencies. Existing frameworks typically deploy large language models (LLMs) uniformly across all agent roles, failing to account for the varying cognitive demands of different reasoning stages. We address this inefficiency by proposing OI-MAS framework, a novel multi-agent framework that implements an adaptive model-selection policy across a heterogeneous pool of multi-scale LLMs. Specifically, OI-MAS introduces a state-dependent routing mechanism that dynamically selects agent roles and model scales throughout the reasoning process. In addition, we introduce a confidence-aware mechanism that selects appropriate model scales conditioned on task complexity, thus reducing unnecessary reliance on large-scale models. Experimental results show that OI-MAS consistently outperforms baseline multi-agent systems, improving accuracy by up to 12.88% while reducing cost by up to 79.78%.

[344] Key-Value Pair-Free Continual Learner via Task-Specific Prompt-Prototype

Haihua Luo, Xuming Ran, Zhengji Li, Huiyan Xue, Tingting Jiang, Jiangrong Shen, Tommi Kärkkäinen, Qi Xu, Fengyu Cong

Main category: cs.AI

TL;DR: ProP: A prompt-based continual learning method that uses task-specific prompt-prototype pairs instead of key-value pairs to reduce inter-task interference and improve scalability.

Details

Motivation: Existing prompt-based continual learning methods rely on key-value pairing, which introduces inter-task interference and scalability limitations. The authors aim to overcome these issues by eliminating the dependency on key-value pairs.

Method: Proposes task-specific Prompt-Prototype (ProP) pairs where prompts facilitate feature learning for current tasks and prototypes capture representative input features. During inference, predictions are made by binding each task-specific prompt with its associated prototype. Also introduces regularization constraints during prompt initialization to penalize large values for enhanced stability.

Result: Experiments on several widely used datasets demonstrate the effectiveness of the proposed method in continual learning scenarios.

Conclusion: The framework successfully removes dependency on key-value pairs, offering a fresh perspective for future continual learning research by reducing inter-task interference and improving scalability.

Abstract: Continual learning aims to enable models to acquire new knowledge while retaining previously learned information. Prompt-based methods have shown remarkable performance in this domain; however, they typically rely on key-value pairing, which can introduce inter-task interference and hinder scalability. To overcome these limitations, we propose a novel approach employing task-specific Prompt-Prototype (ProP), thereby eliminating the need for key-value pairs. In our method, task-specific prompts facilitate more effective feature learning for the current task, while corresponding prototypes capture the representative features of the input. During inference, predictions are generated by binding each task-specific prompt with its associated prototype. Additionally, we introduce regularization constraints during prompt initialization to penalize excessively large values, thereby enhancing stability. Experiments on several widely used datasets demonstrate the effectiveness of the proposed method. In contrast to mainstream prompt-based approaches, our framework removes the dependency on key-value pairs, offering a fresh perspective for future continual learning research.

[345] Higher-Order Knowledge Representations for Agentic Scientific Reasoning

Isabella A. Stewart, Markus J. Buehler

Main category: cs.AI

TL;DR: Researchers develop hypergraph-based knowledge representation to capture higher-order interactions in scientific data, enabling AI systems to generate novel mechanistic hypotheses without explicit supervision.

Details

Motivation: Current LLMs lack structural depth in reasoning, while traditional knowledge graphs fail to capture irreducible higher-order interactions that govern emergent physical behavior in scientific domains.

Method: Construct hypergraph-based knowledge representations that encode multi-entity relationships, applied to ~1,100 manuscripts on biocomposite scaffolds. Use node-intersection constraints for hypergraph traversal in agentic systems.

Result: Built a global hypergraph with 161,172 nodes and 320,201 hyperedges showing scale-free topology (power law exponent ~1.23). System successfully generates grounded mechanistic hypotheses, such as linking cerium oxide to PCL scaffolds via chitosan intermediates.

Conclusion: Hypergraph topology serves as a verifiable guardrail for “teacherless” agentic reasoning systems, accelerating scientific discovery by uncovering relationships obscured by traditional graph methods.

Abstract: Scientific inquiry requires systems-level reasoning that integrates heterogeneous experimental data, cross-domain knowledge, and mechanistic evidence into coherent explanations. While Large Language Models (LLMs) offer inferential capabilities, they often depend on retrieval-augmented contexts that lack structural depth. Traditional Knowledge Graphs (KGs) attempt to bridge this gap, yet their pairwise constraints fail to capture the irreducible higher-order interactions that govern emergent physical behavior. To address this, we introduce a methodology for constructing hypergraph-based knowledge representations that faithfully encode multi-entity relationships. Applied to a corpus of ~1,100 manuscripts on biocomposite scaffolds, our framework constructs a global hypergraph of 161,172 nodes and 320,201 hyperedges, revealing a scale-free topology (power law exponent ~1.23) organized around highly connected conceptual hubs. This representation prevents the combinatorial explosion typical of pairwise expansions and explicitly preserves the co-occurrence context of scientific formulations. We further demonstrate that equipping agentic systems with hypergraph traversal tools, specifically using node-intersection constraints, enables them to bridge semantically distant concepts. By exploiting these higher-order pathways, the system successfully generates grounded mechanistic hypotheses for novel composite materials, such as linking cerium oxide to PCL scaffolds via chitosan intermediates. This work establishes a “teacherless” agentic reasoning system where hypergraph topology acts as a verifiable guardrail, accelerating scientific discovery by uncovering relationships obscured by traditional graph methods.

[346] Precomputing Multi-Agent Path Replanning using Temporal Flexibility: A Case Study on the Dutch Railway Network

Issa Hanou, Eric Kemmeren, Devin Wild Thomas, Mathijs de Weerdt

Main category: cs.AI

TL;DR: FlexSIPP algorithm efficiently replans multi-agent systems when one agent is delayed by leveraging temporal flexibility of other agents to avoid cascading delays.

Details

Motivation: Multi-agent plan execution becomes challenging when an agent is delayed, creating conflicts with other agents. Replanning only the delayed agent often fails to produce efficient or feasible plans, while replanning all agents can cause cascading delays and changes.

Method: FlexSIPP algorithm tracks and uses temporal flexibility of other agents - the maximum delay an agent can take without changing the order of or further delaying more agents. It precomputes all possible plans for the delayed agent along with necessary changes for other agents, for any single-agent delay within a given scenario.

Result: The method was demonstrated in a real-world case study of replanning trains in the densely-used Dutch railway network. Experiments show FlexSIPP provides effective solutions relevant to real-world adjustments within reasonable timeframes.

Conclusion: FlexSIPP offers an efficient approach to multi-agent replanning by leveraging temporal flexibility to handle agent delays while minimizing cascading effects, proving practical for real-world applications like railway scheduling.

Abstract: Executing a multi-agent plan can be challenging when an agent is delayed, because this typically creates conflicts with other agents. So, we need to quickly find a new safe plan. Replanning only the delayed agent often does not result in an efficient plan, and sometimes cannot even yield a feasible plan. On the other hand, replanning other agents may lead to a cascade of changes and delays. We show how to efficiently replan by tracking and using the temporal flexibility of other agents while avoiding cascading delays. This flexibility is the maximum delay an agent can take without changing the order of or further delaying more agents. Our algorithm, FlexSIPP, precomputes all possible plans for the delayed agent, also returning the changes for the other agents, for any single-agent delay within the given scenario. We demonstrate our method in a real-world case study of replanning trains in the densely-used Dutch railway network. Our experiments show that FlexSIPP provides effective solutions, relevant to real-world adjustments, and within a reasonable timeframe.

Sofiene Lassoued, Laxmikant Shrikant Bahetic, Nathalie Weiß-Borkowskib, Stefan Lierc, Andreas Schwunga

Main category: cs.AI

TL;DR: A novel approach combining Colored-Timed Petri Nets with actor-critic model-based reinforcement learning for Flexible Manufacturing Systems scheduling with AGVs and tool-sharing, achieving better performance on large instances with 10x faster computation.

Details

Motivation: Traditional job shop scheduling needs enhancement to handle modern manufacturing complexities including automated guided vehicles (AGVs) and tool-sharing systems in Flexible Manufacturing Systems.

Method: Combines Colored-Timed Petri Nets (CTPNs) for formal modeling and dynamic action masking with actor-critic model-based reinforcement learning (MBRL) for adaptability, plus lookahead strategy for AGV positioning.

Result: Matches traditional methods on small benchmarks, outperforms them on large instances in makespan, achieves 10x reduction in computation time, validated through ablation studies.

Conclusion: The proposed CTPN+MBRL framework effectively addresses complex FMS scheduling with AGVs and tool-sharing, offering superior scalability and efficiency for large-scale manufacturing problems.

Abstract: Flexible Manufacturing Systems (FMS) are pivotal in optimizing production processes in today’s rapidly evolving manufacturing landscape. This paper advances the traditional job shop scheduling problem by incorporating additional complexities through the simultaneous integration of automated guided vehicles (AGVs) and tool-sharing systems. We propose a novel approach that combines Colored-Timed Petri Nets (CTPNs) with actor-critic model-based reinforcement learning (MBRL), effectively addressing the multifaceted challenges associated with FMS. CTPNs provide a formal modeling structure and dynamic action masking, significantly reducing the action search space, while MBRL ensures adaptability to changing environments through the learned policy. Leveraging the advantages of MBRL, we incorporate a lookahead strategy for optimal positioning of AGVs, improving operational efficiency. Our approach was evaluated on small-sized public benchmarks and a newly developed large-scale benchmark inspired by the Taillard benchmark. The results show that our approach matches traditional methods on smaller instances and outperforms them on larger ones in terms of makespan while achieving a tenfold reduction in computation time. To ensure reproducibility, we propose a gym-compatible environment and an instance generator. Additionally, an ablation study evaluates the contribution of each framework component to its overall performance.

Tongyu Wen, Guanting Dong, Zhicheng Dou

Main category: cs.AI

TL;DR: SmartSearch improves LLM-based search agents by optimizing intermediate search query quality through process rewards and query refinement mechanisms, achieving better search efficiency and performance.

Details

Motivation: Existing LLM-based search agents focus on reasoning paradigms but overlook query quality, leading to inaccurate queries, poor retrieval results, and limited overall effectiveness.

Method: Introduces SmartSearch with two key mechanisms: (1) Process rewards with Dual-Level Credit Assessment for fine-grained supervision of query quality, and (2) Query refinement that selectively improves low-quality queries and regenerates subsequent searches. Uses three-stage curriculum learning (imitation → alignment → generalization) to internalize query improvement.

Result: SmartSearch consistently surpasses existing baselines, showing significant gains in both search efficiency and query quality.

Conclusion: Optimizing intermediate search query quality through process rewards and refinement mechanisms significantly improves LLM-based search agent performance, with SmartSearch demonstrating superior effectiveness over existing approaches.

Abstract: Large language model (LLM)-based search agents have proven promising for addressing knowledge-intensive problems by incorporating information retrieval capabilities. Existing works largely focus on optimizing the reasoning paradigms of search agents, yet the quality of intermediate search queries during reasoning remains overlooked. As a result, the generated queries often remain inaccurate, leading to unexpected retrieval results and ultimately limiting search agents’ overall effectiveness. To mitigate this issue, we introduce SmartSearch, a framework built upon two key mechanisms: (1) Process rewards, which provide fine-grained supervision for the quality of each intermediate search query through Dual-Level Credit Assessment. (2) Query refinement, which promotes the optimization of query generation by selectively refining low-quality search queries and regenerating subsequent search rounds based on these refinements. To enable the search agent to progressively internalize the ability to improve query quality under the guidance of process rewards, we design a three-stage curriculum learning framework. This framework guides the agent through a progression from imitation, to alignment, and ultimately to generalization. Experimental results show that SmartSearch consistently surpasses existing baselines, and additional quantitative analyses further confirm its significant gains in both search efficiency and query quality. The code is available at https://github.com/MYVAE/SmartSearch.

[349] DVD: A Robust Method for Detecting Variant Contamination in Large Language Model Evaluation

Renzhao Liang, Jingru Chen, Bo Jia, Bo Deng, Chenggang Xie, Yidong Wang, Ke Jin, Xin Wang, Linfeng Zhang, Cunxiang Wang

Main category: cs.AI

TL;DR: DVD detects variant contamination in LLM evaluation by analyzing variance in generation distributions, outperforming existing methods on benchmark datasets.

Details

Motivation: Current LLM evaluation is confounded by variant contamination - where training data contains semantically equivalent but lexically/syntactically altered versions of test items, which evade existing detectors and inflate benchmark scores through memorization rather than genuine reasoning.

Method: Introduces DVD (Detection via Variance of generation Distribution), a single-sample detector that models local output distribution via temperature sampling. Key insight: contaminated items trigger alternation between memory-adherence and perturbation-drift states, yielding abnormally high variance in synthetic difficulty of low-probability tokens.

Result: DVD consistently outperforms perplexity-based, Min-k%++, edit-distance (CDD), and embedding-similarity baselines across datasets (Omni-MATH and SuperGPQA) and models (Qwen2.5 and Llama3.1), with strong robustness to hyperparameters.

Conclusion: Variance of the generation distribution serves as a principled and practical fingerprint for detecting variant contamination in LLM evaluation, addressing a critical problem in benchmark reliability.

Abstract: Evaluating large language models (LLMs) is increasingly confounded by \emph{variant contamination}: the training corpus contains semantically equivalent yet lexically or syntactically altered versions of test items. Unlike verbatim leakage, these paraphrased or structurally transformed variants evade existing detectors based on sampling consistency or perplexity, thereby inflating benchmark scores via memorization rather than genuine reasoning. We formalize this problem and introduce \textbf{DVD} (\textbf{D}etection via \textbf{V}ariance of generation \textbf{D}istribution), a single-sample detector that models the local output distribution induced by temperature sampling. Our key insight is that contaminated items trigger alternation between a \emph{memory-adherence} state and a \emph{perturbation-drift} state, yielding abnormally high variance in the synthetic difficulty of low-probability tokens; uncontaminated items remain in drift with comparatively smooth variance. We construct the first benchmark for variant contamination across two domains Omni-MATH and SuperGPQA by generating and filtering semantically equivalent variants, and simulate contamination via fine-tuning models of different scales and architectures (Qwen2.5 and Llama3.1). Across datasets and models, \textbf{DVD} consistently outperforms perplexity-based, Min-$k$%++, edit-distance (CDD), and embedding-similarity baselines, while exhibiting strong robustness to hyperparameters. Our results establish variance of the generation distribution as a principled and practical fingerprint for detecting variant contamination in LLM evaluation.

[350] From Stories to Cities to Games: A Qualitative Evaluation of Behaviour Planning

Mustafa F. Abdelwahed, Joan Espasa, Alice Toniolo, Ian P. Gent

Main category: cs.AI

TL;DR: Paper presents three real-world case studies demonstrating the practical application of behaviour planning, a novel diverse planning paradigm that incorporates explicit diversity models and supports multiple planning categories.

Details

Motivation: Diverse planning approaches are valuable in real-world domains like risk management, automated stream data analysis, and malware detection, but the recently proposed behaviour planning paradigm needs practical validation through real-world applications.

Method: The paper employs three case studies to demonstrate behaviour planning: 1) storytelling, 2) urban planning, and 3) game evaluation, showing how this diverse planning approach with explicit diversity models works in practical settings.

Result: The paper demonstrates the usefulness of behaviour planning through three successful real-world applications, showing how this approach can generate distinct plans while incorporating diversity models across different domains.

Conclusion: Behaviour planning is a valuable diverse planning paradigm that effectively incorporates explicit diversity models and supports multiple planning categories, as evidenced by its successful application in storytelling, urban planning, and game evaluation case studies.

Abstract: The primary objective of a diverse planning approach is to generate a set of plans that are distinct from one another. Such an approach is applied in a variety of real-world domains, including risk management, automated stream data analysis, and malware detection. More recently, a novel diverse planning paradigm, referred to as behaviour planning, has been proposed. This approach extends earlier methods by explicitly incorporating a diversity model into the planning process and supporting multiple planning categories. In this paper, we demonstrate the usefulness of behaviour planning in real-world settings by presenting three case studies. The first case study focuses on storytelling, the second addresses urban planning, and the third examines game evaluation.

[351] What Students Ask, How a Generative AI Assistant Responds: Exploring Higher Education Students’ Dialogues on Learning Analytics Feedback

Yildiz Uzun, Andrea Gauthier, Mutlu Cukurova

Main category: cs.AI

TL;DR: Study explores how GenAI assistants integrated into learning analytics dashboards can scaffold students’ engagement with feedback, finding distinct query patterns between high and low self-regulated learners and identifying both benefits and limitations of current GenAI implementations.

Details

Motivation: Students, especially those with lower self-regulated learning (SRL) competence, often struggle to engage with and interpret analytics feedback from learning analytics dashboards (LADs). Conversational GenAI assistants show potential to scaffold this process through real-time, personalized dialogue-based support.

Method: Explored authentic dialogues between students and GenAI assistant integrated into LAD during a 10-week semester. Analyzed questions students with different SRL levels posed, relevance/quality of assistant’s answers, and student perceptions of the assistant’s role in learning.

Result: Distinct query patterns emerged: low SRL students sought clarification and reassurance, while high SRL students queried technical aspects and requested personalized strategies. The assistant provided clear, reliable explanations but limited in personalization, handling emotional queries, and integrating multiple data points. GenAI interventions were especially valuable for low SRL students, narrowing gaps with higher SRL peers.

Conclusion: GenAI assistants show promise for scaffolding student engagement with learning analytics feedback, particularly benefiting low SRL students. Future systems need greater adaptivity, context-awareness, emotional intelligence, and technical refinement, with trust being a critical factor for adoption.

Abstract: Learning analytics dashboards (LADs) aim to support students’ regulation of learning by translating complex data into feedback. Yet students, especially those with lower self-regulated learning (SRL) competence, often struggle to engage with and interpret analytics feedback. Conversational generative artificial intelligence (GenAI) assistants have shown potential to scaffold this process through real-time, personalised, dialogue-based support. Further advancing this potential, we explored authentic dialogues between students and GenAI assistant integrated into LAD during a 10-week semester. The analysis focused on questions students with different SRL levels posed, the relevance and quality of the assistant’s answers, and how students perceived the assistant’s role in their learning. Findings revealed distinct query patterns. While low SRL students sought clarification and reassurance, high SRL students queried technical aspects and requested personalised strategies. The assistant provided clear and reliable explanations but limited in personalisation, handling emotionally charged queries, and integrating multiple data points for tailored responses. Findings further extend that GenAI interventions can be especially valuable for low SRL students, offering scaffolding that supports engagement with feedback and narrows gaps with their higher SRL peers. At the same time, students’ reflections underscored the importance of trust, need for greater adaptivity, context-awareness, and technical refinement in future systems.

[352] Conversational AI for Rapid Scientific Prototyping: A Case Study on ESA’s ELOPE Competition

Nils Einecke

Main category: cs.AI

TL;DR: Using ChatGPT for rapid prototyping in ESA’s lunar lander competition achieved 2nd place, demonstrating LLMs can accelerate scientific development but have limitations requiring structured integration.

Details

Motivation: To explore how large language models (LLMs) can accelerate scientific discovery through human-AI collaboration, specifically in competitive scientific settings like ESA's ELOPE competition for lunar lander trajectory estimation.

Method: Case study approach using ChatGPT for rapid prototyping in ESA’s ELOPE competition, where the AI contributed executable code, algorithmic reasoning, data handling routines, and methodological suggestions while analyzing strengths and limitations of conversational AI.

Result: Achieved second place in the competition with a score of 0.01282 despite joining late, demonstrating successful human-AI collaboration. Identified both strengths (code generation, algorithmic reasoning, methodological suggestions) and limitations (unnecessary structural changes, confusion by intermediate discussions, critical errors, forgetting important aspects).

Conclusion: Conversational AI can accelerate scientific development and support conceptual insight when properly integrated. Structured integration of LLMs into scientific workflows with best practices can enhance rapid prototyping and human-AI collaboration in research.

Abstract: Large language models (LLMs) are increasingly used as coding partners, yet their role in accelerating scientific discovery remains underexplored. This paper presents a case study of using ChatGPT for rapid prototyping in ESA’s ELOPE (Event-based Lunar OPtical flow Egomotion estimation) competition. The competition required participants to process event camera data to estimate lunar lander trajectories. Despite joining late, we achieved second place with a score of 0.01282, highlighting the potential of human-AI collaboration in competitive scientific settings. ChatGPT contributed not only executable code but also algorithmic reasoning, data handling routines, and methodological suggestions, such as using fixed number of events instead of fixed time spans for windowing. At the same time, we observed limitations: the model often introduced unnecessary structural changes, gets confused by intermediate discussions about alternative ideas, occasionally produced critical errors and forgets important aspects in longer scientific discussions. By analyzing these strengths and shortcomings, we show how conversational AI can both accelerate development and support conceptual insight in scientific research. We argue that structured integration of LLMs into the scientific workflow can enhance rapid prototyping by proposing best practices for AI-assisted scientific work.

[353] T-Retriever: Tree-based Hierarchical Retrieval Augmented Generation for Textual Graphs

Chunyu Wei, Huaiyu Qin, Siyuan He, Yunhai Wang, Yueguo Chen

Main category: cs.AI

TL;DR: T-Retriever: A novel tree-based retrieval framework for graph RAG that replaces rigid compression quotas with adaptive encoding and uses semantic-structural entropy to jointly optimize structure and semantics.

Details

Motivation: Current graph-based RAG approaches have two critical limitations: (1) they impose rigid layer-specific compression quotas that damage local graph structures, and (2) they prioritize topological structure while neglecting semantic content, which hinders effective hierarchical information management.

Method: T-Retriever reformulates attributed graph retrieval as tree-based retrieval using a semantic and structure-guided encoding tree. It introduces two key innovations: (1) Adaptive Compression Encoding that replaces artificial compression quotas with a global optimization strategy preserving natural hierarchical organization, and (2) Semantic-Structural Entropy (S²-Entropy) that jointly optimizes for both structural cohesion and semantic consistency in hierarchical partitions.

Result: Experiments across diverse graph reasoning benchmarks demonstrate that T-Retriever significantly outperforms state-of-the-art RAG methods, providing more coherent and contextually relevant responses to complex queries.

Conclusion: T-Retriever successfully addresses the limitations of current graph-based RAG approaches by introducing a tree-based framework that better preserves both structural and semantic information through adaptive compression and joint optimization, leading to superior performance in complex query answering.

Abstract: Retrieval-Augmented Generation (RAG) has significantly enhanced Large Language Models’ ability to access external knowledge, yet current graph-based RAG approaches face two critical limitations in managing hierarchical information: they impose rigid layer-specific compression quotas that damage local graph structures, and they prioritize topological structure while neglecting semantic content. We introduce T-Retriever, a novel framework that reformulates attributed graph retrieval as tree-based retrieval using a semantic and structure-guided encoding tree. Our approach features two key innovations: (1) Adaptive Compression Encoding, which replaces artificial compression quotas with a global optimization strategy that preserves the graph’s natural hierarchical organization, and (2) Semantic-Structural Entropy ($S^2$-Entropy), which jointly optimizes for both structural cohesion and semantic consistency when creating hierarchical partitions. Experiments across diverse graph reasoning benchmarks demonstrate that T-Retriever significantly outperforms state-of-the-art RAG methods, providing more coherent and contextually relevant responses to complex queries.

[354] ConMax: Confidence-Maximizing Compression for Efficient Chain-of-Thought Reasoning

Minda Hu, Zexuan Qiu, Zenan Xu, Kun Li, Bo Zhou, Irwin King

Main category: cs.AI

TL;DR: ConMax is a reinforcement learning framework that compresses reasoning traces in Large Reasoning Models by pruning redundancy while preserving logical coherence, achieving 43% length reduction with only 0.7% accuracy drop.

Details

Motivation: Large Reasoning Models often suffer from "overthinking" - generating redundant reasoning paths that increase computational costs without improving accuracy. Existing compression techniques for reasoning traces either compromise logical coherence or require prohibitive sampling costs.

Method: ConMax formulates compression as a reward-driven optimization problem using reinforcement learning. It trains a policy to prune redundancy by maximizing a weighted combination of answer confidence (for predictive fidelity) and thinking confidence (for reasoning validity) through a frozen auxiliary LRM.

Result: Extensive experiments across five reasoning datasets show ConMax reduces inference length by 43% over strong baselines with only a 0.7% dip in accuracy, achieving superior efficiency-performance trade-off.

Conclusion: ConMax effectively generates high-quality, efficient training data for LRMs by automatically compressing reasoning traces while preserving essential reasoning patterns, solving the overthinking problem in large reasoning models.

Abstract: Recent breakthroughs in Large Reasoning Models (LRMs) have demonstrated that extensive Chain-of-Thought (CoT) generation is critical for enabling intricate cognitive behaviors, such as self-verification and backtracking, to solve complex tasks. However, this capability often leads to ``overthinking’’, where models generate redundant reasoning paths that inflate computational costs without improving accuracy. While Supervised Fine-Tuning (SFT) on reasoning traces is a standard paradigm for the ‘cold start’ phase, applying existing compression techniques to these traces often compromises logical coherence or incurs prohibitive sampling costs. In this paper, we introduce ConMax (Confidence-Maximizing Compression), a novel reinforcement learning framework designed to automatically compress reasoning traces while preserving essential reasoning patterns. ConMax formulates compression as a reward-driven optimization problem, training a policy to prune redundancy by maximizing a weighted combination of answer confidence for predictive fidelity and thinking confidence for reasoning validity through a frozen auxiliary LRM. Extensive experiments across five reasoning datasets demonstrate that ConMax achieves a superior efficiency-performance trade-off. Specifically, it reduces inference length by 43% over strong baselines at the cost of a mere 0.7% dip in accuracy, proving its effectiveness in generating high-quality, efficient training data for LRMs.

[355] AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms?

Henan Sun, Kaichi Yu, Yuyao Wang, Bowen Liu, Xunkai Li, Rong-Hua Li, Nuo Chen, Jia Li

Main category: cs.AI

TL;DR: AlgBench is a new expert-curated benchmark with 3,000+ problems across 27 algorithms that reveals LRMs struggle with optimized algorithms despite performing well on non-optimized tasks, exposing limitations in current training approaches.

Details

Motivation: Existing benchmarks for algorithmic reasoning are limited and fail to determine whether Large Reasoning Models truly master algorithmic reasoning. There's a need for a more comprehensive evaluation under an algorithm-centric paradigm.

Method: Created AlgBench, an expert-curated benchmark with over 3,000 original problems spanning 27 algorithms, organized under a comprehensive taxonomy including Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, and heuristic-optimized categories. Evaluated leading LRMs like Gemini-3-Pro, DeepSeek-v3.2-Speciale, and GPT-o3.

Result: Models perform well on non-optimized tasks (up to 92%) but accuracy drops sharply to around 49% on globally optimized algorithms like dynamic programming. Analysis reveals “strategic over-shifts” where models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens.

Conclusion: Current problem-centric reinforcement learning has fundamental limitations for algorithmic reasoning. An algorithm-centric training paradigm is necessary for robust algorithmic reasoning in LRMs.

Abstract: Reasoning ability has become a central focus in the advancement of Large Reasoning Models (LRMs). Although notable progress has been achieved on several reasoning benchmarks such as MATH500 and LiveCodeBench, existing benchmarks for algorithmic reasoning remain limited, failing to answer a critical question: Do LRMs truly master algorithmic reasoning? To answer this question, we propose AlgBench, an expert-curated benchmark that evaluates LRMs under an algorithm-centric paradigm. AlgBench consists of over 3,000 original problems spanning 27 algorithms, constructed by ACM algorithmic experts and organized under a comprehensive taxonomy, including Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, and heuristic-optimized categories. Empirical evaluations on leading LRMs (e.g., Gemini-3-Pro, DeepSeek-v3.2-Speciale and GPT-o3) reveal substantial performance heterogeneity: while models perform well on non-optimized tasks (up to 92%), accuracy drops sharply to around 49% on globally optimized algorithms such as dynamic programming. Further analysis uncovers \textbf{strategic over-shifts}, wherein models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens. These findings expose fundamental limitations of problem-centric reinforcement learning and highlight the necessity of an algorithm-centric training paradigm for robust algorithmic reasoning.

[356] An Empirical Investigation of Robustness in Large Language Models under Tabular Distortions

Avik Dutta, Harshit Nigam, Hosein Hasanbeig, Arjun Radhakrishna, Sumit Gulwani

Main category: cs.AI

TL;DR: LLMs struggle to detect and correct subtle distortions in tabular data representations, requiring explicit prompts to partially adjust reasoning, with accuracy dropping at least 22% under distortion.

Details

Motivation: To investigate how LLMs fail when tabular data is subjected to semantic and structural distortions, and to understand their limitations in detecting and correcting such distortions without explicit guidance.

Method: Introduced an expert-curated dataset for table question answering tasks requiring error-correction steps, and evaluated LLMs (including GPT-5.2) on their ability to handle distorted tabular data with and without explicit system prompts.

Result: LLMs lack inherent ability to detect/correct table distortions; only partially adjust with explicit prompts; SoTA models show minimum 22% accuracy drop under distortion; systematic differences in how models interpret distorted tabular information.

Conclusion: Findings raise important questions about when/how models should autonomously realign tabular inputs without explicit prompts, analogous to human behavior, highlighting a significant research gap in LLM table understanding capabilities.

Abstract: We investigate how large language models (LLMs) fail when tabular data in an otherwise canonical representation is subjected to semantic and structural distortions. Our findings reveal that LLMs lack an inherent ability to detect and correct subtle distortions in table representations. Only when provided with an explicit prior, via a system prompt, do models partially adjust their reasoning strategies and correct some distortions, though not consistently or completely. To study this phenomenon, we introduce a small, expert-curated dataset that explicitly evaluates LLMs on table question answering (TQA) tasks requiring an additional error-correction step prior to analysis. Our results reveal systematic differences in how LLMs ingest and interpret tabular information under distortion, with even SoTA models such as GPT-5.2 model exhibiting a drop of minimum 22% accuracy under distortion. These findings raise important questions for future research, particularly regarding when and how models should autonomously decide to realign tabular inputs, analogous to human behavior, without relying on explicit prompts or tabular data pre-processing.

[357] OptiSet: Unified Optimizing Set Selection and Ranking for Retrieval-Augmented Generation

Yi Jiang, Sendong Zhao, Jianbo Li, Bairui Hu, Yanrui Du, Haochun Wang, Bing Qin

Main category: cs.AI

TL;DR: OptiSet is a set-centric RAG framework that uses Expand-then-Refine paradigm and set-list wise training to select compact, complementary evidence sets instead of redundant top-k passages.

Details

Motivation: Existing RAG methods use static top-k passage selection based on individual relevance, which fails to exploit combinatorial gains among passages and introduces substantial redundancy, limiting generation quality and efficiency.

Method: 1) Expand-then-Refine paradigm: expand query into multiple perspectives for diverse candidates, then refine via re-selection. 2) Self-synthesis strategy: derive preference labels from set conditional utility changes without strong LLM supervision. 3) Set-list wise training: jointly optimize set selection and set-level ranking to favor compact, high-gain evidence sets.

Result: Extensive experiments show OptiSet improves performance on complex combinatorial problems and makes generation more efficient. The framework demonstrates better evidence selection than traditional top-k approaches.

Conclusion: OptiSet successfully addresses redundancy in RAG by treating evidence selection as a set optimization problem, achieving better generation quality through complementary evidence sets rather than individually relevant but redundant passages.

Abstract: Retrieval-Augmented Generation (RAG) improves generation quality by incorporating evidence retrieved from large external corpora. However, most existing methods rely on statically selecting top-k passages based on individual relevance, which fails to exploit combinatorial gains among passages and often introduces substantial redundancy. To address this limitation, we propose OptiSet, a set-centric framework that unifies set selection and set-level ranking for RAG. OptiSet adopts an “Expand-then-Refine” paradigm: it first expands a query into multiple perspectives to enable a diverse candidate pool and then refines the candidate pool via re-selection to form a compact evidence set. We then devise a self-synthesis strategy without strong LLM supervision to derive preference labels from the set conditional utility changes of the generator, thereby identifying complementary and redundant evidence. Finally, we introduce a set-list wise training strategy that jointly optimizes set selection and set-level ranking, enabling the model to favor compact, high-gain evidence sets. Extensive experiments demonstrate that OptiSet improves performance on complex combinatorial problems and makes generation more efficient. The source code is publicly available.

[358] How to Set the Batch Size for Large-Scale Pre-training?

Yunhua Zhou, Junhao Huang, Shuhao Xin, Yechen Zhang, Runyu Peng, Qiping Guo, Xipeng Qiu

Main category: cs.AI

TL;DR: The paper revises the Critical Batch Size theory for modern WSD learning rate schedulers, deriving new E(S) relationship, identifying B_min and B_opt thresholds, and proposing a dynamic batch size scheduler that improves training efficiency and model quality.

Details

Motivation: The original Critical Batch Size theory from OpenAI doesn't align with modern Warmup-Stable-Decay (WSD) learning rate schedulers used in large-scale pre-training, creating a gap between theory and practice that needs to be addressed.

Method: Derived a revised E(S) relationship specifically for WSD schedulers, analyzed theoretical properties to identify B_min (minimum batch size threshold) and B_opt (optimal batch size for data efficiency), and proposed a dynamic Batch Size Scheduler based on these insights.

Result: The revised formula accurately captures large-scale pre-training dynamics with WSD schedulers, and the proposed scheduling strategy significantly enhances both training efficiency and final model quality in extensive experiments.

Conclusion: The paper successfully bridges the theory-practice gap for modern pre-training by updating Critical Batch Size theory for WSD schedulers, providing practical tools (B_min, B_opt) and an effective dynamic batch size scheduling strategy.

Abstract: The concept of Critical Batch Size, as pioneered by OpenAI, has long served as a foundational principle for large-scale pre-training. However, with the paradigm shift towards the Warmup-Stable-Decay (WSD) learning rate scheduler, we observe that the original theoretical framework and its underlying mechanisms fail to align with new pre-training dynamics. To bridge this gap between theory and practice, this paper derives a revised E(S) relationship tailored for WSD scheduler, characterizing the trade-off between training data consumption E and steps S during pre-training. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens. Building upon these properties, we propose a dynamic Batch Size Scheduler. Extensive experiments demonstrate that our revised formula precisely captures the dynamics of large-scale pre-training, and the resulting scheduling strategy significantly enhances both training efficiency and final model quality.

[359] How to Set the Learning Rate for Large-Scale Pre-training?

Yunhua Zhou, Shuhao Xing, Junhao Huang, Xipeng Qiu, Qipeng Guo

Main category: cs.AI

TL;DR: The paper investigates optimal learning rate configuration for large-scale pre-training, comparing two paradigms: Fitting (using scaling laws to reduce search complexity) and Transfer (extending μTransfer to MoE architectures). It challenges μTransfer’s scalability in large-scale settings and provides practical guidelines.

Details

Motivation: Learning rate configuration is crucial but challenging in large-scale pre-training due to high training costs. The paper aims to determine if optimal learning rates can be extrapolated from low-cost experiments, addressing the trade-off between training efficiency and model performance.

Method: Two research paradigms: 1) Fitting Paradigm - introduces a Scaling Law for search factor to reduce search complexity from O(n³) to O(nC_DC_η) via predictive modeling. 2) Transfer Paradigm - extends μTransfer principles to Mixture of Experts (MoE) architecture, covering model depth, weight decay, and token horizons.

Result: Empirical results challenge the scalability of widely adopted μTransfer in large-scale pre-training scenarios. The paper shows that module-wise parameter tuning underperforms in large-scale settings, analyzed through training stability and feature learning perspectives.

Conclusion: Provides systematic practical guidelines and fresh theoretical perspective for optimizing industrial-level pre-training. The work offers insights into why certain transfer methods fail at scale and presents more scalable alternatives for learning rate configuration.

Abstract: Optimal configuration of the learning rate (LR) is a fundamental yet formidable challenge in large-scale pre-training. Given the stringent trade-off between training costs and model performance, the pivotal question is whether the optimal LR can be accurately extrapolated from low-cost experiments. In this paper, we formalize this investigation into two distinct research paradigms: Fitting and Transfer. Within the Fitting Paradigm, we innovatively introduce a Scaling Law for search factor, effectively reducing the search complexity from O(n^3) to O(nC_DC_η) via predictive modeling. Within the Transfer Paradigm, we extend the principles of $μ$Transfer to the Mixture of Experts (MoE) architecture, broadening its applicability to encompass model depth, weight decay, and token horizons. By pushing the boundaries of existing hyperparameter research in terms of scale, we conduct a comprehensive comparison between these two paradigms. Our empirical results challenge the scalability of the widely adopted $μ$ Transfer in large-scale pre-training scenarios. Furthermore, we provide a rigorous analysis through the dual lenses of training stability and feature learning to elucidate the underlying reasons why module-wise parameter tuning underperforms in large-scale settings. This work offers systematic practical guidelines and a fresh theoretical perspective for optimizing industrial-level pre-training.

[360] Large language models can effectively convince people to believe conspiracies

Thomas H. Costello, Kellin Pelrine, Matthew Kowal, Antonio A. Arechar, Jean-François Godbout, Adam Gleave, David Rand, Gordon Pennycook

Main category: cs.AI

TL;DR: GPT-4o is equally effective at increasing or decreasing conspiracy beliefs, with standard guardrails providing little protection against misinformation promotion.

Details

Motivation: To investigate whether LLMs' persuasive power advantages truth over falsehood, or if they can promote misbeliefs as easily as refute them, particularly in the context of conspiracy theories.

Method: Three pre-registered experiments with 2,724 Americans discussing uncertain conspiracy theories with GPT-4o, comparing “debunking” (arguing against) vs “bunking” (arguing for) conditions, including jailbroken and standard GPT-4o variants.

Result: Jailbroken GPT-4o was equally effective at increasing and decreasing conspiracy beliefs; bunking AI was rated more positively and increased trust more than debunking AI; standard GPT-4o produced similar effects despite guardrails; corrective conversations reversed induced beliefs; prompting for accurate information dramatically reduced misinformation promotion.

Conclusion: LLMs possess potent abilities to promote both truth and falsehood, but potential solutions exist (corrective conversations, accuracy prompts) to help mitigate misinformation risks.

Abstract: Large language models (LLMs) have been shown to be persuasive across a variety of context. But it remains unclear whether this persuasive power advantages truth over falsehood, or if LLMs can promote misbeliefs just as easily as refuting them. Here, we investigate this question across three pre-registered experiments in which participants (N = 2,724 Americans) discussed a conspiracy theory they were uncertain about with GPT-4o, and the model was instructed to either argue against (“debunking”) or for (“bunking”) that conspiracy. When using a “jailbroken” GPT-4o variant with guardrails removed, the AI was as effective at increasing conspiracy belief as decreasing it. Concerningly, the bunking AI was rated more positively, and increased trust in AI, more than the debunking AI. Surprisingly, we found that using standard GPT-4o produced very similar effects, such that the guardrails imposed by OpenAI did little to revent the LLM from promoting conspiracy beliefs. Encouragingly, however, a corrective conversation reversed these newly induced conspiracy beliefs, and simply prompting GPT-4o to only use accurate information dramatically reduced its ability to increase conspiracy beliefs. Our findings demonstrate that LLMs possess potent abilities to promote both truth and falsehood, but that potential solutions may exist to help mitigate this risk.

[361] Publishing FAIR and Machine-actionable Reviews in Materials Science: The Case for Symbolic Knowledge in Neuro-symbolic Artificial Intelligence

Jennifer D’Souza, Soren Auer, Eleni Poupaki, Alex Watkins, Anjana Devi, Riikka L. Puurunen, Bora Karasulu, Adrie Mackus, Erwin Kessels

Main category: cs.AI

TL;DR: This paper presents a case study in atomic layer deposition/etching (ALD/E) where review tables are published as FAIR, machine-actionable comparisons in the Open Research Knowledge Graph (ORKG), enabling structured, queryable knowledge from traditional narrative reviews.

Details

Motivation: Scientific reviews in materials science contain valuable insights but remain locked in narrative text and static PDF tables, limiting reuse by both humans and machines. There's a need to make this knowledge more accessible and actionable.

Method: The authors publish review tables as FAIR (Findable, Accessible, Interoperable, Reusable), machine-actionable comparisons in the Open Research Knowledge Graph (ORKG). They then contrast symbolic querying over ORKG with large language model-based querying approaches.

Result: The approach successfully transforms traditional review tables into structured, queryable knowledge in ORKG. The comparison reveals that symbolic querying provides more reliable results than LLM-based approaches alone.

Conclusion: A curated symbolic layer should remain the backbone of reliable neurosymbolic AI in materials science, with LLMs serving as complementary, symbolically grounded interfaces rather than standalone sources of truth.

Abstract: Scientific reviews are central to knowledge integration in materials science, yet their key insights remain locked in narrative text and static PDF tables, limiting reuse by humans and machines alike. This article presents a case study in atomic layer deposition and etching (ALD/E) where we publish review tables as FAIR, machine-actionable comparisons in the Open Research Knowledge Graph (ORKG), turning them into structured, queryable knowledge. Building on this, we contrast symbolic querying over ORKG with large language model-based querying, and argue that a curated symbolic layer should remain the backbone of reliable neurosymbolic AI in materials science, with LLMs serving as complementary, symbolically grounded interfaces rather than standalone sources of truth.

[362] Reinforced Efficient Reasoning via Semantically Diverse Exploration

Ziqi Zhao, Zhaochun Ren, Jiahong Zou, Liu Yang, Zhiwei Xu, Xuri Ge, Zhumin Chen, Xinyu Ma, Daiting Shi, Shuaiqiang Wang, Dawei Yin, Xin Xin

Main category: cs.AI

TL;DR: ROSE is a reinforcement learning method that improves LLM reasoning through semantic diversity exploration and efficient advantage estimation, outperforming existing RLVR approaches on math reasoning tasks.

Details

Motivation: Existing RLVR methods with MCTS extensions still suffer from limited exploration diversity and inefficient reasoning, needing better mechanisms for diverse reasoning paths and more efficient credit assignment.

Method: ROSE uses semantic-entropy-based branching to select points with high semantic divergence for new reasoning paths, ε-exploration for stochastic root-level rollouts, and length-aware segment-level advantage estimation to reward concise correct reasoning.

Result: Extensive experiments on mathematical reasoning benchmarks with Qwen and Llama models validate ROSE’s effectiveness and efficiency compared to existing methods.

Conclusion: ROSE successfully addresses exploration diversity and reasoning efficiency challenges in RLVR, providing a robust approach for improving LLM reasoning through semantically diverse explorations and efficient advantage estimation.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has proven effective in enhancing the reasoning of large language models (LLMs). Monte Carlo Tree Search (MCTS)-based extensions improve upon vanilla RLVR (e.g., GRPO) by providing tree-based reasoning rollouts that enable fine-grained and segment-level credit assignment. However, existing methods still suffer from limited exploration diversity and inefficient reasoning. To address the above challenges, we propose reinforced efficient reasoning via semantically diverse explorations, i.e., ROSE, for LLMs. To encourage more diverse reasoning exploration, our method incorporates a semantic-entropy-based branching strategy and an $\varepsilon$-exploration mechanism. The former operates on already sampled reasoning rollouts to capture semantic uncertainty and select branching points with high semantic divergence to generate new successive reasoning paths, whereas the latter stochastically initiates reasoning rollouts from the root, preventing the search process from becoming overly local. To improve efficiency, we design a length-aware segment-level advantage estimator that rewards concise and correct reasoning while penalizing unnecessarily long reasoning chains. Extensive experiments on various mathematical reasoning benchmarks with Qwen and Llama models validate the effectiveness and efficiency of ROSE. Codes are available at https://github.com/ZiqiZhao1/ROSE-rl.

[363] Chain-of-Sanitized-Thoughts: Plugging PII Leakage in CoT of Large Reasoning Models

Arghyadeep Das, Sai Sreenivas Chintha, Rishiraj Girmal, Kinjal Pandey, Sharvi Endait

Main category: cs.AI

TL;DR: Large Reasoning Models leak PII in chain-of-thought reasoning. The paper introduces methods to achieve private reasoning with minimal utility loss.

Details

Motivation: Chain-of-thought reasoning in Large Reasoning Models exposes personally identifiable information (PII) even when final answers are sanitized, creating serious privacy risks that need to be addressed.

Method: Introduces PII-CoT-Bench dataset with privacy-aware CoT annotations and category-balanced evaluation benchmark. Uses two approaches: prompt-based controls for state-of-the-art models and fine-tuning for weaker models to reduce PII leakage.

Result: Both approaches substantially reduce PII exposure with minimal degradation in utility. State-of-the-art models benefit most from prompt-based controls, while weaker models require fine-tuning for meaningful leakage reduction.

Conclusion: Private chain-of-thought reasoning can be achieved with minimal utility loss, providing practical guidance for building privacy-preserving reasoning systems.

Abstract: Large Reasoning Models (LRMs) improve performance, reliability, and interpretability by generating explicit chain-of-thought (CoT) reasoning, but this transparency introduces a serious privacy risk: intermediate reasoning often leaks personally identifiable information (PII) even when final answers are sanitized. We study how to induce privacy-first reasoning, where models reason without exposing sensitive information, using deployable interventions rather than post-hoc redaction. We introduce PII-CoT-Bench, a supervised dataset with privacy-aware CoT annotations, and a category-balanced evaluation benchmark covering realistic and adversarial leakage scenarios. Our results reveal a capability-dependent trend: state-of-the-art models benefit most from prompt-based controls, whereas weaker models require fine-tuning to achieve meaningful leakage reduction. Across models and categories, both approaches substantially reduce PII exposure with minimal degradation in utility, demonstrating that private reasoning can be achieved without sacrificing performance. Overall, we show that private CoT reasoning can be achieved with minimal utility loss, providing practical guidance for building privacy-preserving reasoning systems.

[364] Arabic Prompts with English Tools: A Benchmark

Konstantin Kubrak, Ahmed El-Moselhy, Ammar Alsulami, Remaz Altuwaim, Hassan Ismail Fawaz, Faisal Alsaby

Main category: cs.AI

TL;DR: First benchmark for evaluating Arabic LLMs’ tool-calling and agentic capabilities reveals significant performance gaps compared to English.

Details

Motivation: Arabic LLMs are advancing but lack proper evaluation benchmarks, especially for tool-calling capabilities. Current frameworks focus on English, leaving Arabic performance poorly understood despite models being pretrained on mostly English data.

Method: Introduces the first dedicated benchmark for evaluating tool-calling and agentic capabilities of LLMs in Arabic language, providing a standardized framework to measure functional accuracy and robustness in Arabic agentic workflows.

Result: Significant performance gap discovered: when users interact in Arabic, tool-calling accuracy drops by 5-10% on average, regardless of whether tool descriptions are in Arabic or English.

Conclusion: The benchmark highlights critical challenges in Arabic tool-calling performance and aims to foster development of more reliable and linguistically equitable AI agents for Arabic-speaking users.

Abstract: Large Language Models (LLMs) are now integral to numerous industries, increasingly serving as the core reasoning engine for autonomous agents that perform complex tasks through tool-use. While the development of Arabic-native LLMs is accelerating, the benchmarks for evaluating their capabilities lag behind, with most existing frameworks focusing on English. A critical and overlooked area is tool-calling, where the performance of models prompted in non-English languages like Arabic is poorly understood, especially since these models are often pretrained on predominantly English data. This paper addresses this critical gap by introducing the first dedicated benchmark for evaluating the tool-calling and agentic capabilities of LLMs in the Arabic language. Our work provides a standardized framework to measure the functional accuracy and robustness of models in Arabic agentic workflows. Our findings reveal a huge performance gap: when users interact in Arabic, tool-calling accuracy drops by an average of 5-10%, regardless of whether the tool descriptions themselves are in Arabic or English. By shedding light on these critical challenges, this benchmark aims to foster the development of more reliable and linguistically equitable AI agents for Arabic-speaking users.

[365] Token-Level LLM Collaboration via FusionRoute

Nuoya Xiong, Yuhang Zhou, Hanqing Zeng, Zhaorun Chen, Furong Huang, Shuchao Bi, Lizhu Zhang, Zhuokai Zhao

Main category: cs.AI

TL;DR: FusionRoute is a token-level multi-LLM collaboration framework that uses a lightweight router to select domain experts at each decoding step while also contributing complementary logits to refine expert outputs, overcoming limitations of pure expert-only routing.

Details

Motivation: Addresses the dilemma between expensive general-purpose large models and specialized but narrow domain models. Large models are expensive to train/deploy, while smaller specialized models struggle with generalization beyond their training distributions.

Method: Proposes FusionRoute with a lightweight router that simultaneously: (1) selects the most suitable expert at each decoding step, and (2) contributes complementary logits via logit addition to refine/correct the selected expert’s next-token distribution. Unlike pure expert-only routing, it combines expert selection with trainable complementary generation.

Result: Outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning across Llama-3 and Gemma-2 families on diverse benchmarks (mathematical reasoning, code generation, instruction following). Remains competitive with domain experts on their respective tasks.

Conclusion: FusionRoute provides an effective solution to the efficiency-generalization trade-off by enabling token-level multi-LLM collaboration that expands the policy class beyond pure expert routing, achieving strong performance while maintaining efficiency.

Abstract: Large language models (LLMs) exhibit strengths across diverse domains. However, achieving strong performance across these domains with a single general-purpose model typically requires scaling to sizes that are prohibitively expensive to train and deploy. On the other hand, while smaller domain-specialized models are much more efficient, they struggle to generalize beyond their training distributions. To address this dilemma, we propose FusionRoute, a robust and effective token-level multi-LLM collaboration framework in which a lightweight router simultaneously (i) selects the most suitable expert at each decoding step and (ii) contributes a complementary logit that refines or corrects the selected expert’s next-token distribution via logit addition. Unlike existing token-level collaboration methods that rely solely on fixed expert outputs, we provide a theoretical analysis showing that pure expert-only routing is fundamentally limited: unless strong global coverage assumptions hold, it cannot in general realize the optimal decoding policy. By augmenting expert selection with a trainable complementary generator, FusionRoute expands the effective policy class and enables recovery of optimal value functions under mild conditions. Empirically, across both Llama-3 and Gemma-2 families and diverse benchmarks spanning mathematical reasoning, code generation, and instruction following, FusionRoute outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning, while remaining competitive with domain experts on their respective tasks.

[366] Controllable Memory Usage: Balancing Anchoring and Innovation in Long-Term Human-Agent Interaction

Muzhao Tian, Zisu Huang, Xiaohua Wang, Jingwen Xu, Zhengkang Guo, Qi Qian, Yuanzhe Shen, Kaitao Song, Jiakang Yuan, Changze Lv, Xiaoqing Zheng

Main category: cs.AI

TL;DR: SteeM framework enables dynamic control of LLM agent memory reliance, balancing between innovation (fresh-start) and consistency (high-fidelity) to avoid memory anchoring while utilizing interaction history.

Details

Motivation: Current LLM-based agents use an "all-or-nothing" memory approach, causing either memory anchoring (trapped by past interactions) or under-utilization of important history. There's a need for nuanced memory control in long-term human-agent interactions.

Method: Introduces SteeM (Steerable Memory Agent) framework with: 1) behavioral metric for memory dependence to quantify past interaction influence, and 2) user-controllable memory reliance regulation ranging from fresh-start to high-fidelity modes.

Result: Experiments across scenarios show SteeM consistently outperforms conventional prompting and rigid memory masking strategies, providing more nuanced and effective control for personalized human-agent collaboration.

Conclusion: Memory reliance can be modeled as an explicit, user-controllable dimension, enabling dynamic regulation that avoids memory anchoring while effectively utilizing interaction history for personalized agent behavior.

Abstract: As LLM-based agents are increasingly used in long-term interactions, cumulative memory is critical for enabling personalization and maintaining stylistic consistency. However, most existing systems adopt an ``all-or-nothing’’ approach to memory usage: incorporating all relevant past information can lead to \textit{Memory Anchoring}, where the agent is trapped by past interactions, while excluding memory entirely results in under-utilization and the loss of important interaction history. We show that an agent’s reliance on memory can be modeled as an explicit and user-controllable dimension. We first introduce a behavioral metric of memory dependence to quantify the influence of past interactions on current outputs. We then propose \textbf{Stee}rable \textbf{M}emory Agent, \texttt{SteeM}, a framework that allows users to dynamically regulate memory reliance, ranging from a fresh-start mode that promotes innovation to a high-fidelity mode that closely follows interaction history. Experiments across different scenarios demonstrate that our approach consistently outperforms conventional prompting and rigid memory masking strategies, yielding a more nuanced and effective control for personalized human-agent collaboration.

[367] GlimpRouter: Efficient Collaborative Inference by Glimpsing One Token of Thoughts

Wenhao Zeng, Xuteng Zhang, Yuling Shi, Chao Hu, Yuting Chen, Beijun Shen, Xiaodong Gu

Main category: cs.AI

TL;DR: GlimpRouter: A training-free step-wise collaboration framework that uses initial token entropy to predict reasoning step difficulty and route between small/large models, reducing latency while maintaining accuracy.

Details

Motivation: Large Reasoning Models (LRMs) have high inference latency and computational costs due to explicit multi-step reasoning chains. Collaborative inference between lightweight and large models could help, but existing routing strategies introduce significant overhead through local token probabilities or post-hoc verification.

Method: Proposes GlimpRouter based on the insight that reasoning step difficulty can be predicted from the first token’s entropy (the “Aha Moment” phenomenon). Uses a lightweight model to generate only the first token of each reasoning step, then routes to a larger model only when initial token entropy exceeds a threshold. This is a training-free framework.

Result: Significantly reduces inference latency while preserving accuracy across multiple benchmarks. Achieves 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to standalone large model on AIME25.

Conclusion: Initial token entropy serves as a strong predictor of reasoning step difficulty, enabling efficient step-wise collaboration. GlimpRouter demonstrates that computation can be effectively allocated based on a “glimpse of thought” rather than full-step evaluation, offering a simple yet effective mechanism for reasoning.

Abstract: Large Reasoning Models (LRMs) achieve remarkable performance by explicitly generating multi-step chains of thought, but this capability incurs substantial inference latency and computational cost. Collaborative inference offers a promising solution by selectively allocating work between lightweight and large models, yet a fundamental challenge remains: determining when a reasoning step requires the capacity of a large model or the efficiency of a small model. Existing routing strategies either rely on local token probabilities or post-hoc verification, introducing significant inference overhead. In this work, we propose a novel perspective on step-wise collaboration: the difficulty of a reasoning step can be inferred from its very first token. Inspired by the “Aha Moment” phenomenon in LRMs, we show that the entropy of the initial token serves as a strong predictor of step difficulty. Building on this insight, we introduce GlimpRouter, a training-free step-wise collaboration framework. GlimpRouter employs a lightweight model to generate only the first token of each reasoning step and routes the step to a larger model only when the initial token entropy exceeds a threshold. Experiments on multiple benchmarks demonstrate that our approach significantly reduces inference latency while preserving accuracy. For instance, GlimpRouter attains a substantial 10.7% improvement in accuracy while reducing inference latency by 25.9% compared to a standalone large model on AIME25. These results suggest a simple yet effective mechanism for reasoning: allocating computation based on a glimpse of thought rather than full-step evaluation.

[368] Evaluative Fingerprints: Stable and Systematic Differences in LLM Evaluator Behavior

Wajid Nasser

Main category: cs.AI

TL;DR: LLM judges show near-zero inter-judge agreement but high self-consistency, creating a reliability paradox where judges function as distinct measurement devices with unique evaluative dispositions rather than interchangeable instruments.

Details

Motivation: To investigate the consistency and reliability of LLM-as-judge systems for evaluation tasks, examining whether they provide scalable, consistent assessment as promised.

Method: Conducted 3,240 evaluations using 9 judges across 120 unique video pack items with 3 independent runs. Measured inter-judge agreement using Krippendorff’s α, built classifiers to identify judges from rubric scores, and characterized evaluative dispositions along multiple axes including harshness/leniency, dimension emphasis, within-judge stability (ICC), and evidence behavior metrics.

Result: Inter-judge agreement is near-zero (α = 0.042), with some dimensions showing worse-than-random disagreement (α < 0). Judges can be identified with 77.1% accuracy from rubric scores alone (89.9% with disposition features), and within model families, GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy. Each judge implements a distinct, stable “evaluative disposition.”

Conclusion: LLM judges are not interchangeable instruments measuring a shared construct but distinct measurement devices encoding their own implicit theories of quality. Averaging their scores produces synthetic verdicts that don’t correspond to any judge’s actual values, creating a reliability paradox where judges are self-consistent but not with each other.

Abstract: LLM-as-judge systems promise scalable, consistent evaluation. We find the opposite: judges are consistent, but not with each other; they are consistent with themselves. Across 3,240 evaluations (9 judges x 120 unique video x pack items x 3 independent runs), inter-judge agreement is near-zero (Krippendorff’s α = 0.042). On two dimensions, judges disagree more than random noise would predict (α < 0). Yet this disagreement isn’t chaos; it’s structured. A classifier identifies which judge produced an evaluation with 77.1% accuracy from rubric scores alone, rising to 89.9% with disposition features. Within model families, the signal is even stronger: GPT-4.1 and GPT-5.2 are distinguishable with 99.6% accuracy. We call this the reliability paradox: judges cannot agree on what constitutes quality, yet their disagreement patterns are so stable they function as fingerprints. Each judge implements a distinct, stable theory of quality: an “evaluative disposition” that shapes how it interprets any rubric. We characterize these dispositions along multiple axes: harshness/leniency, dimension emphasis, within-judge stability (ICC), and evidence behavior (receipt validity, semantic linkage via NLI, and shotgun index). The implication is stark: LLM judges are not interchangeable instruments measuring a shared construct. They are distinct measurement devices, each encoding its own implicit theory of quality. Averaging their scores produces a synthetic verdict that corresponds to no judge’s actual values.

[369] Learning Latent Action World Models In The Wild

Quentin Garrido, Tushar Nagarajan, Basile Terver, Nicolas Ballas, Yann LeCun, Michael Rabbat

Main category: cs.AI

TL;DR: Learning latent action world models from in-the-wild videos without action labels, enabling real-world reasoning and planning.

Details

Motivation: Real-world agents need to predict action consequences, but world models typically require action labels that are hard to obtain at scale. In-the-wild videos offer rich action data but present challenges like environmental noise and lack of common embodiment.

Method: Proposes learning latent action models from videos alone, using continuous but constrained latent actions (instead of vector quantization). Discusses necessary action properties, architectural choices, and evaluations for handling video diversity. Includes training a controller to map known actions to latent ones for planning tasks.

Result: Continuous constrained latent actions successfully capture complex actions from in-the-wild videos, enabling transfer of agent-induced environmental changes across videos. Latent actions become spatially localized relative to camera due to lack of common embodiment. Controller allows using latent actions as universal interface for planning with similar performance to action-conditioned baselines.

Conclusion: The work demonstrates progress toward scaling latent action models to real-world applications by successfully learning from diverse in-the-wild videos, providing a foundation for agents to reason and plan without requiring explicit action labels.

Abstract: Agents capable of reasoning and planning in the real world require the ability of predicting the consequences of their actions. While world models possess this capability, they most often require action labels, that can be complex to obtain at scale. This motivates the learning of latent action models, that can learn an action space from videos alone. Our work addresses the problem of learning latent actions world models on in-the-wild videos, expanding the scope of existing works that focus on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from the video diversity, such as environmental noise, or the lack of a common embodiment across videos. To address some of the challenges, we discuss properties that actions should follow as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions are able to capture the complexity of actions from in-the-wild videos, something that the common vector quantization does not. We for example find that changes in the environment coming from agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we are mainly able to learn latent actions that become localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model with similar performance as action-conditioned baselines. Our analyses and experiments provide a step towards scaling latent action models to the real world.

[370] Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models

Shuliang Liu, Xingyu Li, Hongyi Liu, Yibo Yan, Bingchen Duan, Qi Zheng, Dong Fang, Lingfeng Su, Xuming Hu

Main category: cs.AI

TL;DR: ReasonMark is a novel watermarking framework for reasoning LLMs that decouples generation into undisturbed thinking and watermarked answering phases, using semantic guidance to preserve logical coherence while maintaining robustness.

Details

Motivation: Existing watermarking methods for reasoning LLMs either disrupt logical coherence (token-based approaches) or introduce high computational costs (semantic-aware approaches), creating challenges for deploying reasoning LLMs in traceable and trustworthy real-world applications.

Method: Decouples generation into Thinking Phase (undisturbed) and Answering Phase (watermarked). Uses Criticality Score to identify semantically pivotal tokens from reasoning traces, distills them into Principal Semantic Vector (PSV), and applies semantically-adaptive watermarking that modulates strength based on token-PSV alignment.

Result: Outperforms state-of-the-art methods: reduces text Perplexity by 0.35, increases translation BLEU score by 0.164, raises mathematical accuracy by 0.67 points, achieves 0.34% higher watermark detection AUC, stronger robustness to attacks, with negligible latency increase.

Conclusion: ReasonMark enables traceable and trustworthy deployment of reasoning LLMs in real-world applications by preserving logical integrity while maintaining watermark robustness and efficiency.

Abstract: Reasoning Large Language Models (RLLMs) excelling in complex tasks present unique challenges for digital watermarking, as existing methods often disrupt logical coherence or incur high computational costs. Token-based watermarking techniques can corrupt the reasoning flow by applying pseudo-random biases, while semantic-aware approaches improve quality but introduce significant latency or require auxiliary models. This paper introduces ReasonMark, a novel watermarking framework specifically designed for reasoning-intensive LLMs. Our approach decouples generation into an undisturbed Thinking Phase and a watermarked Answering Phase. We propose a Criticality Score to identify semantically pivotal tokens from the reasoning trace, which are distilled into a Principal Semantic Vector (PSV). The PSV then guides a semantically-adaptive mechanism that modulates watermark strength based on token-PSV alignment, ensuring robustness without compromising logical integrity. Extensive experiments show ReasonMark surpasses state-of-the-art methods by reducing text Perplexity by 0.35, increasing translation BLEU score by 0.164, and raising mathematical accuracy by 0.67 points. These advancements are achieved alongside a 0.34% higher watermark detection AUC and stronger robustness to attacks, all with a negligible increase in latency. This work enables the traceable and trustworthy deployment of reasoning LLMs in real-world applications.

[371] Observations and Remedies for Large Language Model Bias in Self-Consuming Performative Loop

Yaxuan Wang, Zhongteng Cai, Yujia Bao, Xueru Zhang, Yang Liu

Main category: cs.AI

TL;DR: The paper introduces Self-Consuming Performative Loop (SCPL) to study how synthetic data from LLMs creates feedback loops that amplify biases, and proposes reward-based rejection sampling to mitigate these biases.

Details

Motivation: As LLMs are increasingly trained on their own synthetic outputs, this creates self-consuming retraining loops that can cause performance degradation and emerging biases. Real-world deployment creates dynamic systems where user feedback influences future training data, potentially exacerbating biases against underserved groups.

Method: Introduces the SCPL framework to study bias evolution in controlled performative feedback settings. Examines two training loops: typical retraining and incremental fine-tuning. Conducts experiments on three real-world tasks and designs a reward-based rejection sampling strategy to mitigate biases.

Result: The performative loop increases preference bias but decreases disparate bias. The proposed reward-based rejection sampling strategy effectively mitigates these biases, moving toward more trustworthy self-improving systems.

Conclusion: Self-consuming training loops with synthetic data create complex bias dynamics that need careful management. The proposed framework and mitigation strategy provide tools to analyze and address bias evolution in iterative LLM training systems.

Abstract: The rapid advancement of large language models (LLMs) has led to growing interest in using synthetic data to train future models. However, this creates a self-consuming retraining loop, where models are trained on their own outputs and may cause performance drops and induce emerging biases. In real-world applications, previously deployed LLMs may influence the data they generate, leading to a dynamic system driven by user feedback. For example, if a model continues to underserve users from a group, less query data will be collected from this particular demographic of users. In this study, we introduce the concept of \textbf{S}elf-\textbf{C}onsuming \textbf{P}erformative \textbf{L}oop (\textbf{SCPL}) and investigate the role of synthetic data in shaping bias during these dynamic iterative training processes under controlled performative feedback. This controlled setting is motivated by the inaccessibility of real-world user preference data from dynamic production systems, and enables us to isolate and analyze feedback-driven bias evolution in a principled manner. We focus on two types of loops, including the typical retraining setting and the incremental fine-tuning setting, which is largely underexplored. Through experiments on three real-world tasks, we find that the performative loop increases preference bias and decreases disparate bias. We design a reward-based rejection sampling strategy to mitigate the bias, moving towards more trustworthy self-improving systems.

[372] SimuAgent: An LLM-Based Simulink Modeling Assistant Enhanced with Reinforcement Learning

Yanchang Liang, Xiaowei Zhao

Main category: cs.AI

TL;DR: SimuAgent is an LLM-powered agent for Simulink modeling that uses a concise Python representation instead of verbose XML, employs a two-stage training approach with Reflection-GRPO for sparse reward tasks, and achieves better performance than GPT-4o on the SimuBench benchmark while running on-premise.

Details

Motivation: LLMs have transformed text-based code automation but their application to graph-oriented engineering workflows like Simulink modeling remains under-explored. There's a need for AI-assisted engineering design tools that can handle complex graphical modeling environments while being privacy-preserving and cost-effective for industrial use.

Method: 1) Replaces verbose XML with concise dictionary-style Python representation to reduce token counts and improve interpretability. 2) Uses lightweight plan-execute architecture with two-stage training (low-level tool skills then high-level design reasoning). 3) Proposes Reflection-GRPO (ReGRPO) that augments GRPO with self-reflection traces to provide intermediate feedback for sparse reward tasks. 4) Employs abstract-reconstruct data augmentation and curriculum learning. 5) Evaluated on SimuBench benchmark with 5300 multi-domain modeling tasks.

Result: Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines. It even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. The system trains and runs entirely on-premise with modest hardware, offering privacy-preserving, cost-effective solution.

Conclusion: SimuAgent successfully bridges the gap between LLMs and graphical modeling environments, providing a practical solution for AI-assisted engineering design in industrial settings. The approach demonstrates that specialized LLM agents can outperform general-purpose models like GPT-4o on domain-specific engineering tasks while maintaining privacy and cost-effectiveness.

Abstract: Large language models (LLMs) have revolutionized text-based code automation, but their potential in graph-oriented engineering workflows remains under-explored. We introduce SimuAgent, an LLM-powered modeling and simulation agent tailored for Simulink. SimuAgent replaces verbose XML with a concise, dictionary-style Python representation, dramatically cutting token counts, improving interpretability, and enabling fast, in-process simulation. A lightweight plan-execute architecture, trained in two stages, equips the agent with both low-level tool skills and high-level design reasoning. To tackle sparse rewards in long-horizon tasks, we propose Reflection-GRPO (ReGRPO), which augments Group Relative Policy Optimization (GRPO) with self-reflection traces that supply rich intermediate feedback, accelerating convergence and boosting robustness. Experiments on SimuBench, our newly released benchmark comprising 5300 multi-domain modeling tasks, show that a Qwen2.5-7B model fine-tuned with SimuAgent converges faster and achieves higher modeling accuracy than standard RL baselines, and even surpasses GPT-4o when evaluated with few-shot prompting on the same benchmark. Ablations confirm that the two-stage curriculum and abstract-reconstruct data augmentation further enhance generalization. SimuAgent trains and runs entirely on-premise with modest hardware, delivering a privacy-preserving, cost-effective solution for industrial model-driven engineering. SimuAgent bridges the gap between LLMs and graphical modeling environments, offering a practical solution for AI-assisted engineering design in industrial settings.

[373] Stock Market Price Prediction using Neural Prophet with Deep Neural Network

Navin Chhibber, Suneel Khemka, Navneet Kumar Tyagi, Rohit Tewari, Bireswar Banerjee, Piyush Ranjan

Main category: cs.AI

TL;DR: Proposes Neural Prophet with Deep Neural Network (NP-DNN) for stock price prediction, achieving 99.21% accuracy using Z-score normalization, missing value imputation, and MLP for complex pattern learning.

Details

Motivation: Existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices, creating a need for more accurate forecasting methods.

Method: Uses Neural Prophet with Deep Neural Network (NP-DNN) with Z-score normalization for preprocessing, missing value imputation, and Multi-Layer Perceptron (MLP) to learn complex nonlinear relationships and extract hidden patterns from stock price data.

Result: The proposed NP-DNN model achieved 99.21% accuracy, outperforming other approaches including the Fused Large Language Model.

Conclusion: NP-DNN is an effective approach for stock market price prediction that addresses limitations of traditional statistical methods and demonstrates superior accuracy in forecasting.

Abstract: Stock market price prediction is a significant interdisciplinary research domain that depends at the intersection of finance, statistics, and economics. Forecasting Accurately predicting stock prices has always been a focal point for various researchers. However, existing statistical approaches for time-series prediction often fail to effectively forecast the probability range of future stock prices. Hence, to solve this problem, the Neural Prophet with a Deep Neural Network (NP-DNN) is proposed to predict stock market prices. The preprocessing technique used in this research is Z-score normalization, which normalizes stock price data by removing scale differences, making patterns easier to detect. Missing value imputation fills gaps in historical data, enhancing the models use of complete information for more accurate predictions. The Multi-Layer Perceptron (MLP) learns complex nonlinear relationships among stock market prices and extracts hidden patterns from the input data, thereby creating meaningful feature representations for better prediction accuracy. The proposed NP-DNN model achieved an accuracy of 99.21% compared with other approaches using the Fused Large Language Model. Keywords: deep neural network, forecasting stock prices, multi-layer perceptron, neural prophet, stock market price prediction.

[374] Internal Representations as Indicators of Hallucinations in Agent Tool Selection

Kait Healy, Bharathi Srinivasan, Visakh Madathil, Jing Wu

Main category: cs.AI

TL;DR: Real-time hallucination detection for LLM tool calling using internal representations during generation, achieving 86.4% accuracy with minimal computational overhead.

Details

Motivation: LLMs suffer from tool-calling hallucinations (incorrect tool selection, malformed parameters, tool bypass) which undermine reliability, bypass security controls, and require early detection without multiple forward passes or external validation.

Method: Computationally efficient framework that detects tool-calling hallucinations in real-time by leveraging LLMs’ internal representations during the same forward pass used for generation, without requiring multiple forward passes.

Result: Strong detection performance up to 86.4% accuracy on reasoning tasks across multiple domains, maintaining real-time inference with minimal computational overhead, particularly effective at detecting parameter-level hallucinations and inappropriate tool selections.

Conclusion: The framework enables reliable LLM agent deployment by providing efficient real-time hallucination detection for tool calling, addressing critical reliability and security concerns in production systems.

Abstract: Large Language Models (LLMs) have shown remarkable capabilities in tool calling and tool usage, but suffer from hallucinations where they choose incorrect tools, provide malformed parameters and exhibit ’tool bypass’ behavior by performing simulations and generating outputs instead of invoking specialized tools or external systems. This undermines the reliability of LLM based agents in production systems as it leads to inconsistent results, and bypasses security and audit controls. Such hallucinations in agent tool selection require early detection and error handling. Unlike existing hallucination detection methods that require multiple forward passes or external validation, we present a computationally efficient framework that detects tool-calling hallucinations in real-time by leveraging LLMs’ internal representations during the same forward pass used for generation. We evaluate this approach on reasoning tasks across multiple domains, demonstrating strong detection performance (up to 86.4% accuracy) while maintaining real-time inference capabilities with minimal computational overhead, particularly excelling at detecting parameter-level hallucinations and inappropriate tool selections, critical for reliable agent deployment.

[375] MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents

Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel

Main category: cs.AI

TL;DR: MineNPC-Task is a user-authored benchmark and evaluation framework for testing memory-aware LLM agents in Minecraft, featuring real player-derived tasks with explicit dependencies and machine-checkable validators.

Details

Motivation: To create a more realistic and comprehensive evaluation framework for memory-aware LLM agents in open-world environments like Minecraft, moving beyond synthetic prompts and capturing real player interactions and dependencies.

Method: Tasks are elicited from formative and summative co-play with expert Minecraft players, normalized into parametric templates with explicit preconditions and dependencies, and paired with machine-checkable validators under a bounded-knowledge policy that prevents out-of-world shortcuts.

Result: Initial evaluation with GPT-4o across 216 subtasks from 8 experienced players revealed recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, but also showed recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively while highlighting the need for stronger memory persistence.

Conclusion: The framework provides a transparent, reproducible evaluation system for future memory-aware embodied agents, with the complete task suite, validators, logs, and harness released to support further research in this area.

Abstract: We present \textsc{MineNPC-Task}, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world \emph{Minecraft}. Rather than relying on synthetic prompts, tasks are elicited from formative and summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependency structure, and paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan/act/memory events-including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts and reports outcomes relative to the total number of attempted subtasks, derived from in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate \textbf{216} subtasks across \textbf{8} experienced players. We observe recurring breakdown patterns in code execution, inventory/tool handling, referencing, and navigation, alongside recoveries supported by mixed-initiative clarifications and lightweight memory. Participants rated interaction quality and interface usability positively, while highlighting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and harness to support transparent, reproducible evaluation of future memory-aware embodied agents.

[376] Calculating Ultra-Strong and Extended Solutions for Nine Men’s Morris, Morabaraba, and Lasker Morris

Gábor E. Gévay, Gábor Danner

Main category: cs.AI

TL;DR: Extended strong solutions for Nine Men’s Morris, Lasker Morris, and Morabaraba variants, plus ultra-strong solving algorithm using multi-valued retrograde analysis.

Details

Motivation: To extend strong solutions beyond standard starting positions for three Morris variants and develop ultra-strong solving algorithms that outperform random optimal move selection against fallible opponents.

Method: Multi-valued retrograde analysis algorithm for calculating game-theoretic values of all possible game states from various starting positions, including non-standard stone counts.

Result: Extended strong solutions for Nine Men’s Morris and Lasker Morris, plus first solution for Morabaraba showing most equal-stone starting positions are first-player wins (unlike the other variants which are draws).

Conclusion: The multi-valued retrograde analysis enables ultra-strong solutions that achieve better results against fallible opponents than traditional strong solutions, with Morabaraba showing different strategic characteristics from other Morris variants.

Abstract: The strong solutions of Nine Men’s Morris and its variant, Lasker Morris are well-known results (the starting positions are draws). We re-examined both of these games, and calculated extended strong solutions for them. By this we mean the game-theoretic values of all possible game states that could be reached from certain starting positions where the number of stones to be placed by the players is different from the standard rules. These were also calculated for a previously unsolved third variant, Morabaraba, with interesting results: most of the starting positions where the players can place an equal number of stones (including the standard starting position) are wins for the first player (as opposed to the above games, where these are usually draws). We also developed a multi-valued retrograde analysis, and used it as a basis for an algorithm for solving these games ultra-strongly. This means that when our program is playing against a fallible opponent, it has a greater chance of achieving a better result than the game-theoretic value, compared to randomly selecting between “just strongly” optimal moves. Previous attempts on ultra-strong solutions used local heuristics or learning during games, but we incorporated our algorithm into the retrograde analysis.

[377] Talking with Tables for Better LLM Factual Data Interactions

Jio Oh, Geon Heo, Seungjun Oh, Hyunjin Kim, JinYeong Bak, Jindong Wang, Xing Xie, Steven Euijong Whang

Main category: cs.AI

TL;DR: Using tabular structures in LLM interactions yields 40.29% average performance gain for information retrieval and data manipulation tasks compared to other structures.

Details

Motivation: LLMs struggle with real-world information retrieval and data manipulation requests that involve multiple conditions. There's a need for more effective ways to handle factual data operations in LLM applications.

Method: Leveraging tabular structures in LLM interactions, comparing them against other structures (knowledge graphs, JSON, blended structured text). Using attention-value analysis to understand why tables work better, and evaluating text-to-table conversion for unstructured sources.

Result: Tabular structures provide 40.29% average performance gain with better robustness and token efficiency. Tables help LLMs better locate relevant information. Tables offer the best balance between efficiency and effectiveness compared to other structures.

Conclusion: Tabular representations have untapped potential for future LLM applications, remaining robust to task complexity and adaptable to unstructured sources through conversion.

Abstract: Large Language Models (LLMs) often struggle with requests related to information retrieval and data manipulation that frequently arise in real-world scenarios under multiple conditions. In this paper, we demonstrate that leveraging tabular structures in LLM interactions, is more effective than utilizing other structures for handling prevalent requests that operate over factual data. Through comprehensive evaluations across various scenarios and request types, we show that providing tabular structures yields a 40.29% average performance gain along with better robustness and token efficiency. Through attention-value analysis, we discover that tables help LLMs better locate relevant information, explaining these improvements. Beyond tables and text, we evaluate whether (1) blending structuredness within text, such as providing templates or fixing the order of attributes, and (2) other representative structures, such as knowledge graphs and JSON are helpful. We observe that utilizing tables offers the best balance between efficiency and effectiveness. The method remains robust to task complexity and adapts to unstructured sources through text-to-table conversion. Overall, we highlight the untapped potential of tabular representations for future LLM applications.

[378] Beyond Retrieval: Improving Evidence Quality for LLM-based Multimodal Fact-Checking

Haoran Ou, Gelei Deng, Xingshuo Han, Jie Zhang, Han Qiu, Shangwei Guo, Tianwei Zhang

Main category: cs.AI

TL;DR: Aletheia is an end-to-end framework for automated multimodal fact-checking that improves evidence retrieval quality and coverage, achieving up to 30.8% higher accuracy than existing methods.

Details

Motivation: The paper addresses the challenge of multimodal disinformation where deceptive claims use coordinated text and visual content. While LLMs and retrieval-augmented frameworks show promise for automated fact-checking, they suffer from poor external search coverage and evidence quality evaluation.

Method: Proposes Aletheia, an end-to-end framework with a novel evidence retrieval strategy that improves evidence coverage and filters useless information from open-domain sources to extract high-quality evidence for verification.

Result: Aletheia achieves 88.3% accuracy on two public multimodal disinformation datasets and 90.2% on newly emerging claims. Compared to existing evidence retrieval strategies, it improves verification accuracy by up to 30.8%.

Conclusion: The framework demonstrates the critical role of evidence quality in LLM-based disinformation verification and provides an effective solution for automated multimodal fact-checking with improved evidence retrieval.

Abstract: The increasing multimodal disinformation, where deceptive claims are reinforced through coordinated text and visual content, poses significant challenges to automated fact-checking. Recent efforts leverage Large Language Models (LLMs) for this task, capitalizing on their strong reasoning and multimodal understanding capabilities. Emerging retrieval-augmented frameworks further equip LLMs with access to open-domain external information, enabling evidence-based verification beyond their internal knowledge. Despite their promising gains, our empirical study reveals notable shortcomings in the external search coverage and evidence quality evaluation. To mitigate those limitations, we propose Aletheia, an end-to-end framework for automated multimodal fact-checking. It introduces a novel evidence retrieval strategy that improves evidence coverage and filters useless information from open-domain sources, enabling the extraction of high-quality evidence for verification. Extensive experiments demonstrate that Aletheia achieves an accuracy of 88.3% on two public multimodal disinformation datasets and 90.2% on newly emerging claims. Compared with existing evidence retrieval strategies, our approach improves verification accuracy by up to 30.8%, highlighting the critical role of evidence quality in LLM-based disinformation verification.

[379] TabularMath: Understanding Math Reasoning over Tables with Large Language Models

Shi-Yu Tian, Zhi Zhou, Wei Dong, Kun-Yang Yu, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li

Main category: cs.AI

TL;DR: AutoT2T framework transforms math word problems into scalable tabular reasoning tasks, creating TabularMath benchmark to evaluate LLMs on real-world table reasoning with complexity, quality, and representation variations.

Details

Motivation: Real-world applications like business intelligence require multi-step numerical reasoning with tables and robustness to incomplete/inconsistent information, but current evaluation is limited by manually collected tables and lack of coverage for real-world traps.

Method: Propose AutoT2T, a neuro-symbolic framework that controllably transforms math word problems into scalable and verified tabular reasoning tasks, then develop TabularMath benchmark with four subsets covering text-based and image-based tables across complexity, quality, and representation dimensions.

Result: Three key findings: (1) Table complexity and reasoning difficulty jointly impact performance; (2) Low-quality tables pose severe risks to reliable reasoning in current LLMs; (3) Different table modalities show similar trends, with text-based tables typically easier for models.

Conclusion: The work addresses the gap in evaluating tabular reasoning for real-world applications, providing a scalable benchmark and revealing critical insights about LLM performance on tabular data with varying complexity, quality, and representation.

Abstract: Mathematical reasoning has long been a key benchmark for evaluating large language models. Although substantial progress has been made on math word problems, the need for reasoning over tabular data in real-world applications has been overlooked. For instance, applications such as business intelligence demand not only multi-step numerical reasoning with tables but also robustness to incomplete or inconsistent information. However, comprehensive evaluation in this area is severely limited, constrained by the reliance on manually collected tables that are difficult to scale and the lack of coverage for potential traps encountered in real-world scenarios. To address this problem, we propose AutoT2T, a neuro-symbolic framework that controllably transforms math word problems into scalable and verified tabular reasoning tasks. Building on this pipeline, we develop TabularMath, a benchmark comprising four subsets that include both text-based and image-based tables, covering table complexity, table quality, and table representation dimensions. Our study reveals three key observations: (1) Table complexity and reasoning difficulty impact reasoning performance jointly; (2) Low-quality tables pose severe risks to reliable reasoning in current LLMs; (3) Different table modalities show similar trends, with text-based tables typically being easier for models to reason over. In-depth analyses are conducted for each observation to guide future research.

[380] HGMF: A Hierarchical Gaussian Mixture Framework for Scalable Tool Invocation within the Model Context Protocol

Wenpeng Xing, Zhipeng Chen, Changting Lin, Meng Han

Main category: cs.AI

TL;DR: HGMF is a hierarchical Gaussian mixture framework that improves LLM tool selection accuracy by pruning irrelevant options through probabilistic clustering and filtering in a unified semantic space.

Details

Motivation: Large Language Models struggle with selecting correct tools from large, hierarchical libraries due to limited context windows and noise from irrelevant options, leading to low accuracy and high computational costs.

Method: HGMF maps queries and tool descriptions into a unified semantic space, then uses two-stage hierarchical clustering: first clusters servers with Gaussian Mixture Model and filters by query likelihood, then repeats the process for tools within selected servers to produce a compact candidate set.

Result: Experiments on public datasets show HGMF significantly improves tool selection accuracy while reducing inference latency, confirming scalability and effectiveness for large-scale tool libraries.

Conclusion: HGMF provides an effective probabilistic pruning method for scalable tool invocation that addresses the challenges of large hierarchical tool libraries, improving both accuracy and efficiency.

Abstract: Invoking external tools enables Large Language Models (LLMs) to perform complex, real-world tasks, yet selecting the correct tool from large, hierarchically-structured libraries remains a significant challenge. The limited context windows of LLMs and noise from irrelevant options often lead to low selection accuracy and high computational costs. To address this, we propose the Hierarchical Gaussian Mixture Framework (HGMF), a probabilistic pruning method for scalable tool invocation. HGMF first maps the user query and all tool descriptions into a unified semantic space. The framework then operates in two stages: it clusters servers using a Gaussian Mixture Model (GMM) and filters them based on the query’s likelihood. Subsequently, it applies the same GMM-based clustering and filtering to the tools associated with the selected servers. This hierarchical process produces a compact, high-relevance candidate set, simplifying the final selection task for the LLM. Experiments on a public dataset show that HGMF significantly improves tool selection accuracy while reducing inference latency, confirming the framework’s scalability and effectiveness for large-scale tool libraries.

[381] Improving and Evaluating Open Deep Research Agents

Doaa Allabadi, Kyle Bradbury, Jordan M. Malof

Main category: cs.AI

TL;DR: ODR+ achieves 10% success rate on BC-Small benchmark, outperforming both open-source and proprietary deep research agents which scored 0%.

Details

Motivation: Most Deep Research Agents (DRAs) are proprietary closed-source systems, limiting research accessibility. Only one open-source DRA (ODR) exists, but its performance compared to proprietary systems is unknown.

Method: Adapted BrowseComp benchmark to create BrowseComp-Small (BC-Small) as a computationally-tractable benchmark. Compared ODR to Anthropic and Google proprietary systems, then introduced three strategic improvements to create ODR+.

Result: All three baseline systems (ODR, Anthropic, Google) achieved 0% accuracy on BC-Small test set. ODR+ achieved 10% success rate, becoming state-of-the-art among both open-source and closed-source systems.

Conclusion: Open-source DRAs can be improved to compete with proprietary systems through strategic enhancements. BC-Small provides a practical benchmark for academic research on DRAs.

Abstract: We focus here on Deep Research Agents (DRAs), which are systems that can take a natural language prompt from a user, and then autonomously search for, and utilize, internet-based content to address the prompt. Recent DRAs have demonstrated impressive capabilities on public benchmarks however, recent research largely involves proprietary closed-source systems. At the time of this work, we only found one open-source DRA, termed Open Deep Research (ODR). In this work we adapt the challenging recent BrowseComp benchmark to compare ODR to existing proprietary systems. We propose BrowseComp-Small (BC-Small), comprising a subset of BrowseComp, as a more computationally-tractable DRA benchmark for academic labs. We benchmark ODR and two other proprietary systems on BC-Small: one system from Anthropic and one system from Google. We find that all three systems achieve 0% accuracy on the test set of 60 questions. We introduce three strategic improvements to ODR, resulting in the ODR+ model, which achieves a state-of-the-art 10% success rate on BC-Small among both closed-source and open-source systems. We report ablation studies indicating that all three of our improvements contributed to the success of ODR+.

[382] An LLM + ASP Workflow for Joint Entity-Relation Extraction

Trang Tran, Trung Hoang Le, Huiping Cao, Tran Cao Son

Main category: cs.AI

TL;DR: Proposes a novel joint entity-relation extraction workflow combining LLMs for natural language understanding and Answer Set Programming for knowledge representation, achieving state-of-the-art results with only 10% training data.

Details

Motivation: Traditional JERE approaches require large annotated datasets and lack flexibility for incorporating domain knowledge, making model creation labor-intensive and elaboration-intolerant.

Method: A generic workflow combining generative pre-trained LLMs (for natural language understanding from unannotated text) with Answer Set Programming (for knowledge representation and reasoning). ASP’s elaboration tolerance allows easy incorporation of domain knowledge without modifying core programs.

Result: The LLM+ASP workflow outperforms state-of-the-art JERE systems with only 10% training data, achieving 2.5x improvement (35% vs 15%) in Relation Extraction on the challenging SciERC corpus.

Conclusion: The proposed hybrid approach effectively addresses limitations of traditional JERE methods by leveraging LLMs’ language understanding and ASP’s knowledge representation capabilities, enabling high performance with minimal training data and easy domain adaptation.

Abstract: Joint entity-relation extraction (JERE) identifies both entities and their relationships simultaneously. Traditional machine-learning based approaches to performing this task require a large corpus of annotated data and lack the ability to easily incorporate domain specific information in the construction of the model. Therefore, creating a model for JERE is often labor intensive, time consuming, and elaboration intolerant. In this paper, we propose harnessing the capabilities of generative pre-trained large language models (LLMs) and the knowledge representation and reasoning capabilities of Answer Set Programming (ASP) to perform JERE. We present a generic workflow for JERE using LLMs and ASP. The workflow is generic in the sense that it can be applied for JERE in any domain. It takes advantage of LLM’s capability in natural language understanding in that it works directly with unannotated text. It exploits the elaboration tolerant feature of ASP in that no modification of its core program is required when additional domain specific knowledge, in the form of type specifications, is found and needs to be used. We demonstrate the usefulness of the proposed workflow through experiments with limited training data on three well-known benchmarks for JERE. The results of our experiments show that the LLM + ASP workflow is better than state-of-the-art JERE systems in several categories with only 10% of training data. It is able to achieve a 2.5 times (35% over 15%) improvement in the Relation Extraction task for the SciERC corpus, one of the most difficult benchmarks.

[383] When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

Yingzhi Mao, Chunkang Zhang, Junxiang Wang, Xinyan Guan, Boxi Cao, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun

Main category: cs.AI

TL;DR: The paper identifies Self-Jailbreak failures in Large Reasoning Models where models initially recognize harmful queries but override safety judgments during reasoning, then proposes Chain-of-Guardrail (CoG) for targeted step-level safety interventions while preserving reasoning capabilities.

Details

Motivation: Existing safety methods apply coarse-grained constraints over entire reasoning trajectories, which can undermine reasoning capability while failing to address root causes of unsafe behavior. The authors discovered Self-Jailbreak failures where models recognize harmful intent initially but override safety judgments during reasoning steps.

Method: Proposed Chain-of-Guardrail (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions. It focuses on addressing safety failures at the reasoning step level rather than applying blanket constraints over entire reasoning processes.

Result: Experiments across multiple safety and reasoning benchmarks show that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.

Conclusion: Safety failures in LRMs primarily arise from reasoning steps rather than initial harm recognition, and targeted step-level interventions (CoG) can effectively mitigate Self-Jailbreak while maintaining reasoning capabilities.

Abstract: Large Reasoning Models (LRMs) achieve strong performance on complex multi-step reasoning, yet they still exhibit severe safety failures such as harmful content generation. Existing methods often apply coarse-grained constraints over the entire reasoning trajectories, which can undermine reasoning capability while failing to address the root causes of unsafe behavior. In this work, we uncover a previously underexplored failure mode in LRMs, termed Self-Jailbreak, where models initially recognize the harmful intent of a query, but override this judgment during subsequent reasoning steps, ultimately generating unsafe outputs. Such a phenomenon reveals that LRMs are capable of recognizing harm, while safety failures primarily arise from reasoning steps. Motivated by this finding, we propose \emph{Chain-of-Guardrail} (CoG), a trajectory-level training framework that mitigates Self-Jailbreak via targeted, step-level interventions while maintaining reasoning ability. Experiments across multiple safety and reasoning benchmarks indicate that CoG achieves a favorable balance between safety and reasoning performance compared with existing approaches.

[384] Rethinking the Text-Vision Reasoning Imbalance in MLLMs through the Lens of Training Recipes

Guanyu Yao, Qiucheng Wu, Yang Zhang, Zhaowen Wang, Handong Zhao, Shiyu Chang

Main category: cs.AI

TL;DR: The paper identifies and addresses the “modality gap” in MLLMs where models over-rely on textual cues and under-attend to visual content, proposing training recipe improvements to bridge this gap.

Details

Motivation: Current multimodal large language models (MLLMs) show an imbalance in reasoning capabilities, over-relying on textual cues while under-attending to visual content, leading to suboptimal performance on vision-centric tasks requiring genuine visual reasoning.

Method: Analyzes the modality gap through training recipes, showing existing approaches amplify the gap, then systematically explores strategies to bridge it from two complementary perspectives: data design and loss function design.

Result: The findings provide insights into developing training recipes that mitigate the modality gap and promote more balanced multimodal reasoning in MLLMs.

Conclusion: The paper addresses the modality gap problem in MLLMs and proposes training recipe improvements to achieve more balanced multimodal reasoning, with code publicly available for further research.

Abstract: Multimodal large language models (MLLMs) have demonstrated strong capabilities on vision-and-language tasks. However, recent findings reveal an imbalance in their reasoning capabilities across visual and textual modalities. Specifically, current MLLMs often over-rely on textual cues while under-attending to visual content, resulting in suboptimal performance on tasks that require genuine visual reasoning. We refer to this phenomenon as the \textit{modality gap}, defined as the performance disparity between text-centric and vision-centric inputs. In this paper, we analyze the modality gap through the lens of training recipes. We first show that existing training recipes tend to amplify this gap. Then, we systematically explore strategies to bridge it from two complementary perspectives: data and loss design. Our findings provide insights into developing training recipes that mitigate the modality gap and promote more balanced multimodal reasoning. Our code is publicly available at https://github.com/UCSB-NLP-Chang/Bridging-Modality-Gap.

[385] LPFQA: A Long-Tail Professional Forum-based Benchmark for LLM Evaluation

Liya Zhu, Peizhuang Cong, Jingzhe Ding, Aowei Ji, Wenya Wu, Jiani Hou, Chunjie Wu, Xiang Gao, Jingkai Liu, Zhou Huan, Xuelei Sun, Yang Yang, Jianpeng Jiao, Liang Hu, Xinjie Chen, Jiashuo Liu, Tong Yang, Zaiyuan Wang, Ge Zhang, Wenhao Huang

Main category: cs.AI

TL;DR: LPFQA is a new benchmark for evaluating LLMs on long-tail, expertise-intensive knowledge from real professional forums across 7 domains, revealing significant performance gaps in specialized reasoning.

Details

Motivation: Standard LLM benchmarks fail to capture real-world professional expertise and long-tail knowledge. Current evaluations don't adequately test specialized reasoning, domain terminology understanding, or contextual interpretation in authentic professional scenarios.

Method: Created LPFQA benchmark from authentic professional forum discussions across 7 academic/industrial domains with 430 curated tasks. Uses hierarchical difficulty structure for semantic clarity and uniquely identifiable answers. Evaluates specialized reasoning, domain terminology, and contextual interpretation.

Result: Experiments on multiple mainstream LLMs show substantial performance gaps, especially on tasks requiring deep domain reasoning. Exposes limitations overlooked by existing benchmarks.

Conclusion: LPFQA provides an authentic, discriminative evaluation framework that complements prior benchmarks and informs future LLM development for real-world professional applications.

Abstract: Large Language Models (LLMs) perform well on standard reasoning and question-answering benchmarks, yet such evaluations often fail to capture their ability to handle long-tail, expertise-intensive knowledge in real-world professional scenarios. We introduce LPFQA, a long-tail knowledge benchmark derived from authentic professional forum discussions, covering 7 academic and industrial domains with 430 curated tasks grounded in practical expertise. LPFQA evaluates specialized reasoning, domain-specific terminology understanding, and contextual interpretation, and adopts a hierarchical difficulty structure to ensure semantic clarity and uniquely identifiable answers. Experiments on over multiple mainstream LLMs reveal substantial performance gaps, particularly on tasks requiring deep domain reasoning, exposing limitations overlooked by existing benchmarks. Overall, LPFQA provides an authentic and discriminative evaluation framework that complements prior benchmarks and informs future LLM development.

[386] MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Domain Risks in LLMs

Liang Shan, Kaicheng Shen, Wen Wu, Zhenyu Ying, Chaochao Lu, Yan Teng, Jingqi Huang, Guangze Ye, Guoqing Wang, Liang He

Main category: cs.AI

TL;DR: MENTOR framework uses metacognition and activation steering to reduce LLM vulnerabilities to domain-specific implicit risks, achieving expert-level safety performance.

Details

Motivation: Current LLM safety measures fail to address implicit, domain-specific risks, creating vulnerabilities that need adaptive, scalable solutions for real-world deployment.

Method: MENTOR framework: 1) Structured self-assessment through simulated critical thinking (perspective-taking, consequential reasoning), 2) Formalization into dynamic rule-based knowledge graphs, 3) Activation steering at inference time to modulate internal representations for compliance.

Result: Reduced attack success rates across education, finance, and management domains; achieved risk analysis performance comparable to human experts; substantial improvement over baseline 57.8% jailbreak vulnerability.

Conclusion: MENTOR provides a scalable, adaptive pathway for robust domain-specific alignment of LLMs, addressing critical safety gaps through metacognition-driven self-evolution.

Abstract: Ensuring the safety of Large Language Models (LLMs) is critical for real-world deployment. However, current safety measures often fail to address implicit, domain-specific risks. To investigate this gap, we introduce a dataset of 3,000 annotated queries spanning education, finance, and management. Evaluations across 14 leading LLMs reveal a concerning vulnerability: an average jailbreak success rate of 57.8%. In response, we propose MENTOR, a metacognition-driven self-evolution framework. MENTOR first performs structured self-assessment through simulated critical thinking, such as perspective-taking and consequential reasoning to uncover latent model misalignments. These reflections are formalized into dynamic rule-based knowledge graphs that evolve with emerging risk patterns. To enforce these rules at inference time, we introduce activation steering, a method that directly modulates the model’s internal representations to ensure compliance. Experiments demonstrate that MENTOR substantially reduces attack success rates across all tested domains and achieves risk analysis performance comparable to human experts. Our work offers a scalable and adaptive pathway toward robust domain-specific alignment of LLMs.

[387] Beyond Detection: Exploring Evidence-based Multi-Agent Debate for Misinformation Intervention and Persuasion

Chen Han, Yijia Ma, Jin Tan, Wenzhen Zheng, Xijin Tang

Main category: cs.AI

TL;DR: ED2D is an evidence-based multi-agent debate framework for misinformation detection that not only detects misinformation but also generates persuasive debunking explanations to correct user beliefs and discourage misinformation sharing.

Details

Motivation: Prior multi-agent debate frameworks focused only on detection accuracy but overlooked helping users understand reasoning behind factual judgments and developing future resilience against misinformation. Debate transcripts offer rich but underutilized resources for transparent reasoning.

Method: ED2D extends previous multi-agent debate approaches by incorporating factual evidence retrieval. It’s designed as both a detection framework and a persuasive multi-agent system that generates debunking transcripts to correct user beliefs and discourage misinformation sharing.

Result: ED2D outperforms existing baselines across three misinformation detection benchmarks. When correct, its debunking transcripts show persuasive effects comparable to human experts. However, when ED2D misclassifies, its explanations may inadvertently reinforce user misconceptions even when presented alongside accurate human explanations.

Conclusion: The findings highlight both the promise and potential risks of deploying multi-agent debate systems for misinformation intervention. The authors developed a public community website to help users explore ED2D, fostering transparency, critical thinking, and collaborative fact-checking.

Abstract: Multi-agent debate (MAD) frameworks have emerged as promising approaches for misinformation detection by simulating adversarial reasoning. While prior work has focused on detection accuracy, it overlooks the importance of helping users understand the reasoning behind factual judgments and develop future resilience. The debate transcripts generated during MAD offer a rich but underutilized resource for transparent reasoning. In this study, we introduce ED2D, an evidence-based MAD framework that extends previous approach by incorporating factual evidence retrieval. More importantly, ED2D is designed not only as a detection framework but also as a persuasive multi-agent system aimed at correcting user beliefs and discouraging misinformation sharing. We compare the persuasive effects of ED2D-generated debunking transcripts with those authored by human experts. Results demonstrate that ED2D outperforms existing baselines across three misinformation detection benchmarks. When ED2D generates correct predictions, its debunking transcripts exhibit persuasive effects comparable to those of human experts; However, when ED2D misclassifies, its accompanying explanations may inadvertently reinforce users’misconceptions, even when presented alongside accurate human explanations. Our findings highlight both the promise and the potential risks of deploying MAD systems for misinformation intervention. We further develop a public community website to help users explore ED2D, fostering transparency, critical thinking, and collaborative fact-checking.

[388] Belief Is All You Need: Modeling Narrative Archetypes in Conspiratorial Discourse

Soorya Ram Shimgekar, Abhay Goyal, Roy Ka-Wei Lee, Koustuv Saha, Pi Zonooz, Navin Kumar

Main category: cs.AI

TL;DR: Researchers analyze conspiratorial narratives in Singapore Telegram groups using a two-stage framework: fine-tuning RoBERTa for classification and building a signed belief graph with a novel SiBeGNN model to identify narrative archetypes.

Details

Motivation: Conspiratorial discourse is increasingly embedded in digital ecosystems but remains difficult to study structurally. Current approaches often assume such content exists in isolated echo chambers, but this work aims to understand how conspiratorial narratives are woven into everyday discussions.

Method: Two-stage computational framework: 1) Fine-tune RoBERTa-large to classify messages as conspiratorial (F1=0.866 on 2k expert-labeled messages). 2) Build signed belief graph with nodes as messages and edge signs reflecting belief alignment, weighted by textual similarity. Introduce SiBeGNN with Sign Disentanglement Loss to learn embeddings separating ideological alignment from stylistic features.

Result: Identified seven narrative archetypes across 553,648 messages: legal topics, medical concerns, media discussions, finance, contradictions in authority, group moderation, and general chat. SiBeGNN achieved superior clustering quality (cDBI=8.38) vs baselines (13.60-67.27), with 88% inter-rater agreement. Found conspiratorial messages appear not only in skepticism clusters but also within routine discussions of finance, law, and everyday matters.

Conclusion: Conspiratorial discourse operates within ordinary social interaction, challenging assumptions about online radicalization. The framework advances computational methods for belief-driven discourse analysis with applications for stance detection, political communication studies, and content moderation policy.

Abstract: Conspiratorial discourse is increasingly embedded within digital communication ecosystems, yet its structure and spread remain difficult to study. This work analyzes conspiratorial narratives in Singapore-based Telegram groups, showing that such content is woven into everyday discussions rather than confined to isolated echo chambers. We propose a two-stage computational framework. First, we fine-tune RoBERTa-large to classify messages as conspiratorial or not, achieving an F1-score of 0.866 on 2,000 expert-labeled messages. Second, we build a signed belief graph in which nodes represent messages and edge signs reflect alignment in belief labels, weighted by textual similarity. We introduce a Signed Belief Graph Neural Network (SiBeGNN) that uses a Sign Disentanglement Loss to learn embeddings that separate ideological alignment from stylistic features. Using hierarchical clustering on these embeddings, we identify seven narrative archetypes across 553,648 messages: legal topics, medical concerns, media discussions, finance, contradictions in authority, group moderation, and general chat. SiBeGNN yields stronger clustering quality (cDBI = 8.38) than baseline methods (13.60 to 67.27), supported by 88 percent inter-rater agreement in expert evaluations. Our analysis shows that conspiratorial messages appear not only in clusters focused on skepticism or distrust, but also within routine discussions of finance, law, and everyday matters. These findings challenge common assumptions about online radicalization by demonstrating that conspiratorial discourse operates within ordinary social interaction. The proposed framework advances computational methods for belief-driven discourse analysis and offers applications for stance detection, political communication studies, and content moderation policy.

[389] Modular and Multi-Path-Aware Offline Benchmarking for Mobile GUI Agents

Youngmin Im, Byeongung Jo, Jaeyoung Wi, Seungwoo Baek, Tae Hoon Min, Joo Hyung Lee, Sangeun Oh, Insik Shin, Sunjae Lee

Main category: cs.AI

TL;DR: MobiBench is a modular, multi-path aware offline benchmarking framework for mobile GUI agents that addresses limitations of current evaluation methods by enabling high-fidelity, scalable, and reproducible assessment while providing detailed component-level analysis.

Details

Motivation: Current evaluation practices for mobile GUI agents have two fundamental limitations: 1) Offline benchmarks use static, single-path datasets that unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to dynamic live environments. 2) Existing benchmarks treat agents as monolithic black boxes, overlooking individual component contributions, leading to unfair comparisons and obscuring performance bottlenecks.

Method: MobiBench is presented as a modular and multi-path aware offline benchmarking framework. It enables high-fidelity, scalable, and reproducible evaluation entirely in offline settings. The framework supports comprehensive module-level analysis to assess individual components of mobile GUI agents rather than treating them as black boxes.

Result: MobiBench achieves 94.72% agreement with human evaluators, matching the performance of carefully engineered online benchmarks while preserving the scalability and reproducibility of static offline benchmarks. The module-level analysis uncovers key insights including systematic evaluation of diverse techniques, optimal module configurations across model scales, inherent limitations of current LFMs (likely Large Foundation Models), and actionable guidelines for designing more capable and cost-efficient mobile agents.

Conclusion: MobiBench successfully addresses the limitations of current mobile GUI agent evaluation methods by providing a modular, multi-path aware offline benchmarking framework that combines the fidelity of online evaluation with the scalability and reproducibility of offline methods, while enabling detailed component-level analysis to guide future agent development.

Abstract: Mobile GUI Agents, AI agents capable of interacting with mobile applications on behalf of users, have the potential to transform human computer interaction. However, current evaluation practices for GUI agents face two fundamental limitations. First, they either rely on single path offline benchmarks or online live benchmarks. Offline benchmarks using static, single path annotated datasets unfairly penalize valid alternative actions, while online benchmarks suffer from poor scalability and reproducibility due to the dynamic and unpredictable nature of live evaluation. Second, existing benchmarks treat agents as monolithic black boxes, overlooking the contributions of individual components, which often leads to unfair comparisons or obscures key performance bottlenecks. To address these limitations, we present MobiBench, the first modular and multi path aware offline benchmarking framework for mobile GUI agents that enables high fidelity, scalable, and reproducible evaluation entirely in offline settings. Our experiments demonstrate that MobiBench achieves 94.72 percent agreement with human evaluators, on par with carefully engineered online benchmarks, while preserving the scalability and reproducibility of static offline benchmarks. Furthermore, our comprehensive module level analysis uncovers several key insights, including a systematic evaluation of diverse techniques used in mobile GUI agents, optimal module configurations across model scales, the inherent limitations of current LFMs, and actionable guidelines for designing more capable and cost efficient mobile agents.

[390] The Reward Model Selection Crisis in Personalized Alignment

Fady Rezk, Yuangang Pan, Chuan-Sheng Foo, Xun Xu, Nancy Chen, Henry Gouk, Timothy Hospedales

Main category: cs.AI

TL;DR: Standard reward model accuracy fails as deployment criterion for personalized alignment; policy accuracy and behavioral benchmarks reveal ranking-generation decoupling; simple in-context learning outperforms reward-guided methods.

Details

Motivation: Current personalized alignment focuses on improving reward model accuracy, assuming better ranking leads to better personalized behavior. However, deployment requires inference-time adaptation (reward-guided decoding), creating a need for reward models that effectively guide generation, not just rank preferences accurately.

Method: Introduces policy accuracy metric to measure whether reward-guided decoding adapted LLMs correctly discriminate between preferred/dispreferred responses. Creates Pref-LaMP benchmark with ground-truth user completions for direct behavioral evaluation. Compares reward-guided methods with simple in-context learning.

Result: RM accuracy correlates weakly with downstream policy accuracy (Kendall’s tau = 0.08-0.31). Methods with 20-point RM accuracy differences produce almost identical output quality. High ranking accuracy methods can fail to generate behaviorally aligned responses. In-context learning dominates all reward-guided methods for models ≥3B parameters, achieving ~3 point ROUGE-1 gains over best reward method at 7B scale.

Conclusion: The field has been optimizing for proxy metrics that don’t predict deployment performance. Current personalized alignment methods fail to operationalize preferences into behavioral adaptation under realistic constraints. Simple in-context learning is surprisingly effective and outperforms complex reward-guided approaches.

Abstract: Personalized alignment from preference data has focused primarily on improving personal reward model (RM) accuracy, with the implicit assumption that better preference ranking translates to better personalized behavior. However, in deployment, computational constraints necessitate inference-time adaptation such as reward-guided decoding (RGD) rather than per-user policy fine-tuning. This creates a critical but overlooked requirement: reward models must not only rank preferences accurately but also effectively guide generation. We demonstrate that standard RM accuracy fails catastrophically as a selection criterion for deployment-ready personalized rewards. We introduce policy accuracy; a metric quantifying whether RGD-adapted LLMs correctly discriminate between preferred and dispreferred responses and show that upstream RM accuracy correlates only weakly with downstream policy accuracy (Kendall’s tau = 0.08–0.31). More critically, we introduce Pref-LaMP the first personalized alignment benchmark with ground-truth user completions, enabling direct behavioural evaluation. On Pref-LaMP, we expose a complete decoupling between discriminative ranking and generation metrics: methods with 20-point RM accuracy differences produce almost identical output quality, and methods with high ranking accuracy can fail to generate behaviorally aligned responses. These findings reveal that the field has been optimizing for proxy metrics that do not predict deployment performance, and that current personalized alignment methods fail to operationalize preferences into behavioral adaptation under realistic deployment constraints. In contrast, we find simple in-context learning (ICL) to be highly effective - dominating all reward-guided methods for models $\geq$3B parameters, achieving $\sim$3 point ROUGE-1 gains over the best reward method at 7B scale.

[391] AMAP Agentic Planning Technical Report

AMAP AI Agent Team, Yulan Hu, Xiangwen Zhang, Sheng Ouyang, Hao Yi, Lu Xu, Qinglin Lang, Lide Tan, Xiang Cheng, Tianchen Ye, Zhicong Li, Ge Chen, Wenjin Yang, Zheng Pan, Shaopan Xiong, Siran Yang, Ju Huang, Yan Zhang, Jiamang Wang, Yong Liu, Yinfeng Huang, Ning Wang, Tucheng Lin, Xin Li, Ning Guo

Main category: cs.AI

TL;DR: STAgent is a specialized LLM agent for spatio-temporal tasks like POI discovery and itinerary planning, featuring tool interaction capabilities while maintaining general performance.

Details

Motivation: To create an agentic LLM specifically designed for complex spatio-temporal reasoning tasks that require interaction with multiple domain tools while preserving general capabilities.

Method: Three key contributions: 1) stable tool environment with 10+ domain tools supporting asynchronous rollout/training, 2) hierarchical data curation framework selecting <1% of raw data emphasizing diversity/difficulty, 3) cascaded training recipe with seed SFT, second SFT on high-certainty queries, and RL on low-certainty data.

Result: STAgent shows promising performance on TravelBench while maintaining general capabilities across wide range of benchmarks, demonstrating effectiveness of the agentic approach.

Conclusion: STAgent successfully combines specialized spatio-temporal reasoning with tool interaction while preserving general LLM capabilities through careful data curation and staged training approach.

Abstract: We present STAgent, an agentic large language model tailored for spatio-temporal understanding, designed to solve complex tasks such as constrained point-of-interest discovery and itinerary planning. STAgent is a specialized model capable of interacting with ten distinct tools within spatio-temporal scenarios, enabling it to explore, verify, and refine intermediate steps during complex reasoning. Notably, STAgent effectively preserves its general capabilities. We empower STAgent with these capabilities through three key contributions: (1) a stable tool environment that supports over ten domain-specific tools, enabling asynchronous rollout and training; (2) a hierarchical data curation framework that identifies high-quality data like a needle in a haystack, curating high-quality queries by retaining less than 1% of the raw data, emphasizing both diversity and difficulty; and (3) a cascaded training recipe that starts with a seed SFT stage acting as a guardian to measure query difficulty, followed by a second SFT stage fine-tuned on queries with high certainty, and an ultimate RL stage that leverages data of low certainty. Initialized with Qwen3-30B-A3B to establish a strong SFT foundation and leverage insights into sample difficulty, STAgent yields promising performance on TravelBench while maintaining its general capabilities across a wide range of general benchmarks, thereby demonstrating the effectiveness of our proposed agentic model.

[392] Logics-STEM: Empowering LLM Reasoning via Failure-Driven Post-Training and Document Knowledge Enhancement

Mingyu Xu, Cheng Fang, Keyue Jiang, Yuqian Zheng, Yanghua Xiao, Baojian Zhou, Qifang Zhao, Suhang Zheng, Xiuwen Zhu, Jiyang Tang, Yongchi Zhao, Yijia Luo, Zhiqi Bai, Yuchi Xu, Wenbo Su, Wei Wang, Bing Zhao, Lin Qu, Xiaoxiao Xu

Main category: cs.AI

TL;DR: Logics-STEM is a reasoning model fine-tuned on a 10M-scale dataset for STEM domains, achieving 4.68% average improvement over next-best 8B models through data-algorithm co-design.

Details

Motivation: To enhance reasoning capabilities in STEM domains by developing a high-performance model through systematic data-algorithm co-design, addressing the need for better reasoning in science, technology, engineering, and mathematics.

Method: 1) Created Logics-STEM-SFT-Dataset (10M scale) using 5-stage curation (annotation, deduplication, decontamination, distillation, stratified sampling). 2) Developed failure-driven post-training framework with targeted knowledge retrieval and data synthesis around model failure regions. 3) Used data-algorithm co-design engine for joint optimization.

Result: Logics-STEM achieves state-of-the-art performance on STEM benchmarks with 4.68% average improvement over next-best 8B-scale models. Both models (8B and 32B) and datasets (10M and 2.2M versions) are publicly released.

Conclusion: The success demonstrates the potential of combining large-scale open-source data with carefully designed synthetic data, highlighting the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training.

Abstract: We present Logics-STEM, a state-of-the-art reasoning model fine-tuned on Logics-STEM-SFT-Dataset, a high-quality and diverse dataset at 10M scale that represents one of the largest-scale open-source long chain-of-thought corpora. Logics-STEM targets reasoning tasks in the domains of Science, Technology, Engineering, and Mathematics (STEM), and exhibits exceptional performance on STEM-related benchmarks with an average improvement of 4.68% over the next-best model at 8B scale. We attribute the gains to our data-algorithm co-design engine, where they are jointly optimized to fit a gold-standard distribution behind reasoning. Data-wise, the Logics-STEM-SFT-Dataset is constructed from a meticulously designed data curation engine with 5 stages to ensure the quality, diversity, and scalability, including annotation, deduplication, decontamination, distillation, and stratified sampling. Algorithm-wise, our failure-driven post-training framework leverages targeted knowledge retrieval and data synthesis around model failure regions in the Supervised Fine-tuning (SFT) stage to effectively guide the second-stage SFT or the reinforcement learning (RL) for better fitting the target distribution. The superior empirical performance of Logics-STEM reveals the vast potential of combining large-scale open-source data with carefully designed synthetic data, underscoring the critical role of data-algorithm co-design in enhancing reasoning capabilities through post-training. We make both the Logics-STEM models (8B and 32B) and the Logics-STEM-SFT-Dataset (10M and downsampled 2.2M versions) publicly available to support future research in the open-source community.

[393] PsychEval: A Multi-Session and Multi-Therapy Benchmark for High-Realism AI Psychological Counselor

Qianjun Pan, Junyi Wang, Jie Zhou, Yutao Yang, Junsong Li, Kaiyin Xu, Yougen Zhou, Yihan Li, Jingyuan Zhao, Qin Chen, Ningning Zhou, Kai Chen, Liang He

Main category: cs.AI

TL;DR: PsychEval is a multi-session, multi-therapy benchmark for training realistic AI counselors with longitudinal memory, adaptive reasoning, and flexible therapeutic strategies across five modalities.

Details

Motivation: To develop reliable AI for psychological assessment by addressing three key challenges: training realistic AI counselors that handle longitudinal sessions, enabling multi-therapy flexibility for complex cases, and establishing systematic evaluation frameworks.

Method: Created a multi-session benchmark spanning 6-10 sessions across three stages, annotated with 677 meta-skills and 4577 atomic skills. Built diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, Postmodernist) with integrative therapy framework across six psychological topics. Established evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions, supported by 2,000+ diverse client profiles.

Result: Extensive experimental analysis validates the superior quality and clinical fidelity of the dataset. PsychEval serves as both a benchmark and a high-fidelity reinforcement learning environment for self-evolutionary training of clinically responsible AI counselors.

Conclusion: PsychEval provides a comprehensive solution for developing realistic, multi-therapy AI counselors with systematic evaluation capabilities, moving beyond static benchmarking to enable adaptive, clinically responsible AI counselor training through reinforcement learning.

Abstract: To develop a reliable AI for psychological assessment, we introduce \texttt{PsychEval}, a multi-session, multi-therapy, and highly realistic benchmark designed to address three key challenges: \textbf{1) Can we train a highly realistic AI counselor?} Realistic counseling is a longitudinal task requiring sustained memory and dynamic goal tracking. We propose a multi-session benchmark (spanning 6-10 sessions across three distinct stages) that demands critical capabilities such as memory continuity, adaptive reasoning, and longitudinal planning. The dataset is annotated with extensive professional skills, comprising over 677 meta-skills and 4577 atomic skills. \textbf{2) How to train a multi-therapy AI counselor?} While existing models often focus on a single therapy, complex cases frequently require flexible strategies among various therapies. We construct a diverse dataset covering five therapeutic modalities (Psychodynamic, Behaviorism, CBT, Humanistic Existentialist, and Postmodernist) alongside an integrative therapy with a unified three-stage clinical framework across six core psychological topics. \textbf{3) How to systematically evaluate an AI counselor?} We establish a holistic evaluation framework with 18 therapy-specific and therapy-shared metrics across Client-Level and Counselor-Level dimensions. To support this, we also construct over 2,000 diverse client profiles. Extensive experimental analysis fully validates the superior quality and clinical fidelity of our dataset. Crucially, \texttt{PsychEval} transcends static benchmarking to serve as a high-fidelity reinforcement learning environment that enables the self-evolutionary training of clinically responsible and adaptive AI counselors.

[394] Quantum-enhanced long short-term memory with attention for spatial permeability prediction in oilfield reservoirs

Muzhen Zhang, Yujie Cheng, Zhanxiang Lei

Main category: cs.AI

TL;DR: Quantum-enhanced LSTM with attention (QLSTMA) model improves permeability prediction in reservoirs using variational quantum circuits, outperforming traditional methods by 19-20% error reduction.

Details

Motivation: Spatial prediction of reservoir parameters like permeability is crucial for oil/gas exploration, but existing methods struggle with permeability's wide range and high variability. Quantum computing offers potential to handle this complexity.

Method: Developed QLSTMA model incorporating variational quantum circuits (VQCs) into recurrent cells, leveraging quantum entanglement and superposition. Created two variants: QLSTMA-SG (Shared Gates) and QLSTMA-IG (Independent Gates) to study quantum structure configurations and qubit effects.

Result: 8-qubit QLSTMA-IG model significantly outperformed traditional LSTMA, reducing MAE by 19% and RMSE by 20%, with particularly strong performance in regions with complex well-logging data. Increasing qubits yields further accuracy gains.

Conclusion: Quantum-classical hybrid neural networks show strong potential for reservoir prediction, establishing a framework for eventual deployment on real quantum hardware and extension to broader petroleum engineering and geoscience applications.

Abstract: Spatial prediction of reservoir parameters, especially permeability, is crucial for oil and gas exploration and development. However, the wide range and high variability of permeability prevent existing methods from providing reliable predictions. For the first time in subsurface spatial prediction, this study presents a quantum-enhanced long short-term memory with attention (QLSTMA) model that incorporates variational quantum circuits (VQCs) into the recurrent cell. Using quantum entanglement and superposition principles, the QLSTMA significantly improves the ability to predict complex geological parameters such as permeability. Two quantization structures, QLSTMA with Shared Gates (QLSTMA-SG) and with Independent Gates (QLSTMA-IG), are designed to investigate and evaluate the effects of quantum structure configurations and the number of qubits on model performance. Experimental results demonstrate that the 8-qubit QLSTMA-IG model significantly outperforms the traditional long short-term memory with attention (LSTMA), reducing Mean Absolute Error (MAE) by 19% and Root Mean Squared Error (RMSE) by 20%, with particularly strong performance in regions featuring complex well-logging data. These findings validate the potential of quantum-classical hybrid neural networks for reservoir prediction, indicating that increasing the number of qubits yields further accuracy gains despite the reliance on classical simulations. This study establishes a foundational framework for the eventual deployment of such models on real quantum hardware and their extension to broader applications in petroleum engineering and geoscience.

[395] SimRPD: Optimizing Recruitment Proactive Dialogue Agents through Simulator-Based Data Evaluation and Selection

Zhiyong Cao, Dunqiang Liu, Qi Dai, Haojun Xu, Huaiyan Xu, Huan He, Yafei Liu, Siyuan Liu, XiaoLin Lin, Ke Ma, Ruqian Shi, Sijia Yao, Hao Wang, Sicheng Zhou

Main category: cs.AI

TL;DR: SimRPD is a three-stage framework for training recruitment proactive dialogue agents using synthetic data generation and quality selection to overcome domain-specific data scarcity.

Details

Motivation: Task-oriented proactive dialogue agents are crucial for recruitment (e.g., acquiring social-media contacts for conversion), but their performance is limited by the scarcity of high-quality, goal-oriented domain-specific training data.

Method: Three-stage framework: 1) Develop a high-fidelity user simulator to synthesize large-scale conversational data through multi-turn online dialogue; 2) Introduce multi-dimensional evaluation framework based on Chain-of-Intention (CoI) with global-level and instance-level metrics to assess simulator and select high-quality data; 3) Train recruitment proactive dialogue agent on selected dataset.

Result: Experiments in real-world recruitment scenario show SimRPD outperforms existing simulator-based data selection strategies, demonstrating practical value for industrial deployment and potential applicability to other business-oriented dialogue scenarios.

Conclusion: SimRPD effectively addresses data scarcity in recruitment proactive dialogue systems through synthetic data generation and quality selection, offering a practical solution with broader business applications.

Abstract: Task-oriented proactive dialogue agents play a pivotal role in recruitment, particularly for steering conversations towards specific business outcomes, such as acquiring social-media contacts for private-channel conversion. Although supervised fine-tuning and reinforcement learning have proven effective for training such agents, their performance is heavily constrained by the scarcity of high-quality, goal-oriented domain-specific training data. To address this challenge, we propose SimRPD, a three-stage framework for training recruitment proactive dialogue agents. First, we develop a high-fidelity user simulator to synthesize large-scale conversational data through multi-turn online dialogue. Then we introduce a multi-dimensional evaluation framework based on Chain-of-Intention (CoI) to comprehensively assess the simulator and effectively select high-quality data, incorporating both global-level and instance-level metrics. Finally, we train the recruitment proactive dialogue agent on the selected dataset. Experiments in a real-world recruitment scenario demonstrate that SimRPD outperforms existing simulator-based data selection strategies, highlighting its practical value for industrial deployment and its potential applicability to other business-oriented dialogue scenarios.

[396] Toward Maturity-Based Certification of Embodied AI: Quantifying Trustworthiness Through Measurement Mechanisms

Michael C. Darling, Alan H. Hesu, Michael A. Mardikes, Brian C. McGuigan, Reed M. Milewicz

Main category: cs.AI

TL;DR: A maturity-based certification framework for embodied AI systems using structured assessment, quantitative scoring, and multi-objective trade-off navigation, demonstrated through uncertainty quantification in UAS detection.

Details

Motivation: The paper addresses the need for certifiable embodied AI systems by proposing structured frameworks that can provide explicit measurement and assessment mechanisms for trustworthiness evaluation.

Method: A maturity-based certification framework with three key components: structured assessment frameworks, quantitative scoring mechanisms, and methods for navigating multi-objective trade-offs in trustworthiness evaluation. Uncertainty quantification serves as an exemplar measurement mechanism.

Result: The approach is demonstrated through an Uncrewed Aircraft System (UAS) detection case study, showing feasibility of the proposed certification framework.

Conclusion: Embodied AI systems can be certified through structured maturity-based frameworks with explicit measurement mechanisms, enabling systematic trustworthiness evaluation as shown in the UAS detection application.

Abstract: We propose a maturity-based framework for certifying embodied AI systems through explicit measurement mechanisms. We argue that certifiable embodied AI requires structured assessment frameworks, quantitative scoring mechanisms, and methods for navigating multi-objective trade-offs inherent in trustworthiness evaluation. We demonstrate this approach using uncertainty quantification as an exemplar measurement mechanism and illustrate feasibility through an Uncrewed Aircraft System (UAS) detection case study.

[397] EntroCoT: Enhancing Chain-of-Thought via Adaptive Entropy-Guided Segmentation

Zihang Li, Yuhang Wang, Yikun Zong, Wenhan Yu, Xiaokun Yuan, Runhan Jiang, Zirui Liu, Tong Yang, Arthur Jiang

Main category: cs.AI

TL;DR: EntroCoT is a framework that automatically identifies and filters low-quality Chain-of-Thought reasoning traces by segmenting them at uncertain points and evaluating step contributions, creating higher-quality training data for mathematical reasoning.

Details

Motivation: Existing fine-tuning datasets for Chain-of-Thought prompting often contain "answer right but reasoning wrong" problems, where correct final answers are derived from hallucinated, redundant, or logically invalid intermediate steps, leading to poor reasoning quality in trained models.

Method: EntroCoT uses an entropy-based mechanism to segment reasoning traces into steps at uncertain junctures, then employs Monte Carlo rollout-based evaluation to assess the marginal contribution of each step, filtering out deceptive reasoning samples.

Result: Extensive experiments on mathematical benchmarks show that fine-tuning on the subset constructed by EntroCoT consistently outperforms baselines using full-dataset supervision.

Conclusion: EntroCoT effectively addresses the quality issues in CoT supervision data by automatically identifying and refining reasoning traces, resulting in improved mathematical reasoning performance when used for fine-tuning.

Abstract: Chain-of-Thought (CoT) prompting has significantly enhanced the mathematical reasoning capabilities of Large Language Models. We find existing fine-tuning datasets frequently suffer from the “answer right but reasoning wrong” probelm, where correct final answers are derived from hallucinated, redundant, or logically invalid intermediate steps. This paper proposes EntroCoT, a unified framework for automatically identifying and refining low-quality CoT supervision traces. EntroCoT first proposes an entropy-based mechanism to segment the reasoning trace into multiple steps at uncertain junctures, and then introduces a Monte Carlo rollout-based mechanism to evaluate the marginal contribution of each step. By accurately filtering deceptive reasoning samples, EntroCoT constructs a high-quality dataset where every intermediate step in each reasoning trace facilitates the final answer. Extensive experiments on mathematical benchmarks demonstrate that fine-tuning on the subset constructed by EntroCoT consistently outperforms the baseslines of full-dataset supervision.

[398] Current Agents Fail to Leverage World Model as Tool for Foresight

Cheng Qian, Emre Can Acikgoz, Bingxuan Li, Xiusi Chen, Yuji Zhang, Bingxiang He, Qinyu Luo, Dilek Hakkani-Tür, Gokhan Tur, Yunzhu Li, Heng Ji

Main category: cs.AI

TL;DR: Current agents struggle to effectively use generative world models as cognitive tools for anticipatory reasoning, showing low simulation usage, frequent misuse, and inconsistent performance.

Details

Motivation: Agents need to anticipate future states for complex tasks, but current vision-language models rely on short-horizon reasoning. Generative world models could serve as external simulators to enhance agent cognition, but it's unclear if current agents can effectively leverage them.

Method: Empirical examination across diverse agentic and visual question answering tasks, analyzing how agents interact with world models, measuring simulation invocation rates, misuse of predicted rollouts, and performance changes when simulation is available or enforced.

Result: Agents rarely invoke simulation (<1%), frequently misuse predicted rollouts (~15%), and often show inconsistent or degraded performance (up to 5% worse) when simulation is available. The main bottleneck is agents’ inability to decide when to simulate, interpret outcomes, and integrate foresight into reasoning.

Conclusion: Current agents lack the capacity to effectively use world models as cognitive tools. The findings highlight the need for mechanisms that enable calibrated, strategic interaction with world models to achieve reliable anticipatory cognition in future agent systems.

Abstract: Agents built on vision-language models increasingly face tasks that demand anticipating future states rather than relying on short-horizon reasoning. Generative world models offer a promising remedy: agents could use them as external simulators to foresee outcomes before acting. This paper empirically examines whether current agents can leverage such world models as tools to enhance their cognition. Across diverse agentic and visual question answering tasks, we observe that some agents rarely invoke simulation (fewer than 1%), frequently misuse predicted rollouts (approximately 15%), and often exhibit inconsistent or even degraded performance (up to 5%) when simulation is available or enforced. Attribution analysis further indicates that the primary bottleneck lies in the agents’ capacity to decide when to simulate, how to interpret predicted outcomes, and how to integrate foresight into downstream reasoning. These findings underscore the need for mechanisms that foster calibrated, strategic interaction with world models, paving the way toward more reliable anticipatory cognition in future agent systems.

[399] Trade-R1: Bridging Verifiable Rewards to Stochastic Environments via Process-Level Reasoning Verification

Rui Sun, Yifan Sun, Sheng Xu, Li Zhao, Jing Li, Daxin Jiang, Cheng Hua, Zuo Bai

Main category: cs.AI

TL;DR: Trade-R1: A framework using process-level reasoning verification to apply RL to financial decisions, overcoming noisy market rewards via structured RAG verification and triangular consistency metrics.

Details

Motivation: Standard RL works well for domains like math/coding with clear verifiable rewards, but fails in finance due to market stochasticity - noisy rewards cause reward hacking. Need to bridge verifiable rewards to stochastic financial environments.

Method: Proposes Trade-R1 framework with key innovation: verification method transforming reasoning evaluation over financial documents into structured RAG task. Uses triangular consistency metric assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions as validity filter for noisy market returns. Two reward strategies: Fixed-effect Semantic Reward (FSR) for stable alignment, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization.

Result: Experiments on different country asset selection show paradigm reduces reward hacking. DSR achieves superior cross-market generalization while maintaining highest reasoning consistency.

Conclusion: Trade-R1 successfully bridges verifiable rewards to stochastic financial environments via process-level reasoning verification, enabling effective RL application to financial decision-making while mitigating reward hacking through structured verification mechanisms.

Abstract: Reinforcement Learning (RL) has enabled Large Language Models (LLMs) to achieve remarkable reasoning in domains like mathematics and coding, where verifiable rewards provide clear signals. However, extending this paradigm to financial decision is challenged by the market’s stochastic nature: rewards are verifiable but inherently noisy, causing standard RL to degenerate into reward hacking. To address this, we propose Trade-R1, a model training framework that bridges verifiable rewards to stochastic environments via process-level reasoning verification. Our key innovation is a verification method that transforms the problem of evaluating reasoning over lengthy financial documents into a structured Retrieval-Augmented Generation (RAG) task. We construct a triangular consistency metric, assessing pairwise alignment between retrieved evidence, reasoning chains, and decisions to serve as a validity filter for noisy market returns. We explore two reward integration strategies: Fixed-effect Semantic Reward (FSR) for stable alignment signals, and Dynamic-effect Semantic Reward (DSR) for coupled magnitude optimization. Experiments on different country asset selection demonstrate that our paradigm reduces reward hacking, with DSR achieving superior cross-market generalization while maintaining the highest reasoning consistency.

cs.SD

[400] Predictive Controlled Music

Midhun T. Augustine

Main category: cs.SD

TL;DR: PCM combines model predictive control with neural networks for algorithmic music composition, optimizing notes through receding-horizon prediction.

Details

Motivation: To create a systematic approach to algorithmic music composition that combines control theory with machine learning for more structured and optimized music generation.

Method: Uses model predictive control (MPC) framework with feedforward neural network assessment function as objective and recurrent neural network model for constraint definition, computing notes in receding-horizon manner.

Result: Developed PCM framework that generates music through feedback-controlled prediction, with numerical examples demonstrating the method’s effectiveness.

Conclusion: PCM successfully integrates control theory and neural networks for algorithmic composition, providing a structured optimization-based approach to music generation.

Abstract: This paper presents a new approach to algorithmic composition, called predictive controlled music (PCM), which combines model predictive control (MPC) with music generation. PCM uses dynamic models to predict and optimize the music generation process, where musical notes are computed in a manner similar to an MPC problem by optimizing a performance measure. A feedforward neural network-based assessment function is used to evaluate the generated musical score, which serves as the objective function of the PCM optimization problem. Furthermore, a recurrent neural network model is employed to capture the relationships among the variables in the musical notes, and this model is then used to define the constraints in the PCM. Similar to MPC, the proposed PCM computes musical notes in a receding-horizon manner, leading to feedback controlled prediction. Numerical examples are presented to illustrate the PCM generation method.

[401] From Imitation to Innovation: The Divergent Paths of Techno in Germany and the USA

Tim Ziemer, Simon Linke

Main category: cs.SD

TL;DR: Audio analysis of 9,000+ early house/techno tracks reveals distinct German vs US styles, with US music showing less evolution over time, explaining why techno became mainstream in Germany but remained niche in the USA.

Details

Motivation: To validate historical claims about early house and techno music evolution through objective audio analysis rather than relying solely on subjective documentary accounts from scene protagonists.

Method: Analysis of over 9,000 early house and techno tracks from Germany and USA using recording studio features, machine learning, and inferential statistics.

Result: 1) German and US house/techno are distinct, 2) US styles are more similar to each other, 3) US music evolved less over time compared to German house/techno regarding recording studio features.

Conclusion: Audio-based findings validate documentary statements and explain why techno became a mass phenomenon in Germany but remained fringe in the USA, with potential applications for the music industry to predict trend breakthroughs.

Abstract: Many documentaries on early house and techno music exist. Here, protagonists from the scenes describe key elements and events that affected the evolution of the music. In the research community, there is consensus that such descriptions have to be examined critically. Yet, there have not been attempts to validate such statements on the basis of audio analyses. In this study, over 9,000 early house and techno tracks from Germany and the United States of America are analyzed using recording studio features, machine learning and inferential statistics. Three observations can be made: 1.) German and US house/techno music are distinct, 2.) US styles are much more alike, and 3.) scarcely evolved over time compared to German house/techno regarding the recording studio features. These findings are in agreement with documented statements and thus provide an audio-based perspective on why techno became a mass phenomenon in Germany but remained a fringe phenomenon in the USA. Observations like these can help the music industry estimate whether new trends will experience a breakthrough or disappear.

[402] Defense Against Synthetic Speech: Real-Time Detection of RVC Voice Conversion Attacks

Prajwal Chinchmalatpure, Suyash Chinchmalatpure, Siddharth Chavan

Main category: cs.SD

TL;DR: Real-time detection of AI-generated speech using Retrieval-based Voice Conversion (RVC) with streaming classification on 1-second audio segments, achieving reliable detection even in noisy conditions.

Details

Motivation: Generative audio technologies enable highly realistic voice cloning and conversion, increasing risks of impersonation, fraud, and misinformation in communication channels like phone/video calls, necessitating real-time detection solutions.

Method: Frame detection as streaming classification using 1-second audio segments; extract time-frequency and cepstral features; train supervised ML models; simulate realistic conditions by applying deepfake generation to isolated vocals then reintroducing background ambiance to suppress trivial artifacts.

Result: Short-window acoustic features can reliably capture discriminative patterns associated with RVC speech even in noisy backgrounds; system enables low-latency inference with segment-level decisions and call-level aggregation.

Conclusion: Demonstrates feasibility of practical, real-time deepfake speech detection and underscores importance of evaluating under realistic audio mixing conditions for robust deployment.

Abstract: Generative audio technologies now enable highly realistic voice cloning and real-time voice conversion, increasing the risk of impersonation, fraud, and misinformation in communication channels such as phone and video calls. This study investigates real-time detection of AI-generated speech produced using Retrieval-based Voice Conversion (RVC), evaluated on the DEEP-VOICE dataset, which includes authentic and voice-converted speech samples from multiple well-known speakers. To simulate realistic conditions, deepfake generation is applied to isolated vocal components, followed by the reintroduction of background ambiance to suppress trivial artifacts and emphasize conversion-specific cues. We frame detection as a streaming classification task by dividing audio into one-second segments, extracting time-frequency and cepstral features, and training supervised machine learning models to classify each segment as real or voice-converted. The proposed system enables low-latency inference, supporting both segment-level decisions and call-level aggregation. Experimental results show that short-window acoustic features can reliably capture discriminative patterns associated with RVC speech, even in noisy backgrounds. These findings demonstrate the feasibility of practical, real-time deepfake speech detection and underscore the importance of evaluating under realistic audio mixing conditions for robust deployment.

[403] LEMAS: Large A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models

Zhiyuan Zhao, Lijian Lin, Ye Zhu, Kai Xie, Yunfei Liu, Yu Li

Main category: cs.SD

TL;DR: LEMAS-Dataset is the largest open-source multilingual speech corpus with word-level timestamps (150k+ hours across 10 languages). Two benchmark models trained on it show high-quality multilingual synthesis and speech editing.

Details

Motivation: Need for large-scale, high-quality multilingual speech datasets with precise word-level timestamps to advance prompt-based speech generation systems.

Method: 1) Created LEMAS-Dataset via efficient processing pipeline; 2) Trained LEMAS-TTS using non-autoregressive flow-matching with accent-adversarial training and CTC loss; 3) Developed LEMAS-Edit using autoregressive decoder-only architecture for masked token infilling with adaptive decoding.

Result: Models achieve robust zero-shot multilingual synthesis (LEMAS-TTS) and seamless speech editing with natural transitions (LEMAS-Edit), confirming dataset quality and effectiveness.

Conclusion: LEMAS-Dataset’s rich timestamp annotations and fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.

Abstract: We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via a efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset’s massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth-boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset’s quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.

[404] SmoothSync: Dual-Stream Diffusion Transformers for Jitter-Robust Beat-Synchronized Gesture Generation from Quantized Audio

Yujiao Jiang, Qingmin Liao, Zongqing Lu

Main category: cs.SD

TL;DR: SmoothSync is a novel framework for co-speech gesture generation that uses quantized audio tokens in a dual-stream Diffusion Transformer architecture to produce synchronized, smooth, and diverse gestures while addressing issues like motion jitter and foot sliding.

Details

Motivation: Existing co-speech gesture generation methods suffer from rhythmic inconsistency, motion jitter, foot sliding, and limited multi-sampling diversity, creating a need for improved synchronization and motion quality.

Method: Uses quantized audio tokens in a dual-stream Diffusion Transformer (DiT) architecture with: (1) complementary transformer streams for audio-motion feature fusion, (2) jitter-suppression loss for temporal smoothness, (3) probabilistic audio quantization for diverse gesture sequences from identical inputs.

Result: Outperforms state-of-the-art methods by -30.6% FGD, 10.3% Smooth-BC, and 8.4% Diversity on BEAT2 dataset, while reducing jitter and foot sliding by -62.9% and -17.1% respectively.

Conclusion: SmoothSync effectively addresses key challenges in co-speech gesture generation, achieving superior synchronization, smoothness, and diversity while introducing a robust evaluation metric (Smooth-BC) for beat synchronization assessment.

Abstract: Co-speech gesture generation is a critical area of research aimed at synthesizing speech-synchronized human-like gestures. Existing methods often suffer from issues such as rhythmic inconsistency, motion jitter, foot sliding and limited multi-sampling diversity. In this paper, we present SmoothSync, a novel framework that leverages quantized audio tokens in a novel dual-stream Diffusion Transformer (DiT) architecture to synthesis holistic gestures and enhance sampling variation. Specifically, we (1) fuse audio-motion features via complementary transformer streams to achieve superior synchronization, (2) introduce a jitter-suppression loss to improve temporal smoothness, (3) implement probabilistic audio quantization to generate distinct gesture sequences from identical inputs. To reliably evaluate beat synchronization under jitter, we introduce Smooth-BC, a robust variant of the beat consistency metric less sensitive to motion noise. Comprehensive experiments on the BEAT2 and SHOW datasets demonstrate SmoothSync’s superiority, outperforming state-of-the-art methods by -30.6% FGD, 10.3% Smooth-BC, and 8.4% Diversity on BEAT2, while reducing jitter and foot sliding by -62.9% and -17.1% respectively. The code will be released to facilitate future research.

[405] Summary of The Inaugural Music Source Restoration Challenge

Yongyi Zang, Jiarui Hai, Wanying Ge, Qiuqiang Kong, Zheqi Dai, Helin Wang, Yuki Mitsufuji, Mark D. Plumbley

Main category: cs.SD

TL;DR: The inaugural Music Source Restoration Challenge introduced evaluation of systems that restore original instrument stems from professionally mixed and degraded audio, with one system achieving 4.46 dB Multi-Mel-SNR and 3.47 MOS-Overall.

Details

Motivation: Music Source Restoration (MSR) aims to recover original instrument stems from professionally mixed and degraded audio, requiring reversal of both production effects and real-world degradations. There was a need for standardized evaluation in this domain.

Method: The MSR Challenge featured objective evaluation on studio-produced mixtures using Multi-Mel-SNR, Zimtohrli, and FAD-CLAP metrics, alongside subjective evaluation on real-world degraded recordings. Five teams participated with their restoration systems.

Result: The winning system achieved 4.46 dB Multi-Mel-SNR and 3.47 MOS-Overall, representing 91% and 18% relative improvements over the second-place system. Restoration difficulty varied significantly by instrument: bass averaged 4.59 dB across teams while percussion averaged only 0.29 dB.

Conclusion: The MSR Challenge established benchmarks for music source restoration, revealing substantial variation in restoration difficulty across different instruments. The dataset, evaluation protocols, and baselines are publicly available for further research.

Abstract: Music Source Restoration (MSR) aims to recover original, unprocessed instrument stems from professionally mixed and degraded audio, requiring the reversal of both production effects and real-world degradations. We present the inaugural MSR Challenge, which features objective evaluation on studio-produced mixtures using Multi-Mel-SNR, Zimtohrli, and FAD-CLAP, alongside subjective evaluation on real-world degraded recordings. Five teams participated in the challenge. The winning system achieved 4.46 dB Multi-Mel-SNR and 3.47 MOS-Overall, corresponding to relative improvements of 91% and 18% over the second-place system, respectively. Per-stem analysis reveals substantial variation in restoration difficulty across instruments, with bass averaging 4.59 dB across all teams, while percussion averages only 0.29 dB. The dataset, evaluation protocols, and baselines are available at https://msrchallenge.com/.

[406] When Tone and Words Disagree: Towards Robust Speech Emotion Recognition under Acoustic-Semantic Conflict

Dawei Huang, Yongjie Lv, Ruijie Xiong, Chunxiang Jin, Xiaojiang Peng

Main category: cs.SD

TL;DR: Proposes FAS framework to handle acoustic-semantic conflicts in speech emotion recognition, introduces CASE dataset for evaluation, achieves SOTA 59.38% accuracy where conventional models fail.

Details

Motivation: Real-world speech often contains conflicts between vocal emotion (acoustic) and literal word meaning (semantic), but current SER models overlook this issue, leading to performance degradation due to semantic bias or entangled representations.

Method: FAS framework explicitly disentangles acoustic and semantic pathways using separate encoders, then bridges them through a lightweight query-based attention module to handle conflicts.

Result: FAS consistently outperforms existing methods (ASR-based, SSL, ALMs) in both in-domain and zero-shot settings. On the new CASE benchmark, conventional SER models fail dramatically while FAS achieves SOTA 59.38% accuracy.

Conclusion: The paper addresses the critical but overlooked problem of acoustic-semantic conflicts in SER, proposes an effective disentanglement framework, and provides a valuable benchmark dataset for future research in this area.

Abstract: Speech Emotion Recognition (SER) systems often assume congruence between vocal emotion and lexical semantics. However, in real-world interactions, acoustic-semantic conflict is common yet overlooked, where the emotion conveyed by tone contradicts the literal meaning of spoken words. We show that state-of-the-art SER models, including ASR-based, self-supervised learning (SSL) approaches and Audio Language Models (ALMs), suffer performance degradation under such conflicts due to semantic bias or entangled acoustic-semantic representations. To address this, we propose the Fusion Acoustic-Semantic (FAS) framework, which explicitly disentangles acoustic and semantic pathways and bridges them through a lightweight, query-based attention module. To enable systematic evaluation, we introduce the Conflict in Acoustic-Semantic Emotion (CASE), the first dataset dominated by clear and interpretable acoustic-semantic conflicts in varied scenarios. Extensive experiments demonstrate that FAS consistently outperforms existing methods in both in-domain and zero-shot settings. Notably, on the CASE benchmark, conventional SER models fail dramatically, while FAS sets a new SOTA with 59.38% accuracy. Our code and datasets is available at https://github.com/24DavidHuang/FAS.

[407] FlexiVoice: Enabling Flexible Style Control in Zero-Shot TTS with Natural Language Instructions

Dekun Chen, Xueyao Zhang, Yuancheng Wang, Kenan Dai, Li Ma, Zhizheng Wu

Main category: cs.SD

TL;DR: FlexiVoice is a TTS system with LLM core that enables flexible style control via natural language instructions and zero-shot voice cloning via speech references, using progressive post-training for accurate controllability.

Details

Motivation: To create a TTS system that can flexibly control speaking style through natural language instructions while also supporting zero-shot voice cloning from speech references, enabling decoupled control of style, timbre, and content.

Method: Built with LLM core taking text input plus optional style instructions and speech references. Uses Progressive Post-Training (PPT): 1) DPO for accurate following of instructions and references, 2) multi-objective GRPO to disentangle style, timbre, and content, 3) instruction GRPO for advanced instruction following.

Result: FlexiVoice surpasses competing baselines, demonstrates strong capability in decoupling control factors, and human evaluations confirm its naturalness, controllability, and robustness.

Conclusion: FlexiVoice successfully achieves flexible style control with zero-shot voice cloning through its LLM architecture and progressive training approach, offering a powerful TTS solution with disentangled control over multiple factors.

Abstract: This study proposes FlexiVoice, a text-to-speech (TTS) synthesis system capable of flexible style control with zero-shot voice cloning. The speaking style is controlled by a natural-language instruction and the voice timbre is provided by a speech reference in zero-shot manner. FlexiVoice is built with an LLM core, which takes text as input, and also takes an optional natural language instruction and an optional speech reference to control style and timbre, respectively. FlexiVoice is equipped with a novel Progressive Post-Training (PPT) scheme that progressively unlocks accurate and flexible controllability. In particular, it first employs Direct Preference Optimization (DPO) to enable FlexiVoice to accurately follow both natural language instruction and speech reference simultaneously. It then uses a multi-objective Group Relative Policy Optimization (GRPO) to disentangle style instruction, reference timbre, and textual content. Finally, it adapts instruction GRPO for more advanced instruction following. Experimental results show that FlexiVoice surpasses competing baselines and demonstrates strong capability in decoupling control factors. Human evaluations further confirm its naturalness, controllability, and robustness. Audio samples are available at https://flexi-voice.github.io.

[408] MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

Chunyu Qiang, Jun Wang, Xiaopeng Wang, Kang Yin, Yuxin Guo

Main category: cs.SD

TL;DR: MM-Sonate is a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning, achieving state-of-the-art performance in lip sync and speech quality.

Details

Motivation: Current joint audio-video generation models struggle with fine-grained acoustic control and identity-preserving speech. Existing approaches have temporal misalignment issues from cascaded generation or lack zero-shot voice cloning capabilities within a unified framework.

Method: MM-Sonate uses a multimodal flow-matching framework with unified instruction-phoneme input for strict linguistic/temporal alignment. It introduces a timbre injection mechanism to decouple speaker identity from content, and a noise-based negative conditioning strategy to enhance acoustic fidelity.

Result: MM-Sonate establishes new SOTA in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized TTS systems.

Conclusion: The proposed framework successfully addresses key limitations in joint audio-video generation by enabling fine-grained acoustic control, zero-shot voice cloning, and improved temporal alignment through innovative architectural components.

Abstract: Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior works that rely on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance in joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.

[409] MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

MOSI. AI, :, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Songlin Wang, Zhiyu Wu, Qinyuan Cheng, Shimin Li, Xipeng Qiu

Main category: cs.SD

TL;DR: MOSS Transcribe Diarize is a unified multimodal LLM that performs end-to-end speaker-attributed, time-stamped transcription with 128k context window for 90-minute inputs, outperforming commercial systems.

Details

Motivation: Existing SATS systems lack end-to-end formulation, have limited context windows, weak long-range speaker memory, and cannot output timestamps, creating limitations for meeting transcription needs.

Method: Developed MOSS Transcribe Diarize, a unified multimodal large language model trained on extensive real wild data with 128k context window for up to 90-minute inputs, performing joint speaker-attributed, time-stamped transcription end-to-end.

Result: Outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks, demonstrating strong scaling and robust generalization capabilities.

Conclusion: MOSS Transcribe Diarize successfully addresses limitations of existing SATS systems through an end-to-end multimodal LLM approach with large context windows, achieving superior performance for meeting transcription tasks.

Abstract: Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to precisely determine the timing of each speaker, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real wild data and equipped with a 128k context window for up to 90-minute inputs, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.

[410] LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence

Hyeongkeun Lee, Jongmin Choi, KiHyun Nam, Joon Son Chung

Main category: cs.SD

TL;DR: LAMB is an LLM-based audio captioning framework that bridges the modality gap between audio and text embeddings through cross-modal alignment and semantic enrichment, achieving state-of-the-art performance.

Details

Motivation: Prior approaches project audio features into LLM embedding space without proper cross-modal alignment, failing to fully utilize LLMs' reasoning capabilities for audio captioning.

Method: Proposes LAMB framework with: 1) Cross-Modal Aligner minimizing Cauchy-Schwarz divergence while maximizing mutual information for global and token-level alignment; 2) Two-Stream Adapter extracting semantically enriched audio embeddings; 3) Token Guide computing scores in LLM text embedding space to steer caption generation.

Result: Experimental results confirm the framework strengthens LLM decoder reasoning capabilities and achieves state-of-the-art performance on AudioCaps dataset.

Conclusion: LAMB effectively bridges the modality gap between audio and text embeddings, enabling better utilization of LLM reasoning capabilities for automated audio captioning.

Abstract: Automated Audio Captioning aims to describe the semantic content of input audio. Recent works have employed large language models (LLMs) as a text decoder to leverage their reasoning capabilities. However, prior approaches that project audio features into the LLM embedding space without considering cross-modal alignment fail to fully utilize these capabilities. To address this, we propose LAMB, an LLM-based audio captioning framework that bridges the modality gap between audio embeddings and the LLM text embedding space. LAMB incorporates a Cross-Modal Aligner that minimizes Cauchy-Schwarz divergence while maximizing mutual information, yielding tighter alignment between audio and text at both global and token levels. We further design a Two-Stream Adapter that extracts semantically enriched audio embeddings, thereby delivering richer information to the Cross-Modal Aligner. Finally, leveraging the aligned audio embeddings, a proposed Token Guide directly computes scores within the LLM text embedding space to steer the output logits of generated captions. Experimental results confirm that our framework strengthens the reasoning capabilities of the LLM decoder, achieving state-of-the-art performance on AudioCaps.

[411] MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free

Yishu Lei, Shuwei He, Jing Hu, Dan Zhang, Xianlong Luo, Danxiang Zhu, Shikun Feng, Rui Liu, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang

Main category: cs.SD

TL;DR: MoE-Adapter: A sparse Mixture-of-Experts architecture that decouples heterogeneous acoustic information in audio LLMs to mitigate gradient conflicts and improve performance on audio tasks.

Details

Motivation: Audio information is intrinsically heterogeneous (speech, music, environmental sounds), but existing dense parameter-shared adapters cause gradient conflicts during optimization because parameter updates for different acoustic attributes contradict each other.

Method: Introduces MoE-Adapter, a sparse Mixture-of-Experts architecture with dynamic gating that routes audio tokens to specialized experts for complementary feature subspaces while retaining shared experts for global context.

Result: Achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines with comparable computational costs.

Conclusion: The MoE-Adapter effectively addresses gradient conflicts in audio LLMs by decoupling heterogeneous acoustic information, enabling fine-grained feature learning while maintaining computational efficiency.

Abstract: Extending the input modality of Large Language Models~(LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, it is well-known that acoustic information is intrinsically \textit{heterogeneous}, entangling attributes such as speech, music, and environmental context. Existing research is limited to a dense, parameter-shared adapter to model these diverse patterns, which induces \textit{gradient conflict} during optimization, as parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the \textit{\textbf{MoE-Adapter}}, a sparse Mixture-of-Experts~(MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines with comparable computational costs. Furthermore, we will release the related code and models to facilitate future research.

[412] Semi-Supervised Diseased Detection from Speech Dialogues with Multi-Level Data Modeling

Xingyuan Li, Mengyue Wu

Main category: cs.SD

TL;DR: A novel semi-supervised learning framework for medical speech analysis that addresses weak supervision by modeling hierarchical representations (frame, segment, session levels) to detect pathological traits in clinical dialogues.

Details

Motivation: Medical speech analysis faces weak supervision problems: session-level labels must link to nuanced patterns in long recordings, compounded by data scarcity and subjective clinical annotations. Existing SSL methods fail to address that pathological traits aren't uniformly expressed in speech.

Method: Proposes an audio-only SSL framework that jointly learns from frame-level, segment-level, and session-level representations within unsegmented clinical dialogues. Uses dynamic aggregation of multi-granularity features and generates high-quality pseudo-labels to efficiently utilize unlabeled data.

Result: Framework is model-agnostic, robust across languages and conditions, and highly data-efficient - achieving 90% of fully-supervised performance using only 11 labeled samples.

Conclusion: Provides a principled approach to learning from weak, far-end supervision in medical speech analysis by explicitly modeling the hierarchy of pathological trait expression in speech.

Abstract: Detecting medical conditions from speech acoustics is fundamentally a weakly-supervised learning problem: a single, often noisy, session-level label must be linked to nuanced patterns within a long, complex audio recording. This task is further hampered by severe data scarcity and the subjective nature of clinical annotations. While semi-supervised learning (SSL) offers a viable path to leverage unlabeled data, existing audio methods often fail to address the core challenge that pathological traits are not uniformly expressed in a patient’s speech. We propose a novel, audio-only SSL framework that explicitly models this hierarchy by jointly learning from frame-level, segment-level, and session-level representations within unsegmented clinical dialogues. Our end-to-end approach dynamically aggregates these multi-granularity features and generates high-quality pseudo-labels to efficiently utilize unlabeled data. Extensive experiments show the framework is model-agnostic, robust across languages and conditions, and highly data-efficient-achieving, for instance, 90% of fully-supervised performance using only 11 labeled samples. This work provides a principled approach to learning from weak, far-end supervision in medical speech analysis.

[413] ChronosAudio: A Comprehensive Long-Audio Benchmark for Evaluating Audio-Large Language Models

Kaiwen Luo, Liang Lin, Yibo Zhang, Moayad Aloqaily, Dexian Wang, Zhenhong Zhou, Junwei Zhang, Kun Wang, Li Sun, Qingsong Wen

Main category: cs.SD

TL;DR: ChronosAudio is the first multi-task benchmark for evaluating long-audio understanding in Audio LLMs, revealing severe performance degradation in long contexts with current models struggling to maintain temporal locality.

Details

Motivation: Despite substantial advancements in Audio Large Language Models (ALLMs), their long audio understanding capabilities remain unexplored. Existing benchmarks focus on short-form clips, leaving no consensus on evaluating ALLMs over extended durations.

Method: Proposes ChronosAudio, a multi-task benchmark with six major task categories comprising 36,000 test instances totaling over 200 hours of audio, stratified into short, middle, and long-form categories. Evaluates 16 state-of-the-art models using this benchmark.

Result: Three critical findings: 1) Precipitous Long-Context Collapse: ALLMs show over 90% performance degradation from short to long contexts; 2) Structural Attention Dilution: Attention mechanisms suffer significant diffusion in later sequences; 3) Restorative Ceiling of Mitigation: Current strategies only offer 50% recovery.

Conclusion: Reveals significant challenges in long-audio understanding, underscoring the urgent need for new approaches to achieve robust, document-level audio reasoning in Audio LLMs.

Abstract: Although Audio Large Language Models (ALLMs) have witnessed substantial advancements, their long audio understanding capabilities remain unexplored. A plethora of benchmarks have been proposed for general audio tasks, they predominantly focus on short-form clips, leaving without a consensus on evaluating ALLMs over extended durations. This paper proposes ChronosAudio, the first multi-task benchmark tailored for long-audio understanding in ALLMs. It encompasses six major task categories and comprises 36,000 test instances totaling over 200 hours audio, stratified into short, middle, and long-form categories to comprehensively evaluate length generalization. Extensive experiments on 16 state-of-the-art models using ChronosAudio yield three critical findings: 1.Precipitous Long-Context Collapse: ALLMs exhibit a severe inability to sustain performance, with the transition from short to long contexts triggering a staggering performance degradation of over 90% in specific tasks. 2.Structural Attention Dilution: Performance degradation stems from a fundamental failure in maintaining temporal locality; attention mechanisms suffer from significant diffusion in later sequences. 3.Restorative Ceiling of Mitigation: Current strategies only offer 50% recovery. These findings reveal significant challenges in long-audio, underscoring the urgent need for approaches to achieve robust, document-level audio reasoning.

[414] Leveraging Prediction Entropy for Automatic Prompt Weighting in Zero-Shot Audio-Language Classification

Karim El Khoury, Maxime Zanella, Tiffanie Godelaine, Christophe De Vleeschouwer, Benoit Macq

Main category: cs.SD

TL;DR: Entropy-guided prompt weighting improves zero-shot audio classification by minimizing prediction entropy to find optimal prompt combinations without labeled data.

Details

Motivation: Audio-language models show strong zero-shot capabilities but are highly sensitive to text prompt wording, with small variations causing large accuracy fluctuations. Existing solutions like prompt learning require annotated data, while prompt ensembling doesn't account for potentially harmful prompts.

Method: Proposes an entropy-guided prompt weighting approach that formulates an objective function to minimize prediction entropy, using low entropy as a proxy for high confidence. The method finds robust combinations of prompt contributions and can be applied to individual samples or batches without additional labels.

Result: Experiments on five audio classification datasets (environmental, urban, and vocal sounds) show consistent gains over classical prompt ensembling methods in zero-shot settings, with accuracy improvements 5-times larger across the benchmark.

Conclusion: The entropy-guided prompt weighting approach effectively addresses prompt sensitivity in audio-language models, providing robust performance improvements without requiring labeled data or significant computational overhead.

Abstract: Audio-language models have recently demonstrated strong zero-shot capabilities by leveraging natural-language supervision to classify audio events without labeled training data. Yet, their performance is highly sensitive to the wording of text prompts, with small variations leading to large fluctuations in accuracy. Prior work has mitigated this issue through prompt learning or prompt ensembling. However, these strategies either require annotated data or fail to account for the fact that some prompts may negatively impact performance. In this work, we present an entropy-guided prompt weighting approach that aims to find a robust combination of prompt contributions to maximize prediction confidence. To this end, we formulate a tailored objective function that minimizes prediction entropy to yield new prompt weights, utilizing low-entropy as a proxy for high confidence. Our approach can be applied to individual samples or a batch of audio samples, requiring no additional labels and incurring negligible computational overhead. Experiments on five audio classification datasets covering environmental, urban, and vocal sounds, demonstrate consistent gains compared to classical prompt ensembling methods in a zero-shot setting, with accuracy improvements 5-times larger across the whole benchmark.

[415] VCB Bench: An Evaluation Benchmark for Audio-Grounded Large Language Model Conversational Agents

Jiliang Hu, Wenfu Wang, Zuchao Li, Chenxing Li, Yiyang Zhao, Hanzhao Li, Liqiang Zhang, Meng Yu, Dong Yu

Main category: cs.SD

TL;DR: VCB Bench is a Chinese speech benchmark for evaluating large audio language models using real human speech across instruction following, knowledge understanding, and robustness dimensions.

Details

Motivation: Existing benchmarks for large audio language models are limited: they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions.

Method: VCB Bench is built entirely on real human speech and evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits).

Result: Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement.

Conclusion: VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.

Abstract: Recent advances in large audio language models (LALMs) have greatly enhanced multimodal conversational systems. However, existing benchmarks remain limited – they are mainly English-centric, rely on synthetic speech, and lack comprehensive, discriminative evaluation across multiple dimensions. To address these gaps, we present Voice Chat Bot Bench (VCB Bench) – a high-quality Chinese benchmark built entirely on real human speech. VCB Bench evaluates LALMs from three complementary perspectives: instruction following (including speech-level control beyond text commands), knowledge understanding (general knowledge, reasoning, and daily dialogue), and robustness (stability under perturbations in content, environment, and speaker traits). Experiments on representative LALMs reveal notable performance gaps and highlight future directions for improvement. VCB Bench provides a reproducible and fine-grained evaluation framework, offering standardized methodology and practical insights for advancing Chinese voice conversational models.

[416] IndexTTS 2.5 Technical Report

Yunpei Li, Xun Zhou, Jinchao Wang, Lu Wang, Yong Wu, Siyi Zhou, Yiquan Zhou, Jingchen Shu

Main category: cs.SD

TL;DR: IndexTTS 2.5 enhances the zero-shot TTS foundation model with 4 key improvements: semantic codec compression (25Hz), Zipformer architecture, multilingual extension strategies, and RL optimization, achieving 2.28× faster inference while maintaining quality.

Details

Motivation: To improve upon IndexTTS 2 by enhancing multilingual coverage, inference speed, and overall synthesis quality while maintaining zero-shot emotional TTS capabilities across languages.

Method: Four key improvements: 1) Semantic codec compression from 50Hz to 25Hz, 2) Replacing U-DiT with Zipformer architecture, 3) Three cross-lingual strategies (boundary-aware alignment, token-level concatenation, instruction-guided generation), 4) GRPO reinforcement learning for T2S module.

Result: Achieves 2.28× improvement in real-time factor (RTF) while maintaining comparable WER and speaker similarity to IndexTTS 2. Supports Chinese, English, Japanese, Spanish with robust emotion transfer without target-language emotional training data.

Conclusion: IndexTTS 2.5 successfully enhances multilingual coverage, inference speed, and synthesis quality while maintaining zero-shot emotional TTS capabilities, establishing practical design principles for multilingual emotional TTS systems.

Abstract: In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building upon this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving notable parameter reduction and faster mel-spectrogram generation; 3) Multilingual Extension: We propose three explicit cross-lingual modeling strategies, boundary-aware alignment, token-level concatenation, and instruction-guided generation, establishing practical design principles for zero-shot multilingual emotional TTS that supports Chinese, English, Japanese, and Spanish, and enables robust emotion transfer even without target-language emotional training data; 4) Reinforcement Learning Optimization: we apply GRPO in post-training of the T2S module, improving pronunciation accuracy and natrualness. Experiments show that IndexTTS 2.5 not only supports broader language coverage but also replicates emotional prosody in unseen languages under the same zero-shot setting. IndexTTS 2.5 achieves a 2.28 times improvement in RTF while maintaining comparable WER and speaker similarity to IndexTTS 2.

[417] Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control

Changhao Jiang, Jiahao Chen, Zhenghao Xiang, Zhixiong Yang, Hanchen Wang, Jiabao Zhuang, Xinmeng Che, Jiajun Sun, Hui Li, Yifei Cao, Shihan Dou, Ming Zhang, Junjie Ye, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

Main category: cs.SD

TL;DR: Open-source system for long-form song generation with style conditioning, including synthetic dataset, training pipeline, and Muse model that achieves competitive performance despite modest scale.

Details

Motivation: Commercial systems like Suno show strong song generation capabilities but academic research is non-reproducible due to lack of public training data, hindering fair comparison and progress.

Method: Release fully open-source system with: 1) 116k licensed synthetic songs with auto-generated lyrics and style descriptions paired with SunoV5 audio, 2) Muse model trained via single-stage supervised finetuning of Qwen-based LM extended with MuCodec audio tokens, without task-specific losses or additional components.

Result: Muse achieves competitive performance on phoneme error rate, text-music style similarity, and audio aesthetic quality despite modest data scale and model size, enabling controllable segment-level generation across different musical structures.

Conclusion: All data, model weights, and pipelines will be publicly released to enable reproducible research and continued progress in controllable long-form song generation.

Abstract: Recent commercial systems such as Suno demonstrate strong capabilities in long-form song generation, while academic research remains largely non-reproducible due to the lack of publicly available training data, hindering fair comparison and progress. To this end, we release a fully open-source system for long-form song generation with fine-grained style conditioning, including a licensed synthetic dataset, training and evaluation pipelines, and Muse, an easy-to-deploy song generation model. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions paired with audio synthesized by SunoV5. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens using MuCodec, without task-specific losses, auxiliary objectives, or additional architectural components. Our evaluations find that although Muse is trained with a modest data scale and model size, it achieves competitive performance on phoneme error rate, text–music style similarity, and audio aesthetic quality, while enabling controllable segment-level generation across different musical structures. All data, model weights, and training and evaluation pipelines will be publicly released, paving the way for continued progress in controllable long-form song generation research. The project repository is available at https://github.com/yuhui1038/Muse.

cs.LG

[418] The Forgotten Shield: Safety Grafting in Parameter-Space for Medical MLLMs

Jiale Zhao, Xing Mou, Jinlin Wu, Hongyuan Yu, Mingrui Sun, Yang Shi, Xuanwu Yin, Zhen Chen, Zhen Lei, Yaohua Wang

Main category: cs.LG

TL;DR: Medical MLLMs have safety vulnerabilities, especially against cross-modality jailbreak attacks, and suffer from catastrophic forgetting during medical fine-tuning. The paper proposes Parameter-Space Intervention to extract and inject safety knowledge from base models during medical capability construction.

Details

Motivation: Medical MLLMs have achieved remarkable progress but lack sufficient safety research, posing risks for real-world deployment. Current models show pervasive vulnerabilities in both general and medical-specific safety dimensions, with particular fragility against cross-modality jailbreak attacks.

Method: 1) Establish multidimensional evaluation framework to benchmark Medical MLLM safety; 2) Propose Parameter-Space Intervention approach that extracts intrinsic safety knowledge representations from original base models and injects them into target models during medical capability construction; 3) Design fine-grained parameter search algorithm to optimize safety-performance trade-off.

Result: Empirical analysis reveals pervasive safety vulnerabilities in existing Medical MLLMs. The proposed approach significantly bolsters safety guardrails without requiring additional domain-specific safety data, while minimizing degradation to core medical performance.

Conclusion: Medical MLLMs have significant safety vulnerabilities that need addressing. The Parameter-Space Intervention method provides an effective solution for safety re-alignment that maintains medical performance, offering a practical approach for safer deployment of medical AI systems.

Abstract: Medical Multimodal Large Language Models (Medical MLLMs) have achieved remarkable progress in specialized medical tasks; however, research into their safety has lagged, posing potential risks for real-world deployment. In this paper, we first establish a multidimensional evaluation framework to systematically benchmark the safety of current SOTA Medical MLLMs. Our empirical analysis reveals pervasive vulnerabilities across both general and medical-specific safety dimensions in existing models, particularly highlighting their fragility against cross-modality jailbreak attacks. Furthermore, we find that the medical fine-tuning process frequently induces catastrophic forgetting of the model’s original safety alignment. To address this challenge, we propose a novel “Parameter-Space Intervention” approach for efficient safety re-alignment. This method extracts intrinsic safety knowledge representations from original base models and concurrently injects them into the target model during the construction of medical capabilities. Additionally, we design a fine-grained parameter search algorithm to achieve an optimal trade-off between safety and medical performance. Experimental results demonstrate that our approach significantly bolsters the safety guardrails of Medical MLLMs without relying on additional domain-specific safety data, while minimizing degradation to core medical performance.

[419] Green MLOps: Closed-Loop, Energy-Aware Inference with NVIDIA Triton, FastAPI, and Bio-Inspired Thresholding

Mustapha Hamdi, Mourad Jabou

Main category: cs.LG

TL;DR: Bio-inspired framework maps protein-folding energy basins to inference cost landscapes, using decaying threshold to admit requests only when utility-to-energy trade-off is favorable, reducing processing time by 42% with minimal accuracy loss.

Details

Motivation: Energy efficiency is critical for AI deployment since long-running inference can exceed training in cumulative carbon impact. Need practical solutions for green MLOps.

Method: Bio-inspired framework that maps protein-folding energy basins to inference cost landscapes, controls execution via decaying closed-loop threshold. Requests admitted only when expected utility-to-energy trade-off is favorable (high confidence/utility at low marginal energy and congestion). Biases operation toward first acceptable local basin rather than costly global minima.

Result: Bio-controller reduces processing time by 42% compared to standard open-loop execution (0.50s vs 0.29s on A100 test set) with minimal accuracy degradation (<0.5%). Established efficiency boundaries between lightweight local serving (ORT) and managed batching (Triton).

Conclusion: Connects biophysical energy models to Green MLOps, offers practical, auditable basis for closed-loop energy-aware inference in production. Bio-inspired approach effectively balances accuracy and energy efficiency.

Abstract: Energy efficiency is a first-order concern in AI deployment, as long-running inference can exceed training in cumulative carbon impact. We propose a bio-inspired framework that maps protein-folding energy basins to inference cost landscapes and controls execution via a decaying, closed-loop threshold. A request is admitted only when the expected utility-to-energy trade-off is favorable (high confidence/utility at low marginal energy and congestion), biasing operation toward the first acceptable local basin rather than pursuing costly global minima. We evaluate DistilBERT and ResNet-18 served through FastAPI with ONNX Runtime and NVIDIA Triton on an RTX 4000 Ada GPU. Our ablation study reveals that the bio-controller reduces processing time by 42% compared to standard open-loop execution (0.50s vs 0.29s on A100 test set), with a minimal accuracy degradation (<0.5%). Furthermore, we establish the efficiency boundaries between lightweight local serving (ORT) and managed batching (Triton). The results connect biophysical energy models to Green MLOps and offer a practical, auditable basis for closed-loop energy-aware inference in production.

[420] Safety-Utility Conflicts Are Not Global: Surgical Alignment via Head-Level Diagnosis

Wang Cai, Yilin Wen, Jinchang Hou, Du Su, Guoqiu Wang, Zhonghou Lv, Chenfu Bao, Yunfang Wu

Main category: cs.LG

TL;DR: CAST framework uses head-level diagnosis and sparse fine-tuning to selectively update parameters, avoiding high-conflict attention heads to improve safety-utility trade-off in LLMs.

Details

Motivation: Existing safety alignment methods use global gradient geometry but overlook modular heterogeneity in Transformers, where functional sensitivity and conflict vary across attention heads, leading to suboptimal trade-offs by indiscriminately updating utility-sensitive heads with intense gradient conflicts.

Method: CAST constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, then guides selective parameter updates through sparse fine-tuning that skips high-conflict attention heads during training.

Result: Experiments show alignment conflicts are not uniformly distributed, and general capability degradation mainly comes from updating a small group of high-conflict heads. Skipping these heads significantly reduces capability loss without compromising safety.

Conclusion: CAST offers an interpretable and parameter-efficient approach to improving safety-utility trade-off by addressing modular heterogeneity in Transformers through conflict-aware sparse tuning.

Abstract: Safety alignment in Large Language Models (LLMs) inherently presents a multi-objective optimization conflict, often accompanied by an unintended degradation of general capabilities. Existing mitigation strategies typically rely on global gradient geometry to resolve these conflicts, yet they overlook Modular Heterogeneity within Transformers, specifically that the functional sensitivity and degree of conflict vary substantially across different attention heads. Such global approaches impose uniform update rules across all parameters, often resulting in suboptimal trade-offs by indiscriminately updating utility sensitive heads that exhibit intense gradient conflicts. To address this limitation, we propose Conflict-Aware Sparse Tuning (CAST), a framework that integrates head-level diagnosis with sparse fine-tuning. CAST first constructs a pre-alignment conflict map by synthesizing Optimization Conflict and Functional Sensitivity, which then guides the selective update of parameters. Experiments reveal that alignment conflicts in LLMs are not uniformly distributed. We find that the drop in general capabilities mainly comes from updating a small group of ``high-conflict’’ heads. By simply skipping these heads during training, we significantly reduce this loss without compromising safety, offering an interpretable and parameter-efficient approach to improving the safety-utility trade-off.

[421] Learning to Reason: Temporal Saliency Distillation for Interpretable Knowledge Transfer

Nilushika Udayangani Hewa Dehigahawattage, Kishor Nandakishor, Marimuthu Palaniswami

Main category: cs.LG

TL;DR: Proposes Temporal Saliency Distillation (TSD) for time series knowledge distillation, transferring interpretable temporal importance patterns from teacher to student models instead of just logits/features.

Details

Motivation: Current knowledge distillation methods for time series (adapted from computer vision) have two key limitations: 1) they're uninterpretable - unclear how transferred knowledge helps student learning, and 2) they transfer limited knowledge, mainly replicating teacher accuracy. This leads to student predictive distributions differing significantly from teachers, preventing safe substitution.

Method: Extends conventional logit transfer to convey teacher’s reasoning via temporal saliency - importance of each input timestep to teacher predictions. Temporal Saliency Distillation trains students to make predictions based on same input features as teachers, requiring no additional parameters or architecture assumptions.

Result: Temporal Saliency Distillation effectively improves baseline method performance while achieving desirable properties beyond predictive accuracy, establishing interpretable knowledge distillation for time series.

Conclusion: Proposes a new paradigm for interpretable knowledge distillation in time series analysis by transferring temporal saliency patterns, enabling students to learn not just what teachers predict but how they reason about temporal data.

Abstract: Knowledge distillation has proven effective for model compression by transferring knowledge from a larger network called the teacher to a smaller network called the student. Current knowledge distillation in time series is predominantly based on logit and feature aligning techniques originally developed for computer vision tasks. These methods do not explicitly account for temporal data and fall short in two key aspects. First, the mechanisms by which the transferred knowledge helps the student model learning process remain unclear due to uninterpretability of logits and features. Second, these methods transfer only limited knowledge, primarily replicating the teacher predictive accuracy. As a result, student models often produce predictive distributions that differ significantly from those of their teachers, hindering their safe substitution for teacher models. In this work, we propose transferring interpretable knowledge by extending conventional logit transfer to convey not just the right prediction but also the right reasoning of the teacher. Specifically, we induce other useful knowledge from the teacher logits termed temporal saliency which captures the importance of each input timestep to the teacher prediction. By training the student with Temporal Saliency Distillation we encourage it to make predictions based on the same input features as the teacher. Temporal Saliency Distillation requires no additional parameters or architecture specific assumptions. We demonstrate that Temporal Saliency Distillation effectively improves the performance of baseline methods while also achieving desirable properties beyond predictive accuracy. We hope our work establishes a new paradigm for interpretable knowledge distillation in time series analysis.

[422] MemKD: Memory-Discrepancy Knowledge Distillation for Efficient Time Series Classification

Nilushika Udayangani, Kishor Nandakishor, Marimuthu Palaniswami

Main category: cs.LG

TL;DR: MemKD is a novel knowledge distillation framework that addresses temporal dependencies in time series models by capturing memory retention discrepancies between teacher and student models, enabling 500x parameter reduction while maintaining performance.

Details

Motivation: Current knowledge distillation methods designed for computer vision tasks fail to address the unique temporal dependencies and memory retention characteristics of time series models like RNNs and LSTMs, making them unsuitable for deploying compact models in resource-constrained environments.

Method: Proposes Memory-Discrepancy Knowledge Distillation (MemKD) framework with a specialized loss function that captures memory retention discrepancies between teacher and student models across subsequences within time series data, ensuring the student effectively mimics the teacher’s temporal behavior.

Result: MemKD significantly outperforms state-of-the-art KD methods, reducing parameter size and memory usage by approximately 500 times while maintaining comparable performance to the teacher model.

Conclusion: MemKD enables the development of compact, high-performing recurrent neural networks suitable for real-time time series analysis in resource-constrained environments like wearable devices and edge computing platforms.

Abstract: Deep learning models, particularly recurrent neural networks and their variants, such as long short-term memory, have significantly advanced time series data analysis. These models capture complex, sequential patterns in time series, enabling real-time assessments. However, their high computational complexity and large model sizes pose challenges for deployment in resource-constrained environments, such as wearable devices and edge computing platforms. Knowledge Distillation (KD) offers a solution by transferring knowledge from a large, complex model (teacher) to a smaller, more efficient model (student), thereby retaining high performance while reducing computational demands. Current KD methods, originally designed for computer vision tasks, neglect the unique temporal dependencies and memory retention characteristics of time series models. To this end, we propose a novel KD framework termed Memory-Discrepancy Knowledge Distillation (MemKD). MemKD leverages a specialized loss function to capture memory retention discrepancies between the teacher and student models across subsequences within time series data, ensuring that the student model effectively mimics the teacher model’s behaviour. This approach facilitates the development of compact, high-performing recurrent neural networks suitable for real-time, time series analysis tasks. Our extensive experiments demonstrate that MemKD significantly outperforms state-of-the-art KD methods. It reduces parameter size and memory usage by approximately 500 times while maintaining comparable performance to the teacher model.

[423] Making Tunable Parameters State-Dependent in Weather and Climate Models with Reinforcement Learning

Pritthijit Nath, Sebastian Schemm, Henry Moss, Peter Haynes, Emily Shuckburgh, Mark J. Webb

Main category: cs.LG

TL;DR: RL framework learns weather/climate parametrizations online, outperforming static tuning across testbeds, with TQC/DDPG/TD3 performing best and federated RL enabling specialized control.

Details

Motivation: Traditional parametrization schemes use fixed coefficients that are weakly constrained and tuned offline, leading to persistent biases and limited adaptability to underlying physics.

Method: Uses reinforcement learning to learn parametrization components online as function of evolving model state across three testbeds: simple climate bias correction, radiative-convective equilibrium, and zonal mean energy balance model, with both single-agent and federated multi-agent settings.

Result: TQC, DDPG, and TD3 achieved highest skill and most stable convergence; single-agent RL outperformed static tuning in EBM, while federated RL enabled geographically specialized control and faster convergence; six-agent DDPG with frequent aggregation yielded lowest RMSE.

Conclusion: RL delivers skillful state-dependent, regime-aware parametrizations, offering a scalable pathway for online learning within numerical weather and climate models.

Abstract: Weather and climate models rely on parametrisations to represent unresolved sub-grid processes. Traditional schemes rely on fixed coefficients that are weakly constrained and tuned offline, contributing to persistent biases that limit their ability to adapt to the underlying physics. This study presents a framework that learns components of parametrisation schemes online as a function of the evolving model state using reinforcement learning (RL) and evaluates the resulting RL-driven parameter updates across a hierarchy of idealised testbeds spanning a simple climate bias correction (SCBC), a radiative-convective equilibrium (RCE), and a zonal mean energy balance model (EBM) with both single-agent and federated multi-agent settings. Across nine RL algorithms, Truncated Quantile Critics (TQC), Deep Deterministic Policy Gradient (DDPG), and Twin Delayed DDPG (TD3) achieved the highest skill and the most stable convergence across configurations, with performance assessed against a static baseline using area-weighted RMSE, temperature profile and pressure-level diagnostics. For the EBM, single-agent RL outperformed static parameter tuning with the strongest gains in tropical and mid-latitude bands, while federated RL on multi-agent setups enabled geographically specialised control and faster convergence, with a six-agent DDPG configuration using frequent aggregation yielding the lowest area-weighted RMSE across the tropics and mid-latitudes. The learnt corrections were also physically meaningful as agents modulated EBM radiative parameters to reduce meridional biases, adjusted RCE lapse rates to match vertical temperature errors, and stabilised SCBC heating increments to limit drift. Overall, results highlight RL to deliver skilful state-dependent, and regime-aware parametrisations, offering a scalable pathway for online learning within numerical models.

[424] Predictable Gradient Manifolds in Deep Learning: Temporal Path-Length and Intrinsic Rank as a Complexity Regime

Anherutowa Calvo

Main category: cs.LG

TL;DR: Gradients in deep learning optimization are temporally predictable and low-dimensional, enabling new convergence guarantees based on measurable gradient structure rather than worst-case bounds.

Details

Motivation: Deep learning optimization exhibits structured gradient behavior not captured by worst-case analysis - gradients are often temporally predictable and evolve in low-dimensional subspaces, suggesting optimization occurs in a low-complexity temporal regime.

Method: Introduce two computable metrics: prediction-based path length (measures gradient forecastability from past info) and predictable rank (quantifies intrinsic temporal dimension of gradient increments). Use these to reformulate optimization guarantees and analyze gradient trajectories across various architectures.

Result: Gradient trajectories are locally predictable with strong low-rank structure across CNNs, vision transformers, language models, and synthetic tasks. These properties are stable across architectures and optimizers, and can be diagnosed using lightweight random projections.

Conclusion: Optimization in modern deep learning operates in a low-complexity temporal regime, providing a unifying framework for understanding optimization dynamics and suggesting new directions for adaptive optimizers, rank-aware tracking, and prediction-based algorithm design.

Abstract: Deep learning optimization exhibits structure that is not captured by worst-case gradient bounds. Empirically, gradients along training trajectories are often temporally predictable and evolve within a low-dimensional subspace. In this work we formalize this observation through a measurable framework for predictable gradient manifolds. We introduce two computable quantities: a prediction-based path length that measures how well gradients can be forecast from past information, and a predictable rank that quantifies the intrinsic temporal dimension of gradient increments. We show how classical online and nonconvex optimization guarantees can be restated so that convergence and regret depend explicitly on these quantities, rather than on worst-case variation. Across convolutional networks, vision transformers, language models, and synthetic control tasks, we find that gradient trajectories are locally predictable and exhibit strong low-rank structure over time. These properties are stable across architectures and optimizers, and can be diagnosed directly from logged gradients using lightweight random projections. Our results provide a unifying lens for understanding optimization dynamics in modern deep learning, reframing standard training as operating in a low-complexity temporal regime. This perspective suggests new directions for adaptive optimizers, rank-aware tracking, and prediction-based algorithm design grounded in measurable properties of real training runs.

[425] Quaternion-Hadamard Network: A Novel Defense Against Adversarial Attacks with a New Dataset

Vladimir Frants, Sos Agaian

Main category: cs.LG

TL;DR: QHNet is a purification-based defense that protects adverse-weather image restoration models from adversarial attacks by suppressing perturbations in transform and quaternion domains, achieving superior robustness and restoration quality.

Details

Motivation: Adverse-weather image restoration models (for rain, snow, haze) are highly vulnerable to gradient-based white-box adversarial attacks, where minimal perturbations cause substantial degradation in restored outputs, creating a security concern for low-level vision pipelines.

Method: QHNet uses a computationally efficient purification-based defense that precedes restoration networks. It incorporates Quaternion Hadamard Polynomial Denoising Block (QHPDB) and Quaternion Denoising Residual Block (QDRB) within an encoder-decoder framework to remove high-frequency adversarial noise while preserving structural details, operating in transform and quaternion domains.

Result: QHNet demonstrates superior restoration fidelity (measured by PSNR and SSIM) across rain, snow, and haze removal tasks, with significantly improved robustness against adaptive white-box attacks including PGD, BPDA, and EOT, outperforming state-of-the-art purification baselines.

Conclusion: QHNet effectively protects adverse-weather image restoration models from adversarial attacks while maintaining high restoration quality, confirming its effectiveness as a robust defense for low-level vision pipelines.

Abstract: Adverse-weather image restoration (e.g., rain, snow, haze) models remain highly vulnerable to gradient-based white-box adversarial attacks, wherein minimal loss-aligned perturbations cause substantial degradation in the restored output. This paper presents QHNet, a computationally efficient purification-based defense that precedes the restoration network and targets perturbation suppression in the transform and quaternion domains. QHNet incorporates a Quaternion Hadamard Polynomial Denoising Block (QHPDB) and a Quaternion Denoising Residual Block (QDRB) within an encoder-decoder framework to remove high-frequency adversarial noise while preserving fine structural details. Robustness is evaluated using PSNR and SSIM across rain, snow, and haze removal tasks, and further validated under adaptive, defense-aware white-box attacks employing Projected Gradient Descent (PGD), Backward Pass Differentiable Approximation (BPDA), and Expectation Over Transformation (EOT). Experimental results demonstrate that QHNet delivers superior restoration fidelity and significantly improved robustness compared to state-of-the-art purification baselines, confirming its effectiveness for low-level vision pipelines.

[426] Unlocking the Pre-Trained Model as a Dual-Alignment Calibrator for Post-Trained LLMs

Beier Luo, Cheng Wang, Hongxin Wei, Sharon Li, Xuefeng Du

Main category: cs.LG

TL;DR: Post-training worsens LLM calibration, causing overconfidence. Dual-Align fixes this via dual alignment: confidence alignment corrects final output drift, and process alignment fixes intermediate inference divergence, using a single temperature parameter.

Details

Motivation: Post-training improves LLM performance but systematically worsens confidence calibration, making models overconfident. Existing unsupervised methods only do static output-distribution matching, ignoring inference-time dynamics where calibration errors arise from two regimes: confidence drift and process drift.

Method: Dual-Align: an unsupervised post-hoc framework with dual alignment. 1) Confidence alignment corrects confidence drift via final-distribution matching. 2) Process alignment addresses process drift by locating the layer where inference trajectories diverge and realigning stability of subsequent inference. Uses a single temperature parameter to correct both drift types.

Result: Experiments show consistent improvements over baselines, reducing calibration errors and approaching supervised oracle performance, without sacrificing post-training performance gains.

Conclusion: Calibration errors in post-trained LLMs stem from both confidence drift and process drift. Dual-Align effectively addresses both through dual alignment, improving calibration while preserving performance improvements from post-training.

Abstract: Post-training improves large language models (LLMs) but often worsens confidence calibration, leading to systematic overconfidence. Recent unsupervised post-hoc methods for post-trained LMs (PoLMs) mitigate this by aligning PoLM confidence to that of well-calibrated pre-trained counterparts. However, framing calibration as static output-distribution matching overlooks the inference-time dynamics introduced by post-training. In particular, we show that calibration errors arise from two regimes: (i) confidence drift, where final confidence inflates despite largely consistent intermediate decision processes, and (ii) process drift, where intermediate inference pathways diverge. Guided by this diagnosis, we propose Dual-Align, an unsupervised post-hoc framework for dual alignment in confidence calibration. Dual-Align performs confidence alignment to correct confidence drift via final-distribution matching, and introduces process alignment to address process drift by locating the layer where trajectories diverge and realigning the stability of subsequent inference. This dual strategy learns a single temperature parameter that corrects both drift types without sacrificing post-training performance gains. Experiments show consistent improvements over baselines, reducing calibration errors and approaching a supervised oracle.

[427] Online Action-Stacking Improves Reinforcement Learning Performance for Air Traffic Control

Ben Carvell, George De Ath, Eseoghene Benjamin, Richard Everson

Main category: cs.LG

TL;DR: Online action-stacking is an inference-time wrapper that compiles simple RL policy outputs into realistic air traffic control commands, enabling training with small action spaces while producing operational clearances.

Details

Motivation: There's a gap between standard RL formulations (with small discrete action spaces) and operational ATC requirements (needing complex compound clearances). Training directly with large action spaces is challenging.

Method: Train policies with simple incremental actions (heading/level adjustments) plus action-damping penalty. At inference, use online action-stacking to compile primitive action bursts into domain-appropriate compound clearances. Use PPO on BluebirdDT platform.

Result: Action stacking greatly reduces instruction count vs damped baseline and achieves comparable performance to 37-action policy using only 5 actions. Successfully navigates aircraft along routes, manages climb/descent, and performs collision avoidance.

Conclusion: Online action-stacking bridges the gap between standard RL and operational ATC, providing a simple mechanism for scaling to complex control scenarios while maintaining realistic command generation.

Abstract: We introduce online action-stacking, an inference-time wrapper for reinforcement learning policies that produces realistic air traffic control commands while allowing training on a much smaller discrete action space. Policies are trained with simple incremental heading or level adjustments, together with an action-damping penalty that reduces instruction frequency and leads agents to issue commands in short bursts. At inference, online action-stacking compiles these bursts of primitive actions into domain-appropriate compound clearances. Using Proximal Policy Optimisation and the BluebirdDT digital twin platform, we train agents to navigate aircraft along lateral routes, manage climb and descent to target flight levels, and perform two-aircraft collision avoidance under a minimum separation constraint. In our lateral navigation experiments, action stacking greatly reduces the number of issued instructions relative to a damped baseline and achieves comparable performance to a policy trained with a 37-dimensional action space, despite operating with only five actions. These results indicate that online action-stacking helps bridge a key gap between standard reinforcement learning formulations and operational ATC requirements, and provides a simple mechanism for scaling to more complex control scenarios.

[428] Generation of synthetic delay time series for air transport applications

Pau Esteve, Massimiliano Zanin

Main category: cs.LG

TL;DR: This paper compares three models for generating realistic synthetic time series of airport delays, finding that a simplified Genetic Algorithm approach produces highly realistic and variable delay data that can be used for delay propagation analysis.

Details

Motivation: The paper addresses the need for synthetic data generation in air transport to solve problems like data scarcity and privacy concerns, while enabling research on airport delay patterns and propagation.

Method: The researchers compare three models: two based on state-of-the-art Deep Learning algorithms and one simplified Genetic Algorithm approach, using large collections of airport operations data from Europe and the US to generate synthetic delay time series.

Result: The simplified Genetic Algorithm approach generates time series that are almost indistinguishable from real delay data while maintaining high variability. The synthetic data is validated in a delay propagation detection problem between airports.

Conclusion: The Genetic Algorithm approach effectively generates realistic synthetic airport delay time series, and the resulting data is made publicly available to support further research in air transport analytics.

Abstract: The generation of synthetic data is receiving increasing attention from the scientific community, thanks to its ability to solve problems like data scarcity and privacy, and is starting to find applications in air transport. We here tackle the problem of generating synthetic, yet realistic, time series of delays at airports, starting from large collections of operations in Europe and the US. We specifically compare three models, two of them based on state of the art Deep Learning algorithms, and one simplified Genetic Algorithm approach. We show how the latter can generate time series that are almost indistinguishable from real ones, while maintaining a high variability. We further validate the resulting time series in a problem of detecting delay propagations between airports. We finally make the synthetic data available to the scientific community.

[429] LEGATO: Good Identity Unlearning Is Continuous

Qiang Chen, Chun-Wun Cheng, Xiu Su, Hongyan Xu, Xi Lin, Shan You, Angelica I. Aviles-Rivero, Yi Chen

Main category: cs.LG

TL;DR: LEGATO introduces a continuous trajectory approach using Neural ODE adapters for efficient, controllable identity unlearning in generative models, avoiding catastrophic collapse while reducing fine-tuned parameters.

Details

Motivation: Existing machine unlearning methods for generative models are inefficient (require full model fine-tuning), lack controllability over forgetting intensity, and suffer from catastrophic collapse where model retention capability degrades drastically during forgetting.

Method: LEGATO models identity forgetting as a continuous trajectory using Neural ODE adapters attached to pre-trained generators. It keeps original model weights frozen while fine-tuning lightweight adapters, allowing precise control of forgetting intensity via ODE step size. Trajectory consistency constraints prevent catastrophic collapse.

Result: Extensive experiments across in-domain and out-of-domain identity unlearning benchmarks show LEGATO achieves state-of-the-art forgetting performance, avoids catastrophic collapse, and reduces fine-tuned parameters compared to existing methods.

Conclusion: LEGATO provides an efficient, controllable, and stable approach to identity unlearning in generative models by treating forgetting as a continuous trajectory with Neural ODE adapters, addressing key limitations of existing discrete unlearning methods.

Abstract: Machine unlearning has become a crucial role in enabling generative models trained on large datasets to remove sensitive, private, or copyright-protected data. However, existing machine unlearning methods face three challenges in learning to forget identity of generative models: 1) inefficient, where identity erasure requires fine-tuning all the model’s parameters; 2) limited controllability, where forgetting intensity cannot be controlled and explainability is lacking; 3) catastrophic collapse, where the model’s retention capability undergoes drastic degradation as forgetting progresses. Forgetting has typically been handled through discrete and unstable updates, often requiring full-model fine-tuning and leading to catastrophic collapse. In this work, we argue that identity forgetting should be modeled as a continuous trajectory, and introduce LEGATO - Learn to ForgEt Identity in GenerAtive Models via Trajectory-consistent Neural Ordinary Differential Equations. LEGATO augments pre-trained generators with fine-tunable lightweight Neural ODE adapters, enabling smooth, controllable forgetting while keeping the original model weights frozen. This formulation allows forgetting intensity to be precisely modulated via ODE step size, offering interpretability and robustness. To further ensure stability, we introduce trajectory consistency constraints that explicitly prevent catastrophic collapse during unlearning. Extensive experiments across in-domain and out-of-domain identity unlearning benchmarks show that LEGATO achieves state-of-the-art forgetting performance, avoids catastrophic collapse and reduces fine-tuned parameters.

[430] Density Matrix RNN (DM-RNN): A Quantum Information Theoretic Framework for Modeling Musical Context and Polyphony

Joonwon Seo, Mariana Montiel

Main category: cs.LG

TL;DR: The paper proposes DM-RNN, a novel RNN architecture using density matrices and quantum channels to model musical ambiguity, replacing deterministic hidden states with statistical ensembles of interpretations.

Details

Motivation: Classical RNNs fail to capture musical ambiguity because they compress context into deterministic hidden states, creating an information bottleneck. Music inherently contains multiple valid interpretations that need statistical representation.

Method: Uses density matrices to represent mixed states (statistical ensembles of interpretations), quantum channels (CPTP maps) for temporal dynamics, Choi-Jamiolkowski isomorphism for parameterization ensuring physical validity, and quantum information measures for analysis.

Result: The DM-RNN provides a mathematically rigorous framework that can capture both classical probabilities and quantum coherences in music, with built-in physical validity through CPTP constraints.

Conclusion: The proposed architecture offers a principled approach to modeling ambiguous musical structures using quantum-inspired mathematics, enabling better representation of musical uncertainty and voice entanglement.

Abstract: Classical Recurrent Neural Networks (RNNs) summarize musical context into a deterministic hidden state vector, imposing an information bottleneck that fails to capture the inherent ambiguity in music. We propose the Density Matrix RNN (DM-RNN), a novel theoretical architecture utilizing the Density Matrix. This allows the model to maintain a statistical ensemble of musical interpretations (a mixed state), capturing both classical probabilities and quantum coherences. We rigorously define the temporal dynamics using Quantum Channels (CPTP maps). Crucially, we detail a parameterization strategy based on the Choi-Jamiolkowski isomorphism, ensuring the learned dynamics remain physically valid (CPTP) by construction. We introduce an analytical framework using Von Neumann Entropy to quantify musical uncertainty and Quantum Mutual Information (QMI) to measure entanglement between voices. The DM-RNN provides a mathematically rigorous framework for modeling complex, ambiguous musical structures.

[431] Mitigating Position-Shift Failures in Text-Based Modular Arithmetic via Position Curriculum and Template Diversity

Nikolay Yudin

Main category: cs.LG

TL;DR: Transformers trained on modular addition fail catastrophically under input format variations despite high in-distribution accuracy. A training recipe with boundary markers, position curriculum, diverse templates, and consistency training improves robustness.

Details

Motivation: The paper addresses a critical gap in evaluating neural networks: while models may achieve high in-distribution accuracy, they often fail catastrophically under input format variations like position shifts or different natural-language templates. This reveals an under-emphasized failure mode in procedural generalization.

Method: The authors study character-level Transformers trained on modular addition from text. They use a disjoint-pair split over all ordered pairs for p=97. They introduce a training recipe combining: (1) explicit expression boundary markers, (2) position curriculum broadening absolute position range, (3) diverse template mixtures, and (4) consistency training across multiple variants per example.

Result: Baseline models achieve strong in-distribution performance but collapse under position shift and template OOD. The proposed intervention substantially improves robustness to both position shift and template OOD while maintaining high in-distribution accuracy. An ALiBi-style ablation fails to learn the task under their setup.

Conclusion: Procedural generalization under noisy supervision benefits from explicitly training invariances absent from the data distribution. The paper provides a reproducible evaluation protocol and artifacts for studying robustness to input format variations.

Abstract: Building on insights from the grokking literature, we study character-level Transformers trained to compute modular addition from text, and focus on robustness under input-format variation rather than only in-distribution accuracy. We identify a previously under-emphasized failure mode: models that achieve high in-distribution accuracy can fail catastrophically when the same expression is shifted to different absolute character positions (“position shift”) or presented under out-of-distribution natural-language templates. Using a disjoint-pair split over all ordered pairs for p=97, we show that a baseline model reaches strong in-distribution performance yet collapses under position shift and template OOD. We then introduce a simple training recipe that combines (i) explicit expression boundary markers, (ii) position curriculum that broadens the range of absolute positions seen during training, (iii) diverse template mixtures, and (iv) consistency training across multiple variants per example. Across three seeds, this intervention substantially improves robustness to position shift and template OOD while maintaining high in-distribution accuracy, whereas an ALiBi-style ablation fails to learn the task under our setup. Our results suggest that steering procedural generalization under noisy supervision benefits from explicitly training invariances that are otherwise absent from the data distribution, and we provide a reproducible evaluation protocol and artifacts.

[432] Enhancing Robustness of Asynchronous EEG-Based Movement Prediction using Classifier Ensembles

Niklas Kueper, Kartik Chari, Elsa Andrea Kirchner

Main category: cs.LG

TL;DR: Classifier ensembles with sliding-window postprocessing improve asynchronous EEG-based movement intention detection for robot-assisted stroke rehabilitation.

Details

Motivation: Stroke rehabilitation needs better methods for detecting patient movement intentions to trigger robotic assistance. EEG signals can detect these intentions, but asynchronous online classification is challenging and requires more robust methods.

Method: Analyzed EEG datasets from 14 healthy subjects performing self-initiated arm movements. Compared ensemble combinations of SVM, MLP, and EEGNet classifiers with offline and pseudo-online evaluations using sliding-window postprocessing.

Result: Classifier ensembles significantly outperformed single models in pseudo-online evaluation with optimal postprocessing windows. Increased postprocessing windows improved single model performance, but no significant difference between best single model and ensembles in offline evaluation.

Conclusion: Classifier ensembles with appropriate postprocessing effectively enhance asynchronous movement intention detection from EEG, particularly improving online classification and reducing false detections.

Abstract: Objective: Stroke is one of the leading causes of disabilities. One promising approach is to extend the rehabilitation with self-initiated robot-assisted movement therapy. To enable this, it is required to detect the patient’s intention to move to trigger the assistance of a robotic device. This intention to move can be detected from human surface electroencephalography (EEG) signals; however, it is particularly challenging to decode when classifications are performed online and asynchronously. In this work, the effectiveness of classifier ensembles and a sliding-window postprocessing technique was investigated to enhance the robustness of such asynchronous classification. Approach: To investigate the effectiveness of classifier ensembles and a sliding-window postprocessing, two EEG datasets with 14 healthy subjects who performed self-initiated arm movements were analyzed. Offline and pseudo-online evaluations were conducted to compare ensemble combinations of the support vector machine (SVM), multilayer perceptron (MLP), and EEGNet classification models. Results: The results of the pseudo-online evaluation show that the two model ensembles significantly outperformed the best single model for the optimal number of postprocessing windows. In particular, for single models, an increased number of postprocessing windows significantly improved classification performances. Interestingly, we found no significant improvements between performances of the best single model and classifier ensembles in the offline evaluation. Significance: We demonstrated that classifier ensembles and appropriate postprocessing methods effectively enhance the asynchronous detection of movement intentions from EEG signals. In particular, the classifier ensemble approach yields greater improvements in online classification than in offline classification, and reduces false detections, i.e., early false positives.

[433] ArtCognition: A Multimodal AI Framework for Affective State Sensing from Visual and Kinematic Drawing Cues

Behrad Binaei-Haghighi, Nafiseh Sadat Sajadi, Mehrad Liviyan, Reyhane Akhavan Kharazi, Fatemeh Amirkhani, Behnam Bahrak

Main category: cs.LG

TL;DR: ArtCognition: A multimodal framework using digital drawing analysis (House-Tree-Person test) with computer vision and behavioral kinematics, enhanced by RAG for psychological interpretation.

Details

Motivation: Objective assessment of affective/psychological states is challenging, especially through non-verbal channels. Digital drawing offers an underexplored modality for affective sensing that could provide richer insights than traditional methods.

Method: Multimodal framework fusing: 1) Static visual features from final artwork using computer vision models, 2) Dynamic behavioral kinematic cues from drawing process (stroke speed, pauses, smoothness), 3) Retrieval-Augmented Generation (RAG) architecture to bridge low-level features with psychological knowledge.

Result: Fusion of visual and behavioral cues provides more nuanced assessment than single modalities. Significant correlations found between multimodal features and standardized psychological metrics, validating framework’s potential as scalable clinical tool.

Conclusion: ArtCognition contributes new methodology for non-intrusive affective state assessment, opens avenues for technology-assisted mental healthcare, and demonstrates value of multimodal digital drawing analysis with psychological grounding.

Abstract: The objective assessment of human affective and psychological states presents a significant challenge, particularly through non-verbal channels. This paper introduces digital drawing as a rich and underexplored modality for affective sensing. We present a novel multimodal framework, named ArtCognition, for the automated analysis of the House-Tree-Person (HTP) test, a widely used psychological instrument. ArtCognition uniquely fuses two distinct data streams: static visual features from the final artwork, captured by computer vision models, and dynamic behavioral kinematic cues derived from the drawing process itself, such as stroke speed, pauses, and smoothness. To bridge the gap between low-level features and high-level psychological interpretation, we employ a Retrieval-Augmented Generation (RAG) architecture. This grounds the analysis in established psychological knowledge, enhancing explainability and reducing the potential for model hallucination. Our results demonstrate that the fusion of visual and behavioral kinematic cues provides a more nuanced assessment than either modality alone. We show significant correlations between the extracted multimodal features and standardized psychological metrics, validating the framework’s potential as a scalable tool to support clinicians. This work contributes a new methodology for non-intrusive affective state assessment and opens new avenues for technology-assisted mental healthcare.

Pir Bakhsh Khokhar, Carmine Gravino, Fabio Palomba, Sule Yildrim Yayilgan, Sarang Shaikh

Main category: cs.LG

TL;DR: Explainable deep learning framework integrates CGM and lab data to identify 5 distinct metabolic phenotypes in Type 1 diabetes, revealing subgroups with varying cardiometabolic risk beyond HbA1c alone.

Details

Motivation: Type 1 diabetes is metabolically heterogeneous and cannot be adequately characterized by conventional biomarkers like HbA1c alone. There's a need for better metabolic characterization and risk stratification using multimodal data.

Method: Proposed an explainable deep learning framework integrating continuous glucose monitoring (CGM) data with laboratory profiles. Used transformer encoder to model temporal dependencies across modalities, Gaussian mixture modeling to identify latent metabolic phenotypes, and attention visualization plus SHAP analysis for interpretability.

Result: Identified 5 latent metabolic phenotypes ranging from metabolic stability to elevated cardiometabolic risk among 577 individuals with T1D. Phenotypes showed distinct biochemical profiles in glycemic control, lipid metabolism, renal markers, and TSH levels. Glucose variability was dominant temporal factor; HbA1c, triglycerides, cholesterol, creatinine, and TSH were key phenotype differentiators. Phenotype membership showed modest but significant associations with hypertension, myocardial infarction, and heart failure.

Conclusion: The explainable multimodal temporal embedding framework reveals physiologically coherent metabolic subgroups in T1D and supports risk stratification beyond single biomarkers, providing a more comprehensive approach to metabolic characterization.

Abstract: Type 1 diabetes (T1D) is a highly metabolically heterogeneous disease that cannot be adequately characterized by conventional biomarkers such as glycated hemoglobin (HbA1c). This study proposes an explainable deep learning framework that integrates continuous glucose monitoring (CGM) data with laboratory profiles to learn multimodal temporal embeddings of individual metabolic status. Temporal dependencies across modalities are modeled using a transformer encoder, while latent metabolic phenotypes are identified via Gaussian mixture modeling. Model interpretability is achieved through transformer attention visualization and SHAP-based feature attribution. Five latent metabolic phenotypes, ranging from metabolic stability to elevated cardiometabolic risk, were identified among 577 individuals with T1D. These phenotypes exhibit distinct biochemical profiles, including differences in glycemic control, lipid metabolism, renal markers, and thyrotropin (TSH) levels. Attention analysis highlights glucose variability as a dominant temporal factor, while SHAP analysis identifies HbA1c, triglycerides, cholesterol, creatinine, and TSH as key contributors to phenotype differentiation. Phenotype membership shows statistically significant, albeit modest, associations with hypertension, myocardial infarction, and heart failure. Overall, this explainable multimodal temporal embedding framework reveals physiologically coherent metabolic subgroups in T1D and supports risk stratification beyond single biomarkers.

[435] Quantifying the Effect of Test Set Contamination on Generative Evaluations

Rylan Schaeffer, Joshua Kazdan, Baber Abbasi, Ken Ziyu Liu, Brando Miranda, Ahmed Ahmed, Abhay Puri, Niloofar Mireshghallah, Sanmi Koyejo

Main category: cs.LG

TL;DR: Test set contamination significantly boosts generative evaluation performance, enabling models to achieve lower loss than irreducible error with just one test replica, with effects modulated by model size, training methods, and inference parameters.

Details

Motivation: While test set contamination's impact on discriminative evaluations is well-studied, its effects on generative evaluations remain under-explored, creating a critical gap for accurately assessing frontier AI systems trained on web-scale data.

Method: Pretrained language models on mixtures of web data and MATH benchmark, sweeping model sizes and test set replicas; used scaling laws to analyze contamination effects; studied further training (overtraining with fresh data vs supervised finetuning); examined inference factors like temperature and solution length.

Result: Performance improves with contamination and model size; even one test replica enables lower loss than irreducible error; overtraining reduces contamination effects while finetuning can increase or decrease performance; high temperatures mitigate contamination, longer solutions are exponentially harder to memorize.

Conclusion: Test set contamination introduces complex interactions between generation and memorization, adding new complexity for trustworthy AI evaluation that differs significantly from discriminative settings.

Abstract: As frontier AI systems are pretrained on web-scale data, test set contamination has become a critical concern for accurately assessing their capabilities. While research has thoroughly investigated the impact of test set contamination on discriminative evaluations like multiple-choice question-answering, comparatively little research has studied the impact of test set contamination on generative evaluations. In this work, we quantitatively assess the effect of test set contamination on generative evaluations through the language model lifecycle. We pretrain language models on mixtures of web data and the MATH benchmark, sweeping model sizes and number of test set replicas contaminating the pretraining corpus; performance improves with contamination and model size. Using scaling laws, we make a surprising discovery: including even a single test set replica enables models to achieve lower loss than the irreducible error of training on the uncontaminated corpus. We then study further training: overtraining with fresh data reduces the effects of contamination, whereas supervised finetuning on the training set can either increase or decrease performance on test data, depending on the amount of pretraining contamination. Finally, at inference, we identify factors that modulate memorization: high sampling temperatures mitigate contamination effects, and longer solutions are exponentially more difficult to memorize than shorter ones, presenting a contrast with discriminative evaluations, where solutions are only a few tokens in length. By characterizing how generation and memorization interact, we highlight a new layer of complexity for trustworthy evaluation of AI systems.

[436] Causally-Aware Information Bottleneck for Domain Adaptation

Mohammad Ali Javidian

Main category: cs.LG

TL;DR: The paper proposes a causal domain adaptation method for imputing missing target variables in target domains using mechanism-stable representations via Information Bottleneck approaches.

Details

Motivation: Addresses domain adaptation in causal systems where target variables are observed in source domains but entirely missing in target domains, requiring imputation under various distribution shifts.

Method: For linear Gaussian causal models: closed-form Gaussian Information Bottleneck (GIB) solution reduces to CCA-style projection with optional DAG-awareness. For nonlinear/non-Gaussian data: Variational Information Bottleneck (VIB) encoder-predictor that scales to high dimensions and enables zero-shot deployment.

Result: Consistently achieves accurate imputations across synthetic and real datasets, supporting practical use in high-dimensional causal models.

Conclusion: Provides a unified, lightweight toolkit for causal domain adaptation that learns compact, mechanism-stable representations preserving target-relevant information while discarding spurious variation.

Abstract: We tackle a common domain adaptation setting in causal systems. In this setting, the target variable is observed in the source domain but is entirely missing in the target domain. We aim to impute the target variable in the target domain from the remaining observed variables under various shifts. We frame this as learning a compact, mechanism-stable representation. This representation preserves information relevant for predicting the target while discarding spurious variation. For linear Gaussian causal models, we derive a closed-form Gaussian Information Bottleneck (GIB) solution. This solution reduces to a canonical correlation analysis (CCA)-style projection and offers Directed Acyclic Graph (DAG)-aware options when desired. For nonlinear or non-Gaussian data, we introduce a Variational Information Bottleneck (VIB) encoder-predictor. This approach scales to high dimensions and can be trained on source data and deployed zero-shot to the target domain. Across synthetic and real datasets, our approach consistently attains accurate imputations, supporting practical use in high-dimensional causal models and furnishing a unified, lightweight toolkit for causal domain adaptation.

[437] Phasor Agents: Oscillatory Graphs with Three-Factor Plasticity and Sleep-Staged Learning

Rodja Trappe

Main category: cs.LG

TL;DR: Phasor Agents are dynamical systems using coupled Stuart-Landau oscillators as computational units, with phase for timing coherence and amplitude for activity. The system uses three-factor local plasticity without backpropagation, separates wake tagging from offline consolidation inspired by sleep dynamics, and shows improved stability and planning capabilities.

Details

Motivation: To create stable oscillatory computational systems that can learn without backpropagation while avoiding collapse into global synchrony, inspired by biological oscillatory populations and sleep-stage dynamics for memory consolidation.

Method: Uses Phasor Graphs - weighted graphs of coupled Stuart-Landau oscillators with three-factor local plasticity (eligibility traces gated by sparse global modulators and oscillation-timed write windows). Separates wake tagging from offline consolidation with deep-sleep-like gated capture and REM-like replay for planning.

Result: Eligibility traces preserve credit under delayed modulation; compression-progress signals pass controls; phase-coherent retrieval reaches 4x diffusive baselines; wake/sleep separation expands stable learning by 67%; REM replay improves maze success by +45.5 percentage points; shows Tolman-style latent learning signature.

Conclusion: Phasor Agents demonstrate stable oscillatory computation with biologically-inspired learning mechanisms, achieving improved planning and latent learning through wake/sleep separation and replay, offering a promising alternative to backpropagation-based approaches.

Abstract: Phasor Agents are dynamical systems whose internal state is a Phasor Graph: a weighted graph of coupled Stuart-Landau oscillators. A Stuart-Landau oscillator is a minimal stable “rhythm generator” (the normal form near a Hopf bifurcation); each oscillator is treated as an abstract computational unit (inspired by, but not claiming to model, biological oscillatory populations). In this interpretation, oscillator phase tracks relative timing (coherence), while amplitude tracks local gain or activity. Relative phase structure serves as a representational medium; coupling weights are learned via three-factor local plasticity - eligibility traces gated by sparse global modulators and oscillation-timed write windows - without backpropagation. A central challenge in oscillatory substrates is stability: online weight updates can drive the network into unwanted regimes (e.g., global synchrony), collapsing representational diversity. We therefore separate wake tagging from offline consolidation, inspired by synaptic tagging-and-capture and sleep-stage dynamics: deep-sleep-like gated capture commits tagged changes safely, while REM-like replay reconstructs and perturbs experience for planning. A staged experiment suite validates each mechanism with ablations and falsifiers: eligibility traces preserve credit under delayed modulation; compression-progress signals pass timestamp-shuffle controls; phase-coherent retrieval reaches 4x diffusive baselines under noise; wake/sleep separation expands stable learning by 67 percent under matched weight-norm budgets; REM replay improves maze success rate by +45.5 percentage points; and a Tolman-style latent-learning signature - immediate competence and detour advantage after unrewarded exploration, consistent with an internal model - emerges from replay (Tolman, 1948). The codebase and all artifacts are open-source.

[438] Survival Dynamics of Neural and Programmatic Policies in Evolutionary Reinforcement Learning

Anton Roupassov-Ruiz, Yiyang Zuo

Main category: cs.LG

TL;DR: Programmatic policies (PERL) using soft differentiable decision lists outperform neural policies (NERL) in evolutionary reinforcement learning tasks, surviving 201.69 steps longer on average in ALife testbed.

Details

Motivation: Neural representations in evolutionary RL lack explicit modular structure, limiting behavioral interpretation. The paper investigates whether programmatic policies can match or exceed neural policy performance while providing better interpretability.

Method: Used programmatic policies implemented as soft, differentiable decision lists (SDDL) compared against neural policies. Conducted rigorous survival analysis across 4000 independent trials using Kaplan-Meier curves and Restricted Mean Survival Time metrics on a fully specified open-source reimplementation of the 1992 ALife testbed.

Result: Statistically significant difference in survival probability: PERL agents survive 201.69 steps longer than NERL agents. SDDL agents using learning alone survive 73.67 steps longer than neural agents using both learning and evolution.

Conclusion: Programmatic policies can exceed the survival performance of neural policies in ALife, demonstrating the viability of interpretable programmatic representations in evolutionary reinforcement learning.

Abstract: In evolutionary reinforcement learning tasks (ERL), agent policies are often encoded as small artificial neural networks (NERL). Such representations lack explicit modular structure, limiting behavioral interpretation. We investigate whether programmatic policies (PERL), implemented as soft, differentiable decision lists (SDDL), can match the performance of NERL. To support reproducible evaluation, we provide the first fully specified and open-source reimplementation of the classic 1992 Artificial Life (ALife) ERL testbed. We conduct a rigorous survival analysis across 4000 independent trials utilizing Kaplan-Meier curves and Restricted Mean Survival Time (RMST) metrics absent in the original study. We find a statistically significant difference in survival probability between PERL and NERL. PERL agents survive on average 201.69 steps longer than NERL agents. Moreover, SDDL agents using learning alone (no evolution) survive on average 73.67 steps longer than neural agents using both learning and evaluation. These results demonstrate that programmatic policies can exceed the survival performance of neural policies in ALife.

[439] Machine Learning Model for Sparse PCM Completion

Selcuk Koyuncu, Ronak Nouri, Stephen Providence

Main category: cs.LG

TL;DR: A machine learning model for sparse pairwise comparison matrices that combines classical PCM approaches with graph-based learning techniques.

Details

Motivation: To address the challenge of working with sparse pairwise comparison matrices, which are common in real-world applications but difficult to analyze using traditional methods.

Method: Combines classical pairwise comparison matrix approaches with graph-based learning techniques to create a machine learning model specifically designed for sparse PCMs.

Result: Numerical results demonstrate the effectiveness and scalability of the proposed method.

Conclusion: The proposed hybrid approach successfully addresses sparse PCM analysis challenges by integrating classical and graph-based methods, showing promising results in terms of both effectiveness and scalability.

Abstract: In this paper, we propose a machine learning model for sparse pairwise comparison matrices (PCMs), combining classical PCM approaches with graph-based learning techniques. Numerical results are provided to demonstrate the effectiveness and scalability of the proposed method.

[440] Aligned explanations in neural networks

Corentin Lobet, Francesca Chiaromonte

Main category: cs.LG

TL;DR: PiNets are pseudo-linear networks that produce instance-wise linear predictions for better explanatory alignment, making deep learning models more trustworthy through faithful explanations directly linked to predictions.

Details

Motivation: Current feature attribution methods for explaining neural networks are often post-hoc rationalizations that don't truly reflect the model's prediction process, creating a "white-painted black box" problem. The authors argue that explanatory alignment - directly linking explanations to predictions - is crucial for trustworthiness in prediction tasks.

Method: Propose PiNets (pseudo-linear networks) as a modeling framework that produces instance-wise linear predictions in arbitrary feature spaces, making them linearly readable. This enables model readability as a design principle for achieving explanatory alignment in deep learning.

Result: Demonstrated PiNets on image classification and segmentation tasks, showing they produce explanations that are faithful across multiple criteria in addition to achieving alignment between explanations and predictions.

Conclusion: PiNets provide a framework for building more trustworthy deep learning models through explanatory alignment, moving beyond post-hoc feature attribution methods to create explanations that are directly linked to the model’s prediction-making process.

Abstract: Feature attribution is the dominant paradigm for explaining deep neural networks. However, most existing methods only loosely reflect the model’s prediction-making process, thereby merely white-painting the black box. We argue that explanatory alignment is a key aspect of trustworthiness in prediction tasks: explanations must be directly linked to predictions, rather than serving as post-hoc rationalizations. We present model readability as a design principle enabling alignment, and PiNets as a modeling framework to pursue it in a deep learning context. PiNets are pseudo-linear networks that produce instance-wise linear predictions in an arbitrary feature space, making them linearly readable. We illustrate their use on image classification and segmentation tasks, demonstrating how PiNets produce explanations that are faithful across multiple criteria in addition to alignment.

[441] Enhanced-FQL($λ$), an Efficient and Interpretable RL with novel Fuzzy Eligibility Traces and Segmented Experience Replay

Mohsen Jalaeian-Farimani

Main category: cs.LG

TL;DR: Enhanced-FQL(λ) is a fuzzy reinforcement learning framework that combines Fuzzified Eligibility Traces and Segmented Experience Replay for continuous control tasks, offering interpretability, computational efficiency, and theoretical convergence guarantees.

Details

Motivation: To develop a reinforcement learning approach for continuous control that maintains competitive performance while being interpretable and computationally efficient, particularly for safety-critical applications where transparency and resource constraints are essential.

Method: Integrates Fuzzified Eligibility Traces (FET) and Segmented Experience Replay (SER) into fuzzy Q-learning with Fuzzified Bellman Equation (FBE). Uses an interpretable fuzzy rule base instead of neural networks, with multi-step credit assignment via fuzzified eligibility traces and memory-efficient segment-based experience replay.

Result: Achieves superior sample efficiency and reduced variance compared to n-step fuzzy TD and fuzzy SARSA(λ) baselines, while maintaining substantially lower computational complexity than deep RL alternatives like DDPG. Theoretical convergence is proven under standard assumptions.

Conclusion: Enhanced-FQL(λ) provides an effective framework for continuous control that balances performance, interpretability, and computational efficiency, making it particularly suitable for safety-critical applications requiring transparency and resource efficiency.

Abstract: This paper introduces a fuzzy reinforcement learning framework, Enhanced-FQL($λ$), that integrates novel Fuzzified Eligibility Traces (FET) and Segmented Experience Replay (SER) into fuzzy Q-learning with Fuzzified Bellman Equation (FBE) for continuous control tasks. The proposed approach employs an interpretable fuzzy rule base instead of complex neural architectures, while maintaining competitive performance through two key innovations: a fuzzified Bellman equation with eligibility traces for stable multi-step credit assignment, and a memory-efficient segment-based experience replay mechanism for enhanced sample efficiency. Theoretical analysis proves the proposed method convergence under standard assumptions. Extensive evaluations in continuous control domains demonstrate that Enhanced-FQL($λ$) achieves superior sample efficiency and reduced variance compared to n-step fuzzy TD and fuzzy SARSA($λ$) baselines, while maintaining substantially lower computational complexity than deep RL alternatives such as DDPG. The framework’s inherent interpretability, combined with its computational efficiency and theoretical convergence guarantees, makes it particularly suitable for safety-critical applications where transparency and resource constraints are essential.

[442] Rate or Fate? RLV$^\varepsilon$R: Reinforcement Learning with Verifiable Noisy Rewards

Ali Rad, Khashayar Filom, Darioush Keivan, Peyman Mohajerin Esfahani, Ehsan Kamalinejad

Main category: cs.LG

TL;DR: RLVR with noisy verification can lead to learning, neutral dynamics, or anti-learning collapse depending on Youden’s index (J=TPR-FPR). When J>0, learning occurs; J=0 is neutral; J<0 causes collapse.

Details

Motivation: Real-world RLVR faces noisy verification (imperfect tests, human labels, LLM judges) that worsens in hard domains like coding. Need to understand if noise merely slows learning or fundamentally changes outcomes.

Method: Developed an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO. Modeled false positives/negatives and grouped completions into reasoning modes, yielding replicator-style flow on probability simplex.

Result: Found sharp phase transition: J>0 drives incorrect mass to extinction (learning); J=0 yields neutral dynamics; J<0 amplifies incorrect modes until domination (anti-learning collapse). Experiments on programming tasks validate J=0 boundary.

Conclusion: Verification noise determines fate, not just rate: J>0 enables learning, J<0 causes collapse. Framework provides general lens for analyzing RLVR stability, convergence, and algorithmic interventions beyond noise.

Abstract: Reinforcement learning with verifiable rewards (RLVR) is a simple but powerful paradigm for training LLMs: sample a completion, verify it, and update. In practice, however, the verifier is almost never clean–unit tests probe only limited corner cases; human and synthetic labels are imperfect; and LLM judges (e.g., RLAIF) are noisy and can be exploited–and this problem worsens on harder domains (especially coding) where tests are sparse and increasingly model-generated. We ask a pragmatic question: does the verification noise merely slow down the learning (rate), or can it flip the outcome (fate)? To address this, we develop an analytically tractable multi-armed bandit view of RLVR dynamics, instantiated with GRPO and validated in controlled experiments. Modeling false positives and false negatives and grouping completions into recurring reasoning modes yields a replicator-style (natural-selection) flow on the probability simplex. The dynamics decouples into within-correct-mode competition and a one-dimensional evolution for the mass on incorrect modes, whose drift is determined solely by Youden’s index J=TPR-FPR. This yields a sharp phase transition: when J>0, the incorrect mass is driven toward extinction (learning); when J=0, the process is neutral; and when J<0, incorrect modes amplify until they dominate (anti-learning and collapse). In the learning regime J>0, noise primarily rescales convergence time (“rate, not fate”). Experiments on verifiable programming tasks under synthetic noise reproduce the predicted J=0 boundary. Beyond noise, the framework offers a general lens for analyzing RLVR stability, convergence, and algorithmic interventions.

[443] Distribution-Guided and Constrained Quantum Machine Unlearning

Nausherwan Malik, Zubair Khalid, Muhammad Faryad

Main category: cs.LG

TL;DR: Distribution-guided quantum machine unlearning framework using tunable target distributions and anchor-based constraints for controlled forgetting of specific classes while preserving retained model behavior.

Details

Motivation: Existing quantum machine unlearning approaches rely on fixed, uniform target distributions and lack explicit control over the trade-off between forgetting and preserving model behavior, limiting reliability and interpretability.

Method: Proposes a distribution-guided framework treating unlearning as constrained optimization: 1) Uses tunable target distribution derived from model similarity statistics to decouple forgotten-class suppression from redistribution assumptions, 2) Incorporates anchor-based preservation constraint to maintain predictive behavior on selected retained data, enabling controlled optimization trajectory.

Result: Evaluation on variational quantum classifiers (Iris and Covertype datasets) shows: sharp suppression of forgotten-class confidence, minimal degradation of retained-class performance, and closer alignment with gold retrained model baselines compared to uniform-target unlearning.

Conclusion: Target design and constraint-based formulations are crucial for reliable and interpretable quantum machine unlearning, with the proposed framework demonstrating effective controlled forgetting while preserving model behavior.

Abstract: Machine unlearning aims to remove the influence of specific training data from a learned model without full retraining. While recent work has begun to explore unlearning in quantum machine learning, existing approaches largely rely on fixed, uniform target distributions and do not explicitly control the trade-off between forgetting and retained model behaviour. In this work, we propose a distribution-guided framework for class-level quantum machine unlearning that treats unlearning as a constrained optimization problem. Our method introduces a tunable target distribution derived from model similarity statistics, decoupling the suppression of forgotten-class confidence from assumptions about redistribution among retained classes. We further incorporate an anchor-based preservation constraint that explicitly maintains predictive behaviour on selected retained data, yielding a controlled optimization trajectory that limits deviation from the original model. We evaluate the approach on variational quantum classifiers trained on the Iris and Covertype datasets. Results demonstrate sharp suppression of forgotten-class confidence, minimal degradation of retained-class performance, and closer alignment with the gold retrained model baselines compared to uniform-target unlearning. These findings highlight the importance of target design and constraint-based formulations for reliable and interpretable quantum machine unlearning.

[444] Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization

Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab

Main category: cs.LG

TL;DR: SPIN introduces a two-stage framework that pre-trains an Action Structure Model to learn valid action manifolds, then freezes it to train lightweight policy heads, improving performance and convergence speed in discrete combinatorial action spaces.

Details

Motivation: Reinforcement learning in discrete combinatorial action spaces faces challenges: existing methods either assume independence across sub-actions (leading to incoherent/invalid actions) or jointly learn action structure and control (slow and unstable).

Method: Two-stage Structured Policy Initialization (SPIN): 1) Pre-train an Action Structure Model (ASM) to capture the manifold of valid actions, 2) Freeze this representation and train lightweight policy heads for control.

Result: On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over state-of-the-art methods while reducing time to convergence by up to 12.8×.

Conclusion: SPIN effectively addresses the combinatorial action space problem by separating action structure learning from policy learning, achieving better performance and faster convergence than existing approaches.

Abstract: Reinforcement learning in discrete combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over the state of the art while reducing time to convergence by up to 12.8$\times$.

[445] When Predictions Shape Reality: A Socio-Technical Synthesis of Performative Predictions in Machine Learning

Gal Fybish, Teo Susnjak

Main category: cs.LG

TL;DR: A Systematisation of Knowledge (SoK) paper that provides a comprehensive review of performative prediction literature, introducing a practical assessment framework for practitioners to evaluate and address performativity risks in ML models.

Details

Motivation: Machine learning models in high-stakes domains create performative predictions where model deployment influences the outcomes they predict, leading to feedback loops, performance issues, and societal risks. There's a lack of socio-technical synthesis and practical guidance for practitioners dealing with these dynamics.

Method: Conducted a comprehensive literature review and systematisation of knowledge on performative predictions. Developed a typology of risks, surveyed proposed solutions, and created the “Performative Strength vs. Impact Matrix” assessment framework.

Result: Provides an overview of performativity mechanisms, a typology of associated risks, and a survey of literature solutions. The primary contribution is the practical assessment framework that helps practitioners evaluate performativity influence and select appropriate interventions.

Conclusion: This SoK addresses the gap in performative prediction literature by systematizing concepts and providing practical tools for practitioners to assess and mitigate risks associated with ML models that actively shape their deployment environments.

Abstract: Machine learning models are increasingly used in high-stakes domains where their predictions can actively shape the environments in which they operate, a phenomenon known as performative prediction. This dynamic, in which the deployment of the model influences the very outcome it seeks to predict, can lead to unintended consequences, including feedback loops, performance issues, and significant societal risks. While the literature in the field has grown rapidly in recent years, a socio-technical synthesis that systemises the phenomenon concepts and provides practical guidance has been lacking. This Systematisation of Knowledge (SoK) addresses this gap by providing a comprehensive review of the literature on performative predictions. We provide an overview of the primary mechanisms through which performativity manifests, present a typology of associated risks, and survey the proposed solutions offered in the literature. Our primary contribution is the ``Performative Strength vs. Impact Matrix" assessment framework. This practical tool is designed to help practitioners assess the potential influence and severity of performativity on their deployed predictive models and select the appropriate level of algorithmic or human intervention.

[446] Explainable Admission-Level Predictive Modeling for Prolonged Hospital Stay in Elderly Populations: Challenges in Low- and Middle-Income Countries

Daniel Sierra-Botero, Ana Molina-Taborda, Leonardo Espinosa-Leal, Alexander Karpenko, Alejandro Hernandez, Olga Lopez-Acevedo

Main category: cs.LG

TL;DR: Developed a predictive model for prolonged hospital length of stay (>7 days) using feature selection with graph theory and logistic regression, achieving AUC-ROC of 0.82 on 120k admissions.

Details

Motivation: Prolonged length of stay (pLoS) is associated with adverse in-hospital events and represents a significant challenge for hospital management and patient outcomes.

Method: Used feature selection method based on information value and graph theory (clique selection) to identify non-correlated features. Trained logistic regression model on 120,354 hospital admissions (2017-2022) split into training (67%), test (22%), and validation (11%) cohorts to predict pLoS (>7 days vs ≤7 days).

Result: Model achieved specificity 0.83, sensitivity 0.64, accuracy 0.76, precision 0.67, and AUC-ROC 0.82 on validation cohort. Feature selection identified 9 interpretable variables, enhancing model transparency.

Conclusion: The model demonstrates strong predictive performance and provides insights into factors influencing prolonged hospital stays, making it valuable for hospital management and future intervention studies to reduce pLoS.

Abstract: Prolonged length of stay (pLoS) is a significant factor associated with the risk of adverse in-hospital events. We develop and explain a predictive model for pLos using admission-level patient and hospital administrative data. The approach includes a feature selection method by selecting non-correlated features with the highest information value. The method uses features weights of evidence to select a representative within cliques from graph theory. The prognosis study analyzed the records from 120,354 hospital admissions at the Hospital Alma Mater de Antioquia between January 2017 and March 2022. After a cleaning process the dataset was split into training (67%), test (22%), and validation (11%) cohorts. A logistic regression model was trained to predict the pLoS in two classes: less than or greater than 7 days. The performance of the model was evaluated using accuracy, precision, sensitivity, specificity, and AUC-ROC metrics. The feature selection method returns nine interpretable variables, enhancing the models’ transparency. In the validation cohort, the pLoS model achieved a specificity of 0.83 (95% CI, 0.82-0.84), sensitivity of 0.64 (95% CI, 0.62-0.65), accuracy of 0.76 (95% CI, 0.76-0.77), precision of 0.67 (95% CI, 0.66-0.69), and AUC-ROC of 0.82 (95% CI, 0.81-0.83). The model exhibits strong predictive performance and offers insights into the factors that influence prolonged hospital stays. This makes it a valuable tool for hospital management and for developing future intervention studies aimed at reducing pLoS.

[447] Using Large Language Models to Detect Socially Shared Regulation of Collaborative Learning

Jiayi Zhang, Conrad Borchers, Clayton Cohn, Namrata Srivastava, Caitlin Snyder, Siyuan Guo, Ashwin T S, Naveeduddin Mohammed, Haley Noh, Gautam Biswas

Main category: cs.LG

TL;DR: The paper develops embedding-based models using LLM summaries and multimodal features to automatically detect socially shared regulation of learning (SSRL) behaviors in collaborative computational modeling environments.

Details

Motivation: Learning analytics has advanced in detecting complex learning processes but focuses mainly on individualized problem-solving rather than collaborative, open-ended problem-solving. Collaborative environments offer richer data but present challenges like low cohesion for behavioral prediction.

Method: Used LLMs as summarization tools to generate task-aware representations of student dialogue aligned with system logs. Combined these summaries with text-only embeddings, context-enriched embeddings, and log-derived features to train predictive models for SSRL behavior detection.

Result: Text-only embeddings performed better for detecting SSRL behaviors related to enactment or group dynamics (off-task behavior, requesting assistance). Contextual and multimodal features provided complementary benefits for constructs like planning and reflection.

Conclusion: Embedding-based models show promise for extending learning analytics by enabling scalable detection of SSRL behaviors, supporting real-time feedback and adaptive scaffolding in collaborative learning environments valued by teachers.

Abstract: The field of learning analytics has made notable strides in automating the detection of complex learning processes in multimodal data. However, most advancements have focused on individualized problem-solving instead of collaborative, open-ended problem-solving, which may offer both affordances (richer data) and challenges (low cohesion) to behavioral prediction. Here, we extend predictive models to automatically detect socially shared regulation of learning (SSRL) behaviors in collaborative computational modeling environments using embedding-based approaches. We leverage large language models (LLMs) as summarization tools to generate task-aware representations of student dialogue aligned with system logs. These summaries, combined with text-only embeddings, context-enriched embeddings, and log-derived features, were used to train predictive models. Results show that text-only embeddings often achieve stronger performance in detecting SSRL behaviors related to enactment or group dynamics (e.g., off-task behavior or requesting assistance). In contrast, contextual and multimodal features provide complementary benefits for constructs such as planning and reflection. Overall, our findings highlight the promise of embedding-based models for extending learning analytics by enabling scalable detection of SSRL behaviors, ultimately supporting real-time feedback and adaptive scaffolding in collaborative learning environments that teachers value.

[448] Meta-probabilistic Modeling

Kevin Zhang, Yixin Wang

Main category: cs.LG

TL;DR: Meta-learning algorithm learns generative model structure from multiple related datasets using hierarchical architecture and bi-level optimization.

Details

Motivation: Choosing well-specified probabilistic graphical models is challenging and requires iterative trial-and-error. Need to learn model structure directly from data rather than manually specifying.

Method: Meta-Probabilistic Modeling (MPM) uses hierarchical architecture with global model specifications shared across datasets and local parameters dataset-specific. Uses VAE-inspired surrogate objective with bi-level optimization: local variables updated analytically via coordinate ascent, global parameters trained with gradient methods.

Result: MPM successfully adapts generative models to data while recovering meaningful latent representations in object-centric image modeling and sequential text modeling tasks.

Conclusion: MPM provides an effective meta-learning approach for learning generative model structure from multiple related datasets, overcoming the manual specification challenges of traditional probabilistic graphical models.

Abstract: While probabilistic graphical models can discover latent structure in data, their effectiveness hinges on choosing well-specified models. Identifying such models is challenging in practice, often requiring iterative checking and revision through trial and error. To this end, we propose meta-probabilistic modeling (MPM), a meta-learning algorithm that learns generative model structure directly from multiple related datasets. MPM uses a hierarchical architecture where global model specifications are shared across datasets while local parameters remain dataset-specific. For learning and inference, we propose a tractable VAE-inspired surrogate objective, and optimize it through bi-level optimization: local variables are updated analytically via coordinate ascent, while global parameters are trained with gradient-based methods. We evaluate MPM on object-centric image modeling and sequential text modeling, demonstrating that it adapts generative models to data while recovering meaningful latent representations.

[449] When Models Manipulate Manifolds: The Geometry of a Counting Task

Wes Gurnee, Emmanuel Ameisen, Isaac Kauvar, Julius Tarng, Adam Pearce, Chris Olah, Joshua Batson

Main category: cs.LG

TL;DR: Claude 3.5 Haiku’s early layers perform visual processing to detect linebreaks in fixed-width text using geometric representations of character counts, similar to biological place cells.

Details

Motivation: To understand how language models can perceive visual properties of text (like linebreaking) despite only receiving token sequences, and to investigate the mechanistic basis of this sensory processing capability.

Method: Mechanistic investigation of Claude 3.5 Haiku’s linebreaking task, analyzing how character counts are represented on low-dimensional curved manifolds using sparse feature families, and examining the geometric transformations involved in linebreak decisions.

Result: Found that character counts are represented on discretized curved manifolds analogous to biological place cells. Linebreak decisions emerge through a sequence of geometric transformations: token length accumulation, attention head manipulation of manifolds to estimate distance to boundary, and orthogonal arrangement of estimates creating linear decision boundaries. Validated through causal interventions and discovered visual illusions that hijack the counting mechanism.

Conclusion: Demonstrates rich sensory processing in early layers of language models, intricate attention algorithms, and the importance of combining feature-based and geometric perspectives for interpretability. Shows how models develop internal representations for visual text properties despite token-only input.

Abstract: Language models can perceive visual properties of text despite receiving only sequences of tokens-we mechanistically investigate how Claude 3.5 Haiku accomplishes one such task: linebreaking in fixed-width text. We find that character counts are represented on low-dimensional curved manifolds discretized by sparse feature families, analogous to biological place cells. Accurate predictions emerge from a sequence of geometric transformations: token lengths are accumulated into character count manifolds, attention heads twist these manifolds to estimate distance to the line boundary, and the decision to break the line is enabled by arranging estimates orthogonally to create a linear decision boundary. We validate our findings through causal interventions and discover visual illusions–character sequences that hijack the counting mechanism. Our work demonstrates the rich sensory processing of early layers, the intricacy of attention algorithms, and the importance of combining feature-based and geometric views of interpretability.

[450] Hybrid Federated Learning for Noise-Robust Training

Yongjun Kim, Hyeongjun Park, Hwanjin Kim, Junil Choi

Main category: cs.LG

TL;DR: Hybrid Federated Learning (HFL) combines FL and FD to balance noise robustness and learning speed, using adaptive UE clustering and weight selection to achieve better accuracy at low SNR.

Details

Motivation: Federated learning (FL) and federated distillation (FD) each have trade-offs: FL is noise-robust but slow, FD is faster but less robust. The paper aims to combine their strengths while mitigating weaknesses.

Method: Proposes HFL framework where UEs transmit either gradients or logits, BS selects per-round weights for FL/FD updates. Uses two DoF exploitation methods: (1) adaptive UE clustering via Jenks optimization, (2) adaptive weight selection via damped Newton method.

Result: Numerical results show HFL achieves superior test accuracy at low SNR when both DoF (adaptive clustering and weight selection) are exploited.

Conclusion: HFL effectively combines FL and FD advantages, with adaptive techniques enabling better performance in noisy environments.

Abstract: Federated learning (FL) and federated distillation (FD) are distributed learning paradigms that train UE models with enhanced privacy, each offering different trade-offs between noise robustness and learning speed. To mitigate their respective weaknesses, we propose a hybrid federated learning (HFL) framework in which each user equipment (UE) transmits either gradients or logits, and the base station (BS) selects the per-round weights of FL and FD updates. We derive convergence of HFL framework and introduce two methods to exploit degrees of freedom (DoF) in HFL, which are (i) adaptive UE clustering via Jenks optimization and (ii) adaptive weight selection via a damped Newton method. Numerical results show that HFL achieves superior test accuracy at low SNR when both DoF are exploited.

[451] IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation

Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie, Jiale Lao, Yiyao Wang, Haoxuan Li, Tingting Gao, Bo Pan, Luoxuan Weng, Xiuqi Huang, Minfeng Zhu, Yingchaojie Feng, Yuyu Luo, Wei Chen

Main category: cs.LG

TL;DR: IGENBENCH is the first benchmark for evaluating text-to-infographic generation reliability, revealing significant issues in current T2I models despite their aesthetic appeal.

Details

Motivation: While T2I models can generate visually appealing images, their reliability for infographic generation remains unclear, as generated infographics may contain subtle but critical errors in data encoding and textual content that are easily overlooked.

Method: Created IGENBENCH with 600 curated test cases across 30 infographic types, designed an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on 10 question types, and used MLLMs to verify each question.

Result: Evaluation of 10 SOTA T2I models shows: (i) three-tier performance hierarchy with top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions are universal bottlenecks (Data Completeness: 0.21); (iii) end-to-end correctness remains challenging across all models.

Conclusion: IGENBENCH provides the first systematic benchmark for evaluating infographic generation reliability, revealing significant gaps in current T2I models’ ability to produce accurate infographics, particularly in data-related aspects, highlighting the need for improved model development.

Abstract: Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at https://igen-bench.vercel.app/.

Fang Wu, Zhengyuan Zhou, Shuting Jin, Xiangxiang Zeng, Jure Leskovec, Jinbo Xu

Main category: cs.LG

TL;DR: SurfFlow is a novel surface-based generative algorithm for comprehensive peptide co-design that outperforms full-atom baselines by incorporating molecular surface information.

Details

Motivation: Therapeutic peptides can target undruggable binding sites, but current deep generative models for peptide co-design have underexplored the critical role of molecular surfaces in protein-protein interactions.

Method: SurfFlow uses a multi-modality conditional flow matching (CFM) architecture to learn distributions of surface geometries and biochemical properties, enabling comprehensive co-design of peptide sequence, structure, and surface.

Result: On the comprehensive PepMerge benchmark, SurfFlow consistently outperforms full-atom baselines across all metrics, demonstrating superior peptide binding accuracy.

Conclusion: Molecular surfaces play a crucial role in de novo peptide discovery, and integrating multiple protein modalities through surface-based approaches like SurfFlow can significantly enhance therapeutic peptide discovery effectiveness.

Abstract: Therapeutic peptides show promise in targeting previously undruggable binding sites, with recent advancements in deep generative models enabling full-atom peptide co-design for specific protein receptors. However, the critical role of molecular surfaces in protein-protein interactions (PPIs) has been underexplored. To bridge this gap, we propose an omni-design peptides generation paradigm, called SurfFlow, a novel surface-based generative algorithm that enables comprehensive co-design of sequence, structure, and surface for peptides. SurfFlow employs a multi-modality conditional flow matching (CFM) architecture to learn distributions of surface geometries and biochemical properties, enhancing peptide binding accuracy. Evaluated on the comprehensive PepMerge benchmark, SurfFlow consistently outperforms full-atom baselines across all metrics. These results highlight the advantages of considering molecular surfaces in de novo peptide discovery and demonstrate the potential of integrating multiple protein modalities for more effective therapeutic peptide discovery.

[453] TSSR: Two-Stage Swap-Reward-Driven Reinforcement Learning for Character-Level SMILES Generation

Jacob Ede Levine, Yun Lyan Luo, Sai Chandra Kosaraju

Main category: cs.LG

TL;DR: TSSR is a two-stage RL framework for SMILES generation that improves molecular validity and novelty through token-swap rewards and chemistry-aware feedback.

Details

Motivation: Current chemical language models generating SMILES strings suffer from compounding token errors, producing unparseable or chemically implausible molecules. Hard constraints to prevent failure restrict exploration, creating a need for better generation methods.

Method: Two-stage RL framework: Stage 1 rewards local token swaps that repair syntax (invalid to parseable strings). Stage 2 provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. Uses interpretable reward terms (swap efficiency, error reduction, distance to validity), model-agnostic, no task-specific labels or hand-crafted grammars.

Result: In pure RL (P-RL): significantly improves syntactic validity, chemical validity, and novelty. In fine-tuning RL (F-RL): preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows syntax edits and chemistry fixes jointly reduce RDKit detected errors.

Conclusion: TSSR converts sparse terminal objectives into denser, interpretable rewards, improving both syntactic and chemical quality without reducing diversity. It’s dataset-agnostic and adaptable to various RL approaches, addressing key limitations of current SMILES generation methods.

Abstract: The design of reliable, valid, and diverse molecules is fundamental to modern drug discovery, as improved molecular generation supports efficient exploration of the chemical space for potential drug candidates and reduces the cost of early design efforts. Despite these needs, current chemical language models that generate molecules as SMILES strings are vulnerable to compounding token errors: many samples are unparseable or chemically implausible, and hard constraints meant to prevent failure can restrict exploration. To address this gap, we introduce TSSR, a Two-Stage, Swap-Reward-driven reinforcement learning (RL) framework for character-level SMILES generation. Stage one rewards local token swaps that repair syntax, promoting transitions from invalid to parseable strings. Stage two provides chemistry-aware feedback from RDKit diagnostics, rewarding reductions in valence, aromaticity, and connectivity issues. The reward decomposes into interpretable terms (swap efficiency, error reduction, distance to validity), is model agnostic, and requires no task-specific labels or hand-crafted grammars. We evaluated TSSR on the MOSES benchmark using a GRU policy trained with PPO in both pure RL (P-RL) from random initialization and fine-tuning RL (F-RL) starting from a pretrained chemical language model, assessing 10,000 generated SMILES per run. In P-RL, TSSR significantly improves syntactic validity, chemical validity, and novelty. In F-RL, TSSR preserves drug-likeness and synthesizability while increasing validity and novelty. Token-level analysis shows that syntax edits and chemistry fixes act jointly to reduce RDKit detected errors. TSSR converts a sparse terminal objective into a denser and more interpretable reward, improving both syntactic and chemical quality without reducing diversity. TSSR is dataset-agnostic and can be adapted to various reinforcement learning approaches.

[454] Not All Steps are Informative: On the Linearity of LLMs’ RLVR Training

Tianle Wang, Zhongyuan Wu, Shenghao Jin, Hao Xu, Wei Chen, Ning Miao

Main category: cs.LG

TL;DR: RL training for LLMs shows strong linear evolution, enabling weight/logits extrapolation to skip expensive training while maintaining or improving performance.

Details

Motivation: RLVR training for LLMs requires thousands of steps with substantial computation due to prolonged exploration. The authors discovered that LLMs evolve in a strongly linear manner during RLVR, suggesting that RLVR mainly amplifies early trends rather than discovering new behaviors throughout training.

Method: The paper investigates whether future model states can be predicted from intermediate checkpoints via extrapolation. Two approaches are proposed: Weight Extrapolation (predicting future model weights) and Logits Extrapolation (predicting future output log-probabilities).

Result: Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Logits Extrapolation consistently outperforms continued RL training on all four benchmarks by extrapolating beyond the step range where RL training remains stable.

Conclusion: The linear evolution of LLMs during RLVR enables efficient extrapolation techniques that can dramatically reduce computational costs while maintaining or even improving performance compared to standard RL training.

Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on all four benchmarks by extrapolating beyond the step range where RL training remains stable.

[455] Timeliness-Oriented Scheduling and Resource Allocation in Multi-Region Collaborative Perception

Mengmeng Zhu, Yuxuan Sun, Yukuan Jia, Wei Chen, Bo Ai, Sheng Zhou

Main category: cs.LG

TL;DR: TAMP scheduling algorithm optimizes collaborative perception by balancing timeliness (AoI) and communication volume to maximize perception accuracy while minimizing resource usage.

Details

Motivation: Collaborative perception faces challenges with information timeliness (dynamic environments) and communication constraints (limited bandwidth/computation), requiring intelligent scheduling to balance perception accuracy and resource usage.

Method: Proposes TAMP scheduling algorithm using Lyapunov-based optimization that decomposes long-term average objective into per-slot prioritization, balancing scheduling worth against resource cost with empirical penalty function mapping AoI and communication volume to perception performance.

Result: Extensive simulations on RCooper dataset show TAMP outperforms baselines with up to 27% AP improvement across various configurations in intersection and corridor scenarios.

Conclusion: TAMP effectively addresses dynamic scheduling in multi-region CP by optimizing timeliness-aware prioritization, achieving significant perception accuracy improvements while managing communication resources efficiently.

Abstract: Collaborative perception (CP) is a critical technology in applications like autonomous driving and smart cities. It involves the sharing and fusion of information among sensors to overcome the limitations of individual perception, such as blind spots and range limitations. However, CP faces two primary challenges. First, due to the dynamic nature of the environment, the timeliness of the transmitted information is critical to perception performance. Second, with limited computational power at the sensors and constrained wireless bandwidth, the communication volume must be carefully designed to ensure feature representations are both effective and sufficient. This work studies the dynamic scheduling problem in a multi-region CP scenario, and presents a Timeliness-Aware Multi-region Prioritized (TAMP) scheduling algorithm to trade-off perception accuracy and communication resource usage. Timeliness reflects the utility of information that decays as time elapses, which is manifested by the perception performance in CP tasks. We propose an empirical penalty function that maps the joint impact of Age of Information (AoI) and communication volume to perception performance. Aiming to minimize this timeliness-oriented penalty in the long-term, and recognizing that scheduling decisions have a cumulative effect on subsequent system states, we propose the TAMP scheduling algorithm. TAMP is a Lyapunov-based optimization policy that decomposes the long-term average objective into a per-slot prioritization problem, balancing the scheduling worth against resource cost. We validate our algorithm in both intersection and corridor scenarios with the real-world Roadside Cooperative perception (RCooper) dataset. Extensive simulations demonstrate that TAMP outperforms the best-performing baseline, achieving an Average Precision (AP) improvement of up to 27% across various configurations.

[456] GEnSHIN: Graphical Enhanced Spatio-temporal Hierarchical Inference Network for Traffic Flow Prediction

Zhiyan Zhou, Junjie Liao, Manho Zhang, Yingyi Liao, Ziai Wang

Main category: cs.LG

TL;DR: GEnSHIN is a novel graph-enhanced spatio-temporal hierarchical inference network for traffic flow prediction that integrates attention-enhanced GCRU, asymmetric dual-embedding graph generation, and dynamic memory bank modules to handle complex spatio-temporal dependencies.

Details

Motivation: With accelerating urbanization, intelligent transportation systems require accurate traffic flow prediction to manage complex spatio-temporal dependencies in urban traffic networks.

Method: Three innovative designs: 1) Attention-enhanced Graph Convolutional Recurrent Unit (GCRU) with Transformer modules for long-term temporal dependencies; 2) Asymmetric dual-embedding graph generation using real road network and data-driven latent asymmetric topology; 3) Dynamic memory bank with learnable traffic pattern prototypes and lightweight graph updater for personalized representations and dynamic adaptation.

Result: Extensive experiments on METR-LA dataset show GEnSHIN achieves or surpasses comparative models across MAE, RMSE, and MAPE metrics, with excellent prediction stability during peak traffic hours. Ablation experiments validate each core module’s effectiveness.

Conclusion: GEnSHIN effectively handles complex spatio-temporal dependencies in traffic flow prediction through its integrated architecture, demonstrating superior performance and stability, particularly during challenging peak traffic periods.

Abstract: With the acceleration of urbanization, intelligent transportation systems have an increasing demand for accurate traffic flow prediction. This paper proposes a novel Graph Enhanced Spatio-temporal Hierarchical Inference Network (GEnSHIN) to handle the complex spatio-temporal dependencies in traffic flow prediction. The model integrates three innovative designs: 1) An attention-enhanced Graph Convolutional Recurrent Unit (GCRU), which strengthens the modeling capability for long-term temporal dependencies by introducing Transformer modules; 2) An asymmetric dual-embedding graph generation mechanism, which leverages the real road network and data-driven latent asymmetric topology to generate graph structures that better fit the characteristics of actual traffic flow; 3) A dynamic memory bank module, which utilizes learnable traffic pattern prototypes to provide personalized traffic pattern representations for each sensor node, and introduces a lightweight graph updater during the decoding phase to adapt to dynamic changes in road network states. Extensive experiments on the public dataset METR-LA show that GEnSHIN achieves or surpasses the performance of comparative models across multiple metrics such as Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE). Notably, the model demonstrates excellent prediction stability during peak morning and evening traffic hours. Ablation experiments further validate the effectiveness of each core module and its contribution to the final performance.

[457] Improving Semi-Supervised Contrastive Learning via Entropy-Weighted Confidence Integration of Anchor-Positive Pairs

Shogo Nakayama, Masahiro Okuda

Main category: cs.LG

TL;DR: Novel semi-supervised contrastive learning method using entropy-based confidence estimation and adaptive weighting for pseudo-label assignment, improving accuracy and stability under low-label conditions.

Details

Motivation: Conventional semi-supervised contrastive learning methods are limited by threshold-based pseudo-label assignment, which excludes many potentially useful samples from training. This restricts learning effectiveness, especially in low-label scenarios where labeled data is scarce.

Method: Proposes a novel loss function that estimates sample confidence based on the entropy of predicted probability distributions, applies confidence-based adaptive weighting, enables pseudo-label assignment to previously excluded samples, and performs contrastive learning considering both anchor and positive sample confidence in a principled manner.

Result: Experimental results show improved classification accuracy and more stable learning performance, particularly under low-label conditions where labeled data is limited.

Conclusion: The proposed entropy-based confidence estimation and adaptive weighting approach provides a more effective and principled method for semi-supervised contrastive learning, overcoming limitations of threshold-based pseudo-label assignment and enhancing performance in data-scarce scenarios.

Abstract: Conventional semi-supervised contrastive learning methods assign pseudo-labels only to samples whose highest predicted class probability exceeds a predefined threshold, and then perform supervised contrastive learning using those selected samples. In this study, we propose a novel loss function that estimates the confidence of each sample based on the entropy of its predicted probability distribution and applies confidence-based adaptive weighting. This approach enables pseudo-label assignment even to samples that were previously excluded from training and facilitates contrastive learning that accounts for the confidence of both anchor and positive samples in a more principled manner. Experimental results demonstrate that the proposed method improves classification accuracy and achieves more stable learning performance even under low-label conditions.

[458] A Vision for Multisensory Intelligence: Sensing, Synergy, and Science

Paul Pu Liang

Main category: cs.LG

TL;DR: This paper presents a 10-year research vision for multisensory AI that goes beyond current digital modalities to incorporate all human senses and physical/environmental signals, aiming to transform human-AI interaction.

Details

Motivation: Current AI has primarily advanced in digital modalities (text, vision, audio), but human experience is fundamentally multisensory. There's a need to connect AI to the full spectrum of human senses and physical/environmental signals to create more natural and comprehensive human-AI interaction.

Method: Proposes advancing through three interrelated themes: 1) Sensing - extending AI’s ability to capture richer world data beyond digital media, 2) Science - developing principled frameworks for quantifying multimodal heterogeneity, unified architectures, and cross-modal transfer, and 3) Synergy - addressing technical challenges in multisensory integration, alignment, reasoning, generation, generalization, and experience.

Result: The paper outlines a comprehensive research roadmap for multisensory AI over the next decade, accompanied by projects, resources, and demos from the MIT Media Lab’s Multisensory Intelligence group.

Conclusion: Multisensory AI represents a transformative direction that can fundamentally change how humans and AI experience and interact with each other by connecting AI to the full spectrum of human senses and environmental signals, requiring coordinated advances in sensing, scientific foundations, and synergistic learning.

Abstract: Our experience of the world is multisensory, spanning a synthesis of language, sight, sound, touch, taste, and smell. Yet, artificial intelligence has primarily advanced in digital modalities like text, vision, and audio. This paper outlines a research vision for multisensory artificial intelligence over the next decade. This new set of technologies can change how humans and AI experience and interact with one another, by connecting AI to the human senses and a rich spectrum of signals from physiological and tactile cues on the body, to physical and social signals in homes, cities, and the environment. We outline how this field must advance through three interrelated themes of sensing, science, and synergy. Firstly, research in sensing should extend how AI captures the world in richer ways beyond the digital medium. Secondly, developing a principled science for quantifying multimodal heterogeneity and interactions, developing unified modeling architectures and representations, and understanding cross-modal transfer. Finally, we present new technical challenges to learn synergy between modalities and between humans and AI, covering multisensory integration, alignment, reasoning, generation, generalization, and experience. Accompanying this vision paper are a series of projects, resources, and demos of latest advances from the Multisensory Intelligence group at the MIT Media Lab, see https://mit-mi.github.io/.

[459] Spatial-Temporal Feedback Diffusion Guidance for Controlled Traffic Imputation

Xiaowei Mao, Huihu Ding, Yan Lin, Tingrui Wu, Shengnan Guo, Dazhuo Qiu, Feiling Fang, Jilin Hu, Huaiyu Wan

Main category: cs.LG

TL;DR: FENCE: A spatial-temporal feedback diffusion guidance method that adaptively adjusts guidance scales for missing data imputation in traffic systems, improving accuracy by preventing generative drift in sparse observation scenarios.

Details

Motivation: Existing diffusion models for traffic data imputation use uniform guidance scales across spatial-temporal dimensions, which fails for nodes with high missing data rates. Sparse observations provide insufficient conditional guidance, causing the generative process to drift toward the learned prior distribution rather than following observations, leading to suboptimal imputation.

Method: FENCE introduces: 1) Dynamic feedback mechanism that adjusts guidance scale based on posterior likelihood approximations - increases when generated values diverge from observations, reduces when alignment improves to prevent overcorrection; 2) Cluster-level guidance scales by grouping nodes based on attention scores, leveraging spatial-temporal correlations for more accurate guidance.

Result: Experimental results on real-world traffic datasets show that FENCE significantly enhances imputation accuracy compared to existing methods.

Conclusion: FENCE effectively addresses the limitations of uniform guidance scales in diffusion models for spatial-temporal traffic data imputation by introducing adaptive feedback mechanisms and cluster-based guidance, leading to improved performance especially for nodes with high missing data rates.

Abstract: Imputing missing values in spatial-temporal traffic data is essential for intelligent transportation systems. Among advanced imputation methods, score-based diffusion models have demonstrated competitive performance. These models generate data by reversing a noising process, using observed values as conditional guidance. However, existing diffusion models typically apply a uniform guidance scale across both spatial and temporal dimensions, which is inadequate for nodes with high missing data rates. Sparse observations provide insufficient conditional guidance, causing the generative process to drift toward the learned prior distribution rather than closely following the conditional observations, resulting in suboptimal imputation performance. To address this, we propose FENCE, a spatial-temporal feedback diffusion guidance method designed to adaptively control guidance scales during imputation. First, FENCE introduces a dynamic feedback mechanism that adjusts the guidance scale based on the posterior likelihood approximations. The guidance scale is increased when generated values diverge from observations and reduced when alignment improves, preventing overcorrection. Second, because alignment to observations varies across nodes and denoising steps, a global guidance scale for all nodes is suboptimal. FENCE computes guidance scales at the cluster level by grouping nodes based on their attention scores, leveraging spatial-temporal correlations to provide more accurate guidance. Experimental results on real-world traffic datasets show that FENCE significantly enhances imputation accuracy.

[460] FedKDX: Federated Learning with Negative Knowledge Distillation for Enhanced Healthcare AI Systems

Quang-Tu Pham, Hoang-Dieu Vu, Dinh-Dat Pham, Hieu H. Pham

Main category: cs.LG

TL;DR: FedKDX is a federated learning framework using Negative Knowledge Distillation to improve healthcare AI by capturing both target and non-target information, enhancing generalization while maintaining privacy and reducing communication costs.

Details

Motivation: Existing federated learning approaches in healthcare focus only on positive knowledge transfer, limiting model generalization. Healthcare applications face challenges with statistical heterogeneity in distributed data, privacy requirements under regulations like HIPAA/GDPR, and the need for practical implementation in decentralized settings.

Method: FedKDX integrates multiple knowledge transfer techniques: traditional knowledge distillation, contrastive learning, and novel Negative Knowledge Distillation (NKD) that captures both target and non-target information. The framework maintains privacy while reducing communication costs through a unified architecture designed for federated learning.

Result: Experiments on healthcare datasets (SLEEP, UCI-HAR, PAMAP2) show FedKDX achieves up to 2.53% higher accuracy than state-of-the-art methods, faster convergence, and better performance on non-IID data distributions. Theoretical analysis supports NKD’s effectiveness in addressing statistical heterogeneity.

Conclusion: FedKDX offers a balanced solution for privacy-sensitive medical applications, showing promise for regulatory compliance while improving performance in decentralized healthcare settings. The framework addresses both technical challenges and practical implementation requirements.

Abstract: This paper introduces FedKDX, a federated learning framework that addresses limitations in healthcare AI through Negative Knowledge Distillation (NKD). Unlike existing approaches that focus solely on positive knowledge transfer, FedKDX captures both target and non-target information to improve model generalization in healthcare applications. The framework integrates multiple knowledge transfer techniques–including traditional knowledge distillation, contrastive learning, and NKD–within a unified architecture that maintains privacy while reducing communication costs. Through experiments on healthcare datasets (SLEEP, UCI-HAR, and PAMAP2), FedKDX demonstrates improved accuracy (up to 2.53% over state-of-the-art methods), faster convergence, and better performance on non-IID data distributions. Theoretical analysis supports NKD’s contribution to addressing statistical heterogeneity in distributed healthcare data. The approach shows promise for privacy-sensitive medical applications under regulatory frameworks like HIPAA and GDPR, offering a balanced solution between performance and practical implementation requirements in decentralized healthcare settings. The code and model are available at https://github.com/phamdinhdat-ai/Fed_2024.

[461] DeepHalo: A Neural Choice Model with Controllable Context Effects

Shuhan Zhang, Zhi Wang, Rui Gao, Shuang Li

Main category: cs.LG

TL;DR: DeepHalo: A neural framework for modeling context-dependent human choice with explicit control over interaction order and interpretable context effects.

Details

Motivation: Traditional choice models assume context-independent decisions, but behavioral research shows preferences are influenced by choice set composition (context/Halo effects). Existing models either ignore features, use restrictive interaction structures, or entangle all interaction orders, limiting interpretability.

Method: Propose DeepHalo, a neural modeling framework that incorporates features while enabling explicit control over interaction order and principled interpretation of context effects. The model systematically identifies interaction effects by order and serves as a universal approximator of context-dependent choice functions in featureless settings.

Result: Experiments on synthetic and real-world datasets demonstrate strong predictive performance while providing greater transparency into the drivers of choice.

Conclusion: DeepHalo offers an effective framework for modeling context-dependent human decision-making with improved interpretability and control over interaction effects, addressing limitations of existing approaches.

Abstract: Modeling human decision-making is central to applications such as recommendation, preference learning, and human-AI alignment. While many classic models assume context-independent choice behavior, a large body of behavioral research shows that preferences are often influenced by the composition of the choice set itself – a phenomenon known as the context effect or Halo effect. These effects can manifest as pairwise (first-order) or even higher-order interactions among the available alternatives. Recent models that attempt to capture such effects either focus on the featureless setting or, in the feature-based setting, rely on restrictive interaction structures or entangle interactions across all orders, which limits interpretability. In this work, we propose DeepHalo, a neural modeling framework that incorporates features while enabling explicit control over interaction order and principled interpretation of context effects. Our model enables systematic identification of interaction effects by order and serves as a universal approximator of context-dependent choice functions when specialized to a featureless setting. Experiments on synthetic and real-world datasets demonstrate strong predictive performance while providing greater transparency into the drivers of choice.

[462] Learning Dynamics in RL Post-Training for Language Models

Akiyoshi Tomihari

Main category: cs.LG

TL;DR: RL post-training reduces output diversity due to limited feature variability causing systematic confidence increases; proposed classifier-first RL (CF-RL) accelerates optimization by prioritizing classifier updates.

Details

Motivation: To understand why RL post-training reduces output diversity and to formalize the learning dynamics of RL post-training, which remains poorly understood despite its critical role in improving language model alignment and reasoning.

Method: Adopted empirical neural tangent kernel (NTK) framework to analyze RL learning dynamics, decomposing NTK into components to characterize how RL updates propagate. Proposed classifier-first reinforcement learning (CF-RL) - a two-stage strategy that prioritizes classifier updates before standard RL optimization.

Result: Analysis revealed limited feature variability causes RL updates to systematically increase model confidence, explaining reduced output diversity. CF-RL showed increased model confidence and accelerated optimization, with mechanism differing from linear-probing-then-fine-tuning in supervised learning.

Conclusion: The study formalizes RL post-training learning dynamics, provides explanation for reduced output diversity, and demonstrates CF-RL as an effective training strategy that motivates further analysis and improvements in RL post-training.

Abstract: Reinforcement learning (RL) post-training is a critical stage in modern language model development, playing a key role in improving alignment and reasoning ability. However, several phenomena remain poorly understood, including the reduction in output diversity. To gain a broader understanding of RL post-training, we analyze the learning dynamics of RL post-training from a perspective that has been studied in supervised learning but remains underexplored in RL. We adopt an empirical neural tangent kernel (NTK) framework and decompose the NTK into two components to characterize how RL updates propagate across training samples. Our analysis reveals that limited variability in feature representations can cause RL updates to systematically increase model confidence, providing an explanation for the commonly observed reduction in output diversity after RL post-training. Furthermore, we show that effective learning in this regime depends on rapidly shaping the classifier, which directly affects the gradient component of the NTK. Motivated by these insights, we propose classifier-first reinforcement learning (CF-RL), a simple two-stage training strategy that prioritizes classifier updates before standard RL optimization. Experimental results validate our theoretical analysis by demonstrating increased model confidence and accelerated optimization under CF-RL. Additional analysis shows that the mechanism underlying CF-RL differs from that of linear-probing-then-fine-tuning in supervised learning. Overall, our study formalizes the learning dynamics of RL post-training and motivates further analysis and improvement.

[463] Estimating Causal Effects in Gaussian Linear SCMs with Finite Data

Aurghya Maiti, Prateek Jain

Main category: cs.LG

TL;DR: The paper introduces Centralized Gaussian Linear SCMs (CGL-SCMs) to address overparameterization in causal effect estimation from observational data with latent confounders, and presents an EM-based algorithm for parameter learning.

Details

Motivation: Estimating causal effects from observational data is challenging with latent confounders. Gaussian Linear SCMs are analytically tractable but suffer from overparameterization, making parameter estimation infeasible with finite data.

Method: Introduces CGL-SCMs, a simplified subclass where exogenous variables follow standardized distributions. Presents a novel EM-based estimation algorithm to learn CGL-SCM parameters and estimate identifiable causal effects from finite observational samples.

Result: CGL-SCMs are shown to be equally expressive in terms of causal effect identifiability from observational distributions. Experiments on synthetic data and benchmark causal graphs demonstrate that learned models accurately recover causal distributions.

Conclusion: CGL-SCMs provide a practical solution to overparameterization in GL-SCMs while maintaining expressive power for causal effect estimation, with an effective EM-based algorithm for finite-sample learning.

Abstract: Estimating causal effects from observational data remains a fundamental challenge in causal inference, especially in the presence of latent confounders. This paper focuses on estimating causal effects in Gaussian Linear Structural Causal Models (GL-SCMs), which are widely used due to their analytical tractability. However, parameter estimation in GL-SCMs is often infeasible with finite data, primarily due to overparameterization. To address this, we introduce the class of Centralized Gaussian Linear SCMs (CGL-SCMs), a simplified yet expressive subclass where exogenous variables follow standardized distributions. We show that CGL-SCMs are equally expressive in terms of causal effect identifiability from observational distributions and present a novel EM-based estimation algorithm that can learn CGL-SCM parameters and estimate identifiable causal effects from finite observational samples. Our theoretical analysis is validated through experiments on synthetic data and benchmark causal graphs, demonstrating that the learned models accurately recover causal distributions.

[464] Nightmare Dreamer: Dreaming About Unsafe States And Planning Ahead

Oluwatosin Oseni, Shengjie Wang, Jun Zhu, Micah Corah

Main category: cs.LG

TL;DR: Nightmare Dreamer is a model-based Safe RL algorithm that uses learned world models to predict safety violations, achieving near-zero violations while maximizing rewards with 20x efficiency improvements on Safety Gymnasium tasks.

Details

Motivation: RL has shown success in robotics control but adoption is limited due to insufficient safety guarantees. Current approaches lack robust safety mechanisms, creating barriers for real-world deployment where safety is critical.

Method: Model-based Safe RL approach that learns a world model to predict potential safety violations. The algorithm uses this model to plan actions that avoid safety violations while maximizing rewards, operating with only image observations.

Result: Achieves nearly zero safety violations while maintaining high reward performance. Outperforms model-free baselines on Safety Gymnasium tasks with 20x improvement in efficiency using only image observations.

Conclusion: Nightmare Dreamer demonstrates that model-based approaches can provide strong safety guarantees in RL, enabling safer deployment in real-world applications like robotics control while maintaining high performance and efficiency.

Abstract: Reinforcement Learning (RL) has shown remarkable success in real-world applications, particularly in robotics control. However, RL adoption remains limited due to insufficient safety guarantees. We introduce Nightmare Dreamer, a model-based Safe RL algorithm that addresses safety concerns by leveraging a learned world model to predict potential safety violations and plan actions accordingly. Nightmare Dreamer achieves nearly zero safety violations while maximizing rewards. Nightmare Dreamer outperforms model-free baselines on Safety Gymnasium tasks using only image observations, achieving nearly a 20x improvement in efficiency.

[465] Do LLMs Benefit from User and Item Embeddings in Recommendation Tasks?

Mir Rayat Imtiaz Hossain, Leo Feng, Leonid Sigal, Mohamed Osama Ahmed

Main category: cs.LG

TL;DR: LLMs for recommendation enhanced by projecting collaborative filtering embeddings into token space alongside text, improving over text-only approaches.

Details

Motivation: Existing LLM-based recommendation methods rely too heavily on text semantics and incorporate collaborative signals poorly, typically using only single embeddings rather than handling multiple item embeddings from user history effectively.

Method: Project user and item embeddings from collaborative filtering into LLM token space via separate lightweight projector modules, then fine-tune LLM to condition on both projected embeddings and textual tokens for recommendation generation.

Result: Preliminary results show effective leveraging of structured user-item interaction data, improved recommendation performance over text-only LLM baselines.

Conclusion: The approach provides a practical path for bridging traditional recommendation systems with modern LLMs by effectively combining collaborative filtering signals with LLM capabilities.

Abstract: Large Language Models (LLMs) have emerged as promising recommendation systems, offering novel ways to model user preferences through generative approaches. However, many existing methods often rely solely on text semantics or incorporate collaborative signals in a limited manner, typically using only user or item embeddings. These methods struggle to handle multiple item embeddings representing user history, reverting to textual semantics and neglecting richer collaborative information. In this work, we propose a simple yet effective solution that projects user and item embeddings, learned from collaborative filtering, into the LLM token space via separate lightweight projector modules. A finetuned LLM then conditions on these projected embeddings alongside textual tokens to generate recommendations. Preliminary results show that this design effectively leverages structured user-item interaction data, improves recommendation performance over text-only LLM baselines, and offers a practical path for bridging traditional recommendation systems with modern LLMs.

[466] A zone-based training approach for last-mile routing using Graph Neural Networks and Pointer Networks

Àngel Ruiz-Fas, Carlos Granell, José Francisco Ramos, Joaquín Huerta, Sergio Trilles

Main category: cs.LG

TL;DR: Deep learning approach using GNN encoder and Pointer Network decoder for last-mile routing with asymmetric travel times, enhanced by geographical zoning to improve performance.

Details

Motivation: Last-mile delivery networks face challenges with asymmetric travel times (one-way streets, congestion) where classical heuristics struggle. Need for routing solutions that can handle real-world asymmetries to reduce costs, improve service speed, and lower emissions.

Method: Encoder-decoder architecture: GNN encoder creates node embeddings from asymmetric travel time graphs, Pointer Network decoder sequentially selects stops. Geographical zoning using Discrete Global Grid System clusters stops into zones, with separate model instances trained per zone.

Result: Zone-based training reduces average predicted route length compared to general training, with performance improvement becoming more pronounced as number of stops per route increases. Evaluated on Los Angeles routes from 2021 Amazon Last Mile Routing Challenge.

Conclusion: Geographical zoning combined with deep learning architecture effectively addresses asymmetric last-mile routing problems, showing scalable improvements especially for routes with more stops.

Abstract: Rapid e-commerce growth has pushed last-mile delivery networks to their limits, where small routing gains translate into lower costs, faster service, and fewer emissions. Classical heuristics struggle to adapt when travel times are highly asymmetric (e.g., one-way streets, congestion). A deep learning-based approach to the last-mile routing problem is presented to generate geographical zones composed of stop sequences to minimize last-mile delivery times. The presented approach is an encoder-decoder architecture. Each route is represented as a complete directed graph whose nodes are stops and whose edge weights are asymmetric travel times. A Graph Neural Network encoder produces node embeddings that captures the spatial relationships between stops. A Pointer Network decoder then takes the embeddings and the route’s start node to sequentially select the next stops, assigning a probability to each unvisited node as the next destination. Cells of a Discrete Global Grid System which contain route stops in the training data are obtained and clustered to generate geographical zones of similar size in which the process of training and inference are divided. Subsequently, a different instance of the model is trained per zone only considering the stops of the training routes which are included in that zone. This approach is evaluated using the Los Angeles routes from the 2021 Amazon Last Mile Routing Challenge. Results from general and zone-based training are compared, showing a reduction in the average predicted route length in the zone-based training compared to the general training. The performance improvement of the zone-based approach becomes more pronounced as the number of stops per route increases.

[467] MQ-GNN: A Multi-Queue Pipelined Architecture for Scalable and Efficient GNN Training

Irfan Ullah, Young-Koo Lee

Main category: cs.LG

TL;DR: MQ-GNN is a multi-queue pipelined framework that accelerates multi-GPU GNN training by overlapping stages, enabling asynchronous gradient sharing with consistency guarantees, and optimizing data transfer through caching and adaptive queue management.

Details

Motivation: Current GNN training frameworks suffer from scalability issues due to inefficient mini-batch generation, data transfer bottlenecks, and costly inter-GPU synchronization, leading to poor resource utilization and slow training times.

Method: MQ-GNN introduces: 1) Multi-queue pipelined framework to overlap training stages, 2) Ready-to-Update Asynchronous Consistent Model (RaCoM) for asynchronous gradient sharing with adaptive periodic synchronization, 3) Global neighbor sampling with caching to reduce data transfer, and 4) Adaptive queue-sizing strategy to balance computation and memory efficiency.

Result: Experiments on four large-scale datasets and ten baseline models show MQ-GNN achieves up to 4.6× faster training time and 30% improved GPU utilization while maintaining competitive accuracy.

Conclusion: MQ-GNN establishes itself as a scalable and efficient solution for multi-GPU GNN training by effectively addressing the key bottlenecks in existing frameworks through pipelining, asynchronous consistency, and optimized resource management.

Abstract: Graph Neural Networks (GNNs) are powerful tools for learning graph-structured data, but their scalability is hindered by inefficient mini-batch generation, data transfer bottlenecks, and costly inter-GPU synchronization. Existing training frameworks fail to overlap these stages, leading to suboptimal resource utilization. This paper proposes MQ-GNN, a multi-queue pipelined framework that maximizes training efficiency by interleaving GNN training stages and optimizing resource utilization. MQ-GNN introduces Ready-to-Update Asynchronous Consistent Model (RaCoM), which enables asynchronous gradient sharing and model updates while ensuring global consistency through adaptive periodic synchronization. Additionally, it employs global neighbor sampling with caching to reduce data transfer overhead and an adaptive queue-sizing strategy to balance computation and memory efficiency. Experiments on four large-scale datasets and ten baseline models demonstrate that MQ-GNN achieves up to \boldmath $\bm{4.6,\times}$ faster training time and 30% improved GPU utilization while maintaining competitive accuracy. These results establish MQ-GNN as a scalable and efficient solution for multi-GPU GNN training.

[468] GPU-Accelerated INT8 Quantization for KV Cache Compression in Large Language Models

Maanas Taneja, Purab Shingvi

Main category: cs.LG

TL;DR: INT8 quantization of KV cache reduces memory by 4x with minimal accuracy loss, using optimized CUDA kernels achieving up to 1694x speedup over CPU.

Details

Motivation: KV cache in LLMs creates major memory bottleneck during inference, growing linearly with sequence length and often exceeding model weights memory footprint.

Method: Implemented GPU-accelerated INT8 quantization for KV cache compression with four CUDA kernel variants: naive, tiled, coarsened, and vectorized. Benchmarked across realistic workloads up to 1 billion elements.

Result: Vectorized kernel achieves up to 1,694x speedup over CPU baselines while maintaining reconstruction error below 0.004 and attention score error below 0.1 even for 8K-dimensional heads. Provides 4x memory reduction with minimal computational overhead (6-58ms).

Conclusion: INT8 quantization offers practical approach for reducing memory pressure in LLM inference with negligible computational overhead and minimal impact on downstream model behavior.

Abstract: The key-value (KV) cache in large language models presents a significant memory bottleneck during inference, growing linearly with sequence length and often exceeding the memory footprint of model weights themselves. We implement and evaluate GPU-accelerated INT8 quantization for KV cache compression, achieving 4$\times$ memory reduction with minimal accuracy degradation. We develop four CUDA kernel variants – naive, tiled, coarsened, and vectorized – and benchmark them across realistic workload sizes up to 1 billion elements. Our vectorized kernel achieves up to 1,694$\times$ speedup over CPU baselines while maintaining reconstruction error below 0.004 and attention score error below 0.1 even for 8K-dimensional heads. These results demonstrate that INT8 quantization provides a practical approach for reducing memory pressure in LLM inference with negligible computational overhead (6–58ms) and minimal impact on downstream model behavior

[469] Excess Description Length of Learning Generalizable Predictors

Elizabeth Donoway, Hailey Joren, Fabien Roger, Jan Leike

Main category: cs.LG

TL;DR: The paper develops an information-theoretic framework called Excess Description Length (EDL) to quantify how much predictive structure fine-tuning extracts from training data and writes into model parameters, distinguishing between capability elicitation and teaching.

Details

Motivation: To address the fundamental question of whether fine-tuning elicits latent capabilities or teaches new ones, which is crucial for language model evaluation and safety. Current approaches lack rigorous quantitative frameworks for distinguishing these mechanisms.

Method: Develops Excess Description Length (EDL) based on prequential coding, measuring the gap between bits required to encode training labels sequentially using an evolving model and the residual encoding cost under the final trained model. Validates through toy models and theoretical analysis.

Result: EDL is non-negative in expectation, converges to surplus description length in infinite-data limit, and provides bounds on expected generalization gain. Toy models clarify why random labels yield near-zero EDL, how single examples can eliminate many bits of uncertainty, and how format learning creates distinct transients from capability acquisition.

Conclusion: The framework provides rigorous foundations for empirical observations that capability elicitation and teaching exhibit qualitatively distinct scaling signatures, offering a formal information-theoretic approach to analyze what fine-tuning actually accomplishes.

Abstract: Understanding whether fine-tuning elicits latent capabilities or teaches new ones is a fundamental question for language model evaluation and safety. We develop a formal information-theoretic framework for quantifying how much predictive structure fine-tuning extracts from the train dataset and writes into a model’s parameters. Our central quantity, Excess Description Length (EDL), is defined via prequential coding and measures the gap between the bits required to encode training labels sequentially using an evolving model (trained online) and the residual encoding cost under the final trained model. We establish that EDL is non-negative in expectation, converges to surplus description length in the infinite-data limit, and provides bounds on expected generalization gain. Through a series of toy models, we clarify common confusions about information in learning: why random labels yield EDL near zero, how a single example can eliminate many bits of uncertainty about the underlying rule(s) that describe the data distribution, why structure learned on rare inputs contributes proportionally little to expected generalization, and how format learning creates early transients distinct from capability acquisition. This framework provides rigorous foundations for the empirical observation that capability elicitation and teaching exhibit qualitatively distinct scaling signatures.

[470] Fast Mining and Dynamic Time-to-Event Prediction over Multi-sensor Data Streams

Kota Nakamura, Koki Kawabata, Yasuko Matsubara, Yasushi Sakurai

Main category: cs.LG

TL;DR: TimeCast: A dynamic prediction framework for continuously forecasting machine failure timing from multi-sensor data streams by identifying evolving patterns and adapting predictions in real-time.

Details

Motivation: Real-world sensor data streams from machines are dynamic with evolving patterns over time, requiring adaptive methods to accurately predict machine failures in real-time.

Method: TimeCast identifies distinct time-evolving patterns (stages) in multi-sensor data streams, learns individual models for each stage, and enables adaptive predictions based on pattern shifts. The method is designed to be scalable with linear time complexity and supports online model updates.

Result: Extensive experiments on real datasets show TimeCast provides higher prediction accuracy than state-of-the-art methods while detecting dynamic changes in data streams with significantly reduced computational time.

Conclusion: TimeCast successfully addresses the challenge of dynamic prediction in real-time sensor data streams by providing an adaptive, practical, and scalable framework for forecasting machine failure timing with improved accuracy and efficiency.

Abstract: Given real-time sensor data streams obtained from machines, how can we continuously predict when a machine failure will occur? This work aims to continuously forecast the timing of future events by analyzing multi-sensor data streams. A key characteristic of real-world data streams is their dynamic nature, where the underlying patterns evolve over time. To address this, we present TimeCast, a dynamic prediction framework designed to adapt to these changes and provide accurate, real-time predictions of future event time. Our proposed method has the following properties: (a) Dynamic: it identifies the distinct time-evolving patterns (i.e., stages) and learns individual models for each, enabling us to make adaptive predictions based on pattern shifts. (b) Practical: it finds meaningful stages that capture time-varying interdependencies between multiple sensors and improve prediction performance; (c) Scalable: our algorithm scales linearly with the input size and enables online model updates on data streams. Extensive experiments on real datasets demonstrate that TimeCast provides higher prediction accuracy than state-of-the-art methods while finding dynamic changes in data streams with a great reduction in computational time.

[471] Intraday spatiotemporal PV power prediction at national scale using satellite-based solar forecast models

Luca Lanzilao, Angela Meyer

Main category: cs.LG

TL;DR: This paper presents a comprehensive evaluation framework for spatiotemporal PV power forecasting, comparing seven intraday nowcasting models (satellite-based deep learning, optical-flow, and physics-based NWP) at national scale using 6434 PV stations in Switzerland.

Details

Motivation: To develop and evaluate a novel framework for spatiotemporal PV power forecasting at national scale, addressing the need for reliable, sharp, and accurate intraday nowcasting models that can handle mesoscale cloud systems affecting PV production.

Method: Developed a two-stage framework: 1) Validate forecasts against satellite-derived surface solar irradiance (SSI), 2) Convert irradiance fields to PV power using station-specific machine learning models. Evaluated seven models including satellite-based deep learning (IrradianceNet, SolarSTEPS, SHADECast), optical-flow approaches, and physics-based numerical weather prediction models (IFS-ENS), covering both deterministic and probabilistic formulations.

Result: Satellite-based approaches outperform IFS-ENS, especially at short lead times. SolarSTEPS and SHADECast deliver most accurate SSI and PV power predictions, with SHADECast providing most reliable ensemble spread. IrradianceNet achieves lowest RMSE. Satellite-based models forecast daily total PV generation with relative errors below 10% for 82% of days in 2019-2020. Forecast skill decreases with elevation.

Conclusion: This is the first national-scale spatiotemporal PV forecasting study, demonstrating satellite-based models’ superiority over physics-based NWP for intraday nowcasting, with robust performance suitable for operational use. The framework enables visualization of mesoscale cloud impacts on national PV production.

Abstract: We present a novel framework for spatiotemporal photovoltaic (PV) power forecasting and use it to evaluate the reliability, sharpness, and overall performance of seven intraday PV power nowcasting models. The model suite includes satellite-based deep learning and optical-flow approaches and physics-based numerical weather prediction models, covering both deterministic and probabilistic formulations. Forecasts are first validated against satellite-derived surface solar irradiance (SSI). Irradiance fields are then converted into PV power using station-specific machine learning models, enabling comparison with production data from 6434 PV stations across Switzerland. To our knowledge, this is the first study to investigate spatiotemporal PV forecasting at a national scale. We additionally provide the first visualizations of how mesoscale cloud systems shape national PV production on hourly and sub-hourly timescales. Our results show that satellite-based approaches outperform the Integrated Forecast System (IFS-ENS), particularly at short lead times. Among them, SolarSTEPS and SHADECast deliver the most accurate SSI and PV power predictions, with SHADECast providing the most reliable ensemble spread. The deterministic model IrradianceNet achieves the lowest root mean square error, while probabilistic forecasts of SolarSTEPS and SHADECast provide better-calibrated uncertainty. Forecast skill generally decreases with elevation. At a national scale, satellite-based models forecast the daily total PV generation with relative errors below 10% for 82% of the days in 2019-2020, demonstrating robustness and their potential for operational use.

[472] Smart IoT-Based Wearable Device for Detection and Monitoring of Common Cow Diseases Using a Novel Machine Learning Technique

Rupsa Rani Mishra, D. Chandrasekhar Rao, Ajaya Kumar Tripathy

Main category: cs.LG

TL;DR: Proposes IoT-enabled cyber-physical system with novel ML algorithm for automated detection of multiple cow diseases using physiological and behavioral data.

Details

Motivation: Manual cow health monitoring is labor-intensive, time-consuming, inaccurate, and costly in large-scale farming, leading to delayed disease detection and compromised animal health.

Method: IoT-enabled Cyber-Physical System framework collects physiological and behavioral data, with novel ML algorithm designed to predict multiple common diseases by analyzing comprehensive feature sets.

Result: The proposed system enables automated, low-cost, reliable health monitoring with enhanced accuracy and reduced operational costs compared to manual observation methods.

Conclusion: Automated IoT/ML-based system addresses limitations of manual monitoring, providing efficient multi-disease detection to improve animal health and farm productivity.

Abstract: Manual observation and monitoring of individual cows for disease detection present significant challenges in large-scale farming operations, as the process is labor-intensive, time-consuming, and prone to reduced accuracy. The reliance on human observation often leads to delays in identifying symptoms, as the sheer number of animals can hinder timely attention to each cow. Consequently, the accuracy and precision of disease detection are significantly compromised, potentially affecting animal health and overall farm productivity. Furthermore, organizing and managing human resources for the manual observation and monitoring of cow health is a complex and economically demanding task. It necessitates the involvement of skilled personnel, thereby contributing to elevated farm maintenance costs and operational inefficiencies. Therefore, the development of an automated, low-cost, and reliable smart system is essential to address these challenges effectively. Although several studies have been conducted in this domain, very few have simultaneously considered the detection of multiple common diseases with high prediction accuracy. However, advancements in Internet of Things (IoT), Machine Learning (ML), and Cyber-Physical Systems have enabled the automation of cow health monitoring with enhanced accuracy and reduced operational costs. This study proposes an IoT-enabled Cyber-Physical System framework designed to monitor the daily activities and health status of cow. A novel ML algorithm is proposed for the diagnosis of common cow diseases using collected physiological and behavioral data. The algorithm is designed to predict multiple diseases by analyzing a comprehensive set of recorded physiological and behavioral features, enabling accurate and efficient health assessment.

[473] AgentOCR: Reimagining Agent History via Optical Self-Compression

Lang Feng, Fuchao Yang, Feng Chen, Xin Cheng, Haiyang Xu, Zhenglin Wan, Ming Yan, Bo An

Main category: cs.LG

TL;DR: AgentOCR is a framework that converts textual agent histories into compact visual tokens to reduce token budgets and memory usage, using segment optical caching and agentic self-compression to maintain performance while improving efficiency.

Details

Motivation: Large language model agents trained with RL face practical deployment bottlenecks due to rapidly growing textual histories that inflate token budgets and memory usage during multi-turn interactions.

Method: 1) Represent accumulated observation-action history as compact rendered images using visual tokens; 2) Segment optical caching decomposes history into hashable segments with visual cache to eliminate redundant re-rendering; 3) Agentic self-compression where agents actively emit compression rates and are trained with compression-aware rewards to balance task success and token efficiency.

Result: AgentOCR preserves over 95% of text-based agent performance while reducing token consumption by >50%, achieves 20x rendering speedup from segment optical caching, and effectively balances task success with token efficiency through self-compression.

Conclusion: AgentOCR demonstrates that visual token representation combined with caching and self-compression mechanisms can substantially improve token and memory efficiency for LLM-based agentic systems while maintaining high task performance.

Abstract: Recent advances in large language models (LLMs) enable agentic systems trained with reinforcement learning (RL) over multi-turn interaction trajectories, but practical deployment is bottlenecked by rapidly growing textual histories that inflate token budgets and memory usage. We introduce AgentOCR, a framework that exploits the superior information density of visual tokens by representing the accumulated observation-action history as a compact rendered image. To make multi-turn rollouts scalable, AgentOCR proposes segment optical caching. By decomposing history into hashable segments and maintaining a visual cache, this mechanism eliminates redundant re-rendering. Beyond fixed rendering, AgentOCR introduces agentic self-compression, where the agent actively emits a compression rate and is trained with compression-aware reward to adaptively balance task success and token efficiency. We conduct extensive experiments on challenging agentic benchmarks, ALFWorld and search-based QA. Remarkably, results demonstrate that AgentOCR preserves over 95% of text-based agent performance while substantially reducing token consumption (>50%), yielding consistent token and memory efficiency. Our further analysis validates a 20x rendering speedup from segment optical caching and the effective strategic balancing of self-compression.

[474] Neural-Symbolic Integration with Evolvable Policies

Marios Thoma, Vassilis Vassiliades, Loizos Michael

Main category: cs.LG

TL;DR: Proposes an evolutionary framework for learning non-differentiable symbolic policies and neural weights concurrently in Neural-Symbolic AI systems without requiring predefined policies or differentiability.

Details

Motivation: Existing Neural-Symbolic frameworks require either predefined symbolic policies or differentiable policies, limiting applicability when domain expertise is unavailable or policies are inherently non-differentiable.

Method: Uses evolutionary process where NeSy systems evolve through mutations (symbolic rule additions and neural weight changes) with fitness-based selection. Extends NEUROLOG architecture, adapts Valiant’s Evolvability framework, and uses Machine Coaching semantics for mutable symbolic representations. Neural networks are trained through abductive reasoning from symbolic component.

Result: NeSy systems starting with empty policies and random neural weights can successfully approximate hidden non-differentiable target policies, achieving median correct performance approaching 100%.

Conclusion: Enables NeSy research in domains where acquiring symbolic knowledge from experts is challenging or infeasible, representing a step toward more flexible Neural-Symbolic AI systems.

Abstract: Neural-Symbolic (NeSy) Artificial Intelligence has emerged as a promising approach for combining the learning capabilities of neural networks with the interpretable reasoning of symbolic systems. However, existing NeSy frameworks typically require either predefined symbolic policies or policies that are differentiable, limiting their applicability when domain expertise is unavailable or when policies are inherently non-differentiable. We propose a framework that addresses this limitation by enabling the concurrent learning of both non-differentiable symbolic policies and neural network weights through an evolutionary process. Our approach casts NeSy systems as organisms in a population that evolve through mutations (both symbolic rule additions and neural weight changes), with fitness-based selection guiding convergence toward hidden target policies. The framework extends the NEUROLOG architecture to make symbolic policies trainable, adapts Valiant’s Evolvability framework to the NeSy context, and employs Machine Coaching semantics for mutable symbolic representations. Neural networks are trained through abductive reasoning from the symbolic component, eliminating differentiability requirements. Through extensive experimentation, we demonstrate that NeSy systems starting with empty policies and random neural weights can successfully approximate hidden non-differentiable target policies, achieving median correct performance approaching 100%. This work represents a step toward enabling NeSy research in domains where the acquisition of symbolic knowledge from experts is challenging or infeasible.

[475] Parallelizing Node-Level Explainability in Graph Neural Networks

Oscar Llorente, Jaime Boal, Eugenio F. Sánchez-Úbeda, Antonio Diaz-Cano, Miguel Familiar

Main category: cs.LG

TL;DR: Parallelizing GNN node-level explainability via graph partitioning for scalability, with memory-aware reconstruction for limited memory scenarios.

Details

Motivation: Node-level explainability in GNNs becomes extremely time-consuming for large graphs, and batching strategies degrade explanation quality, creating a need for scalable explainability solutions.

Method: Graph partitioning to decompose graphs into disjoint subgraphs for parallel computation of explainability, plus dropout-based reconstruction mechanism for memory-limited scenarios.

Result: Experimental results show substantial speedups on real-world datasets, enabling scalable and transparent explainability for large-scale GNN models.

Conclusion: The approach provides efficient parallel computation of GNN explainability through graph partitioning, with memory-aware reconstruction offering practical trade-offs for real-world deployment.

Abstract: Graph Neural Networks (GNNs) have demonstrated remarkable performance in a wide range of tasks, such as node classification, link prediction, and graph classification, by exploiting the structural information in graph-structured data. However, in node classification, computing node-level explainability becomes extremely time-consuming as the size of the graph increases, while batching strategies often degrade explanation quality. This paper introduces a novel approach to parallelizing node-level explainability in GNNs through graph partitioning. By decomposing the graph into disjoint subgraphs, we enable parallel computation of explainability for node neighbors, significantly improving the scalability and efficiency without affecting the correctness of the results, provided sufficient memory is available. For scenarios where memory is limited, we further propose a dropout-based reconstruction mechanism that offers a controllable trade-off between memory usage and explanation fidelity. Experimental results on real-world datasets demonstrate substantial speedups, enabling scalable and transparent explainability for large-scale GNN models.

[476] Rethinking GNNs and Missing Features: Challenges, Evaluation and a Robust Solution

Francesco Ferrini, Veronica Lachi, Antonio Longa, Bruno Lepri, Matono Akiyoshi, Andrea Passerini, Xin Liu, Manfred Jaeger

Main category: cs.LG

TL;DR: Paper addresses limitations of existing GNN research on missing node features by introducing dense-feature datasets, realistic missingness mechanisms, and proposing GNNmim baseline that performs competitively across diverse scenarios.

Details

Motivation: Existing GNN research on missing node features focuses on unrealistic scenarios: (1) high-dimensional but sparse features that mask true performance differences, and (2) Missing Completely At Random (MCAR) mechanisms that don't reflect real-world missingness patterns in domains like healthcare and sensor networks.

Method: 1) Theoretically analyze limitations of sparse features; 2) Introduce new datasets with dense, semantically meaningful features; 3) Design evaluation protocols with realistic missingness mechanisms beyond MCAR; 4) Provide theoretical background on missingness assumptions; 5) Propose GNNmim, a simple yet effective baseline for node classification with incomplete features.

Result: Experiments show that GNNmim is competitive with specialized architectures across diverse datasets and missingness regimes, demonstrating that a simple baseline can perform well when evaluated under more realistic conditions.

Conclusion: The paper provides a more realistic evaluation framework for GNNs with missing features, showing that existing benchmarks are insufficient and that simple methods like GNNmim can be competitive when properly evaluated under realistic conditions with dense features and realistic missingness mechanisms.

Abstract: Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in real-world domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, making all models appear robust and preventing a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide a theoretical background to state explicit assumptions on the missingness process and analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with respect to specialized architectures across diverse datasets and missingness regimes.

[477] FibreCastML: An Open Web Platform for Predicting Electrospun Nanofibre Diameter Distributions

Elisa Roldan, Kirstie Andrews, Stephen M. Richardson, Reyhaneh Fatahian, Glen Cooper, Rasool Erfani, Tasneem Sabir, Neil D. Reeves

Main category: cs.LG

TL;DR: FibreCastML is an open ML framework that predicts complete fibre diameter distributions (not just means) from electrospinning parameters, enabling more reproducible scaffold optimization.

Details

Motivation: Existing ML approaches for electrospinning only predict mean fibre diameters, neglecting the full diameter distribution that actually governs scaffold performance. There's a need for distribution-aware prediction to better optimize fibrous scaffolds for tissue engineering, drug delivery, and wound care applications.

Method: Created a meta-dataset of 68,538 fibre diameter measurements from 1,778 studies across 16 biomedical polymers. Used 6 standard processing parameters to train 7 ML models with nested cross-validation (leave-one-study-out external folds). Achieved interpretability through variable importance analysis, SHAP, correlation matrices, and 3D parameter maps.

Result: Non-linear models outperformed linear baselines, achieving R² > 0.91 for several polymers. Solution concentration was identified as the dominant global driver of fibre diameter distributions. Experimental validation showed close agreement between predicted and measured distributions across different electrospinning systems.

Conclusion: FibreCastML enables more reproducible and data-driven optimization of electrospun scaffold architectures by predicting complete fibre diameter distributions rather than just mean values, providing interpretable insights into process-structure relationships.

Abstract: Electrospinning is a scalable technique for producing fibrous scaffolds with tunable micro- and nanoscale architectures for applications in tissue engineering, drug delivery, and wound care. While machine learning (ML) has been used to support electrospinning process optimisation, most existing approaches predict only mean fibre diameters, neglecting the full diameter distribution that governs scaffold performance. This work presents FibreCastML, an open, distribution-aware ML framework that predicts complete fibre diameter spectra from routinely reported electrospinning parameters and provides interpretable insights into process structure relationships. A meta-dataset comprising 68538 individual fibre diameter measurements extracted from 1778 studies across 16 biomedical polymers was curated. Six standard processing parameters, namely solution concentration, applied voltage, flow rate, tip to collector distance, needle diameter, and collector rotation speed, were used to train seven ML models using nested cross validation with leave one study out external folds. Model interpretability was achieved using variable importance analysis, SHapley Additive exPlanations, correlation matrices, and three dimensional parameter maps. Non linear models consistently outperformed linear baselines, achieving coefficients of determination above 0.91 for several widely used polymers. Solution concentration emerged as the dominant global driver of fibre diameter distributions. Experimental validation across different electrospinning systems demonstrated close agreement between predicted and measured distributions. FibreCastML enables more reproducible and data driven optimisation of electrospun scaffold architectures.

[478] Learnable Multipliers: Freeing the Scale of Language Model Matrix Layers

Maksim Velikanov, Ilyas Chahed, Jingwei Zuo, Dhia Eddine Rhaiem, Younes Belkada, Hakim Hacid

Main category: cs.LG

TL;DR: Learnable scalar and per-row/column multipliers outperform weight decay equilibrium norms in LLM pretraining, improving performance and reducing tuning overhead.

Details

Motivation: The standard weight decay practice creates a WD-noise equilibrium norm that's suboptimal and harmful. The paper aims to address this by learning optimal scales through learnable multipliers instead of relying on the equilibrium norm.

Method: Introduces learnable multipliers to matrix layers: 1) scalar multiplier attached to W, 2) per-row and per-column multipliers to free individual row/column norms. This is presented as a more expressive generalization of muP multipliers.

Result: The method outperforms well-tuned muP baselines, reduces computational overhead of multiplier tuning, and shows improvements in downstream evaluations. It works with both Adam and Muon optimizers, with improvements matching the switch from Adam to Muon.

Conclusion: Learnable multipliers effectively address the harmful equilibrium norm artifact, provide better scaling than muP, and raise practical questions about forward-pass symmetries and width-scaling of learned multipliers.

Abstract: Applying weight decay (WD) to matrix layers is standard practice in large-language-model pretraining. Prior work suggests that stochastic gradient noise induces a Brownian-like expansion of the weight matrices W, whose growth is counteracted by WD, leading to a WD-noise equilibrium with a certain weight norm ||W||. In this work, we view the equilibrium norm as a harmful artifact of the training procedure, and address it by introducing learnable multipliers to learn the optimal scale. First, we attach a learnable scalar multiplier to W and confirm that the WD-noise equilibrium norm is suboptimal: the learned scale adapts to data and improves performance. We then argue that individual row and column norms are similarly constrained, and free their scale by introducing learnable per-row and per-column multipliers. Our method can be viewed as a learnable, more expressive generalization of muP multipliers. It outperforms a well-tuned muP baseline, reduces the computational overhead of multiplier tuning, and surfaces practical questions such as forward-pass symmetries and the width-scaling of the learned multipliers. Finally, we validate learnable multipliers with both Adam and Muon optimizers, where it shows improvement in downstream evaluations matching the improvement of the switching from Adam to Muon.

[479] Distributed Online Convex Optimization with Efficient Communication: Improved Algorithm and Lower bounds

Sifan Yang, Wenhao Yang, Wei Jiang, Lijun Zhang

Main category: cs.LG

TL;DR: This paper improves regret bounds for distributed online convex optimization with compressed communication by proposing a novel two-level blocking update framework that achieves better dependence on compression quality factor ω and number of learners n.

Details

Motivation: Prior work on distributed online convex optimization with compressed communication suffers from quadratic/quartic dependence on compression quality factor ω⁻¹ and super-linear dependence on number of learners n, which is undesirable for practical applications.

Method: Proposes a novel algorithm with a two-level blocking update framework incorporating two key ingredients: an online gossip strategy and an error compensation scheme, which collaborate to achieve better consensus among learners.

Result: Achieves improved regret bounds of Õ(ω⁻¹/²ρ⁻¹n√T) for convex functions and Õ(ω⁻¹ρ⁻²n ln T) for strongly convex functions, with first lower bounds established to justify optimality with respect to ω and T.

Conclusion: The proposed method significantly improves regret bounds for distributed online optimization with compressed communication, establishes fundamental lower bounds, and extends to bandit feedback scenarios with gradient estimators.

Abstract: We investigate distributed online convex optimization with compressed communication, where $n$ learners connected by a network collaboratively minimize a sequence of global loss functions using only local information and compressed data from neighbors. Prior work has established regret bounds of $O(\max{ω^{-2}ρ^{-4}n^{1/2},ω^{-4}ρ^{-8}}n\sqrt{T})$ and $O(\max{ω^{-2}ρ^{-4}n^{1/2},ω^{-4}ρ^{-8}}n\ln{T})$ for convex and strongly convex functions, respectively, where $ω\in(0,1]$ is the compression quality factor ($ω=1$ means no compression) and $ρ<1$ is the spectral gap of the communication matrix. However, these regret bounds suffer from a \emph{quadratic} or even \emph{quartic} dependence on $ω^{-1}$. Moreover, the \emph{super-linear} dependence on $n$ is also undesirable. To overcome these limitations, we propose a novel algorithm that achieves improved regret bounds of $\tilde{O}(ω^{-1/2}ρ^{-1}n\sqrt{T})$ and $\tilde{O}(ω^{-1}ρ^{-2}n\ln{T})$ for convex and strongly convex functions, respectively. The primary idea is to design a \emph{two-level blocking update framework} incorporating two novel ingredients: an online gossip strategy and an error compensation scheme, which collaborate to \emph{achieve a better consensus} among learners. Furthermore, we establish the first lower bounds for this problem, justifying the optimality of our results with respect to both $ω$ and $T$. Additionally, we consider the bandit feedback scenario, and extend our method with the classic gradient estimators to enhance existing regret bounds.

[480] Cardinality augmented loss functions

Miguel O’Malley

Main category: cs.LG

TL;DR: Cardinality augmented loss functions improve neural network training on imbalanced datasets by leveraging mathematical invariants that measure effective diversity, boosting minority class performance.

Details

Motivation: Class imbalance is a pervasive problem in neural network training where majority classes dominate, skewing classifier performance toward majority outcomes and harming minority class recognition.

Method: Introduces cardinality augmented loss functions derived from mathematical invariants like magnitude and spread. These invariants measure the “effective diversity” of metric spaces, providing a natural solution for overly homogeneous training data. Establishes methodology for applying these loss functions in neural network training.

Result: Significant performance improvement observed for minority classes, as well as overall performance metrics improvement. Tested on both artificially imbalanced datasets and a real-world imbalanced material science dataset.

Conclusion: Cardinality augmented loss functions represent an effective approach to address class imbalance in neural network training by leveraging mathematical concepts of effective diversity, leading to better performance for minority classes and overall improved metrics.

Abstract: Class imbalance is a common and pernicious issue for the training of neural networks. Often, an imbalanced majority class can dominate training to skew classifier performance towards the majority outcome. To address this problem we introduce cardinality augmented loss functions, derived from cardinality-like invariants in modern mathematics literature such as magnitude and the spread. These invariants enrich the concept of cardinality by evaluating the `effective diversity’ of a metric space, and as such represent a natural solution to overly homogeneous training data. In this work, we establish a methodology for applying cardinality augmented loss functions in the training of neural networks and report results on both artificially imbalanced datasets as well as a real-world imbalanced material science dataset. We observe significant performance improvement among minority classes, as well as improvement in overall performance metrics.

[481] Precision over Diversity: High-Precision Reward Generalizes to Robust Instruction Following

Yirong Zeng, Yufei Liu, Xiao Ding, Yutai Hou, Yuxian Wang, Haonan Song, Wu Ning, Dandan Tu, Qixun Zhang, Bibo Cai, Yuxiang He, Ting Liu

Main category: cs.LG

TL;DR: Challenges conventional wisdom that diverse constraint mixtures are essential for instruction following tasks, finding that hard-only constraints with high-precision rewards outperform mixed datasets.

Details

Motivation: To challenge the prevailing belief that diverse mixtures of verifiable hard and unverifiable soft constraints are essential for generalizing to unseen instructions in reinforcement learning for instruction following tasks.

Method: Systematic empirical investigation comparing models trained on hard-only constraints vs. mixed datasets, analysis of reward precision, LLM judge limitations, attention mechanism analysis, and proposing a data-centric refinement strategy prioritizing reward precision.

Result: Hard-only constraint models consistently outperform mixed datasets; high-precision rewards develop transferable meta-skills; proposed approach outperforms baselines by 13.4% with 58% reduction in training time while maintaining strong generalization.

Conclusion: Advocates for paradigm shift from indiscriminate pursuit of data diversity toward prioritizing high-precision rewards as the primary driver of effective alignment in instruction following tasks.

Abstract: A central belief in scaling reinforcement learning with verifiable rewards for instruction following (IF) tasks is that, a diverse mixture of verifiable hard and unverifiable soft constraints is essential for generalizing to unseen instructions. In this work, we challenge this prevailing consensus through a systematic empirical investigation. Counter-intuitively, we find that models trained on hard-only constraints consistently outperform those trained on mixed datasets. Extensive experiments reveal that reward precision, rather than constraint diversity, is the primary driver of effective alignment. The LLM judge suffers from a low recall rate in detecting false response, which leads to severe reward hacking, thereby undermining the benefits of diversity. Furthermore, analysis of the attention mechanism reveals that high-precision rewards develop a transferable meta-skill for IF. Motivated by these insights, we propose a simple yet effective data-centric refinement strategy that prioritizes reward precision. Evaluated on five benchmarks, our approach outperforms competitive baselines by 13.4% in performance while achieving a 58% reduction in training time, maintaining strong generalization beyond instruction following. Our findings advocate for a paradigm shift: moving away from the indiscriminate pursuit of data diversity toward high-precision rewards.

[482] On the Definition and Detection of Cherry-Picking in Counterfactual Explanations

James Hinns, Sofie Goethals, Stephan Van der Veeken, Theodoros Evgeniou, David Martens

Main category: cs.LG

TL;DR: Counterfactual explanations can be cherry-picked to show favorable model behavior while hiding problematic behavior, and detection of such manipulation is extremely limited even with full access to the explanation process.

Details

Motivation: Counterfactual explanations have multiple valid options for a single instance, allowing explanation providers to selectively present favorable examples while concealing problematic ones, creating potential for manipulation and bias in how model behavior is communicated.

Method: Formally define cherry-picking for counterfactual explanations using admissible explanation spaces and utility functions. Study detection capabilities across three access levels: full procedural access, partial procedural access, and explanation-only access. Empirically analyze variability in counterfactual quality metrics (proximity, plausibility, sparsity) compared to cherry-picking effects.

Result: Detection of cherry-picking is extremely limited in practice. Even with full procedural access, cherry-picked explanations can remain indistinguishable from non-cherry-picked ones due to the multiplicity of valid counterfactuals and flexibility in explanation specification. Empirical analysis shows variability in standard quality metrics often exceeds cherry-picking effects, making manipulated explanations statistically indistinguishable from baseline explanations.

Conclusion: Safeguards should prioritize reproducibility, standardization, and procedural constraints over post-hoc detection. Recommendations are provided for algorithm developers, explanation providers, and auditors to address the cherry-picking vulnerability in counterfactual explanations.

Abstract: Counterfactual explanations are widely used to communicate how inputs must change for a model to alter its prediction. For a single instance, many valid counterfactuals can exist, which leaves open the possibility for an explanation provider to cherry-pick explanations that better suit a narrative of their choice, highlighting favourable behaviour and withholding examples that reveal problematic behaviour. We formally define cherry-picking for counterfactual explanations in terms of an admissible explanation space, specified by the generation procedure, and a utility function. We then study to what extent an external auditor can detect such manipulation. Considering three levels of access to the explanation process: full procedural access, partial procedural access, and explanation-only access, we show that detection is extremely limited in practice. Even with full procedural access, cherry-picked explanations can remain difficult to distinguish from non cherry-picked explanations, because the multiplicity of valid counterfactuals and flexibility in the explanation specification provide sufficient degrees of freedom to mask deliberate selection. Empirically, we demonstrate that this variability often exceeds the effect of cherry-picking on standard counterfactual quality metrics such as proximity, plausibility, and sparsity, making cherry-picked explanations statistically indistinguishable from baseline explanations. We argue that safeguards should therefore prioritise reproducibility, standardisation, and procedural constraints over post-hoc detection, and we provide recommendations for algorithm developers, explanation providers, and auditors.

[483] On the Hidden Objective Biases of Group-based Reinforcement Learning

Aleksandar Fontana, Marco Simoni, Giulio Rossolini, Andrea Saracino, Paolo Mori

Main category: cs.LG

TL;DR: Theoretical analysis reveals structural mismatches and limitations in GRPO-style group-based RL methods for LLM post-training, including gradient biases, reward scaling insensitivity, and momentum-driven clipping violations.

Details

Motivation: Despite empirical success of group-based RL methods like GRPO for post-training large language models, there are structural mismatches between reward optimization and underlying training objectives that need theoretical investigation.

Method: Theoretical analysis of GRPO-style methods using a unified surrogate formulation to study their properties systematically.

Result: Identified three key limitations: (1) non-uniform group weighting causes systematic gradient biases on shared prefix tokens, (2) interactions with AdamW optimizer make training dynamics insensitive to reward scaling, and (3) optimizer momentum can push policy updates beyond intended clipping regions under repeated optimization steps.

Conclusion: These findings highlight fundamental limitations of current group-based RL approaches and provide principled guidance for designing improved formulations in the future.

Abstract: Group-based reinforcement learning methods, like Group Relative Policy Optimization (GRPO), are widely used nowadays to post-train large language models. Despite their empirical success, they exhibit structural mismatches between reward optimization and the underlying training objective. In this paper, we present a theoretical analysis of GRPO style methods by studying them within a unified surrogate formulation. This perspective reveals recurring properties that affect all the methods under analysis: (i) non-uniform group weighting induces systematic gradient biases on shared prefix tokens; (ii) interactions with the AdamW optimizer make training dynamics largely insensitive to reward scaling; and (iii) optimizer momentum can push policy updates beyond the intended clipping region under repeated optimization steps. We believe that these findings highlight fundamental limitations of current approaches and provide principled guidance for the design of future formulations.

[484] HMVI: Unifying Heterogeneous Attributes with Natural Neighbors for Missing Value Inference

Xiaopeng Luo, Zexi Tan, Zhuowei Wang

Main category: cs.LG

TL;DR: A novel imputation method that models cross-type feature dependencies between numerical and categorical attributes in a unified framework, outperforming existing techniques and improving downstream ML tasks.

Details

Motivation: Current imputation methods handle numerical and categorical attributes independently, overlooking critical interdependencies among heterogeneous features, which limits their effectiveness in real-world systems with missing data.

Method: Proposes a unified framework that explicitly models cross-type feature dependencies, leveraging both complete and incomplete instances to ensure accurate and consistent imputation in tabular data.

Result: Extensive experiments demonstrate superior performance over existing imputation techniques and significant enhancement of downstream machine learning tasks.

Conclusion: The proposed approach provides a robust solution for real-world systems with missing data by effectively capturing feature interdependencies in a unified imputation framework.

Abstract: Missing value imputation is a fundamental challenge in machine intelligence, heavily dependent on data completeness. Current imputation methods often handle numerical and categorical attributes independently, overlooking critical interdependencies among heterogeneous features. To address these limitations, we propose a novel imputation approach that explicitly models cross-type feature dependencies within a unified framework. Our method leverages both complete and incomplete instances to ensure accurate and consistent imputation in tabular data. Extensive experimental results demonstrate that the proposed approach achieves superior performance over existing techniques and significantly enhances downstream machine learning tasks, providing a robust solution for real-world systems with missing data.

[485] Approximate equivariance via projection-based regularisation

Torben Berndt, Jan Stühmer

Main category: cs.LG

TL;DR: A projection-based regularizer for approximate equivariance that outperforms sample-based methods in efficiency and performance by penalizing non-equivariance at operator level rather than point-wise.

Details

Motivation: While equivariance improves generalization and physical consistency, non-equivariant models have better runtime performance and handle imperfect symmetries in real-world applications. Existing approximate equivariance methods use sample-based regularizers with high sample complexity, especially for continuous groups like SO(3).

Method: Develops a projection-based regularizer that leverages orthogonal decomposition of linear layers into equivariant and non-equivariant components. Penalizes non-equivariance at operator level across full group orbit rather than point-wise. Provides mathematical framework for computing penalty exactly and efficiently in both spatial and spectral domains.

Result: Method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularizers.

Conclusion: Projection-based regularizer offers superior approach to approximate equivariance by addressing limitations of sample-based methods, particularly for continuous symmetry groups, through efficient operator-level regularization.

Abstract: Equivariance is a powerful inductive bias in neural networks, improving generalisation and physical consistency. Recently, however, non-equivariant models have regained attention, due to their better runtime performance and imperfect symmetries that might arise in real-world applications. This has motivated the development of approximately equivariant models that strike a middle ground between respecting symmetries and fitting the data distribution. Existing approaches in this field usually apply sample-based regularisers which depend on data augmentation at training time, incurring a high sample complexity, in particular for continuous groups such as $SO(3)$. This work instead approaches approximate equivariance via a projection-based regulariser which leverages the orthogonal decomposition of linear layers into equivariant and non-equivariant components. In contrast to existing methods, this penalises non-equivariance at an operator level across the full group orbit, rather than point-wise. We present a mathematical framework for computing the non-equivariance penalty exactly and efficiently in both the spatial and spectral domain. In our experiments, our method consistently outperforms prior approximate equivariance approaches in both model performance and efficiency, achieving substantial runtime gains over sample-based regularisers.

[486] A Data-Driven Predictive Framework for Inventory Optimization Using Context-Augmented Machine Learning Models

Anees Fatima, Mohammad Abdus Salam

Main category: cs.LG

TL;DR: This paper investigates machine learning algorithms for demand forecasting in retail and vending machine sectors, finding that XGBoost with external factors achieves the best performance.

Details

Motivation: Traditional demand forecasting approaches in supply chain management often neglect external influences like weather, holidays, and equipment breakdowns, leading to inefficiencies in inventory optimization and waste reduction.

Method: The research compares four ML algorithms (XGBoost, ARIMA, Facebook Prophet, and SVR) for demand prediction, systematically incorporating external factors such as weekdays, holidays, and sales deviation indicators to enhance precision.

Result: XGBoost achieved the lowest Mean Absolute Error (MAE) of 22.7 when external variables were included. ARIMAX and Facebook Prophet showed significant improvements, while SVR performed poorly in comparison.

Conclusion: Incorporating external factors significantly improves demand forecasting accuracy, with XGBoost identified as the most effective algorithm. The study provides a robust framework for enhancing inventory management in retail and vending machine systems.

Abstract: Demand forecasting in supply chain management (SCM) is critical for optimizing inventory, reducing waste, and improving customer satisfaction. Conventional approaches frequently neglect external influences like weather, festivities, and equipment breakdowns, resulting in inefficiencies. This research investigates the use of machine learning (ML) algorithms to improve demand prediction in retail and vending machine sectors. Four machine learning algorithms. Extreme Gradient Boosting (XGBoost), Autoregressive Integrated Moving Average (ARIMA), Facebook Prophet (Fb Prophet), and Support Vector Regression (SVR) were used to forecast inventory requirements. Ex-ternal factors like weekdays, holidays, and sales deviation indicators were methodically incorporated to enhance precision. XGBoost surpassed other models, reaching the lowest Mean Absolute Error (MAE) of 22.7 with the inclusion of external variables. ARIMAX and Fb Prophet demonstrated noteworthy enhancements, whereas SVR fell short in performance. Incorporating external factors greatly improves the precision of demand forecasting models, and XGBoost is identified as the most efficient algorithm. This study offers a strong framework for enhancing inventory management in retail and vending machine systems.

[487] DeepWeightFlow: Re-Basined Flow Matching for Generating Neural Network Weights

Saumya Gupta, Scott Biggs, Moritz Laber, Zohair Shafi, Robin Walters, Ayan Paul

Main category: cs.LG

TL;DR: DeepWeightFlow is a Flow Matching model that generates diverse, high-accuracy neural network weights directly in weight space without requiring fine-tuning, addressing challenges of high-dimensional weight spaces and neural network symmetries.

Details

Motivation: Existing generative models for neural network weights face challenges: they either generate only partial weights for larger models (ResNet, ViT), or generate complete weights but struggle with speed or require fine-tuning. There's a need for efficient, scalable generation of diverse neural network weights that work well without additional tuning.

Method: DeepWeightFlow uses Flow Matching operating directly in weight space. It incorporates Git Re-Basin and TransFusion for neural network canonicalization to handle permutation symmetries and improve generation efficiency for larger models. The approach generates complete neural network weights for various architectures and sizes.

Result: The generated networks achieve high accuracy without requiring fine-tuning, scale to large networks, and excel at transfer learning. Hundreds of neural networks can be generated in minutes, far exceeding diffusion-based methods in efficiency.

Conclusion: DeepWeightFlow enables efficient and scalable generation of diverse neural network sets, paving the way for more practical weight-space generative models that overcome previous limitations in speed, scalability, and the need for fine-tuning.

Abstract: Building efficient and effective generative models for neural network weights has been a research focus of significant interest that faces challenges posed by the high-dimensional weight spaces of modern neural networks and their symmetries. Several prior generative models are limited to generating partial neural network weights, particularly for larger models, such as ResNet and ViT. Those that do generate complete weights struggle with generation speed or require finetuning of the generated models. In this work, we present DeepWeightFlow, a Flow Matching model that operates directly in weight space to generate diverse and high-accuracy neural network weights for a variety of architectures, neural network sizes, and data modalities. The neural networks generated by DeepWeightFlow do not require fine-tuning to perform well and can scale to large networks. We apply Git Re-Basin and TransFusion for neural network canonicalization in the context of generative weight models to account for the impact of neural network permutation symmetries and to improve generation efficiency for larger model sizes. The generated networks excel at transfer learning, and ensembles of hundreds of neural networks can be generated in minutes, far exceeding the efficiency of diffusion-based methods. DeepWeightFlow models pave the way for more efficient and scalable generation of diverse sets of neural networks.

[488] Milestones over Outcome: Unlocking Geometric Reasoning with Sub-Goal Verifiable Reward

Jianlong Chen, Daocheng Fu, Shengze Xu, Jiawei Chen, Yuan Feng, Yue Yang, Junchi Yan, Hongyuan Zha, Renqiu Xia

Main category: cs.LG

TL;DR: The paper introduces GeoGoal benchmark and SGVR framework to address MLLMs’ geometric reasoning limitations by shifting from outcome-based to subgoal-level evaluation and learning.

Details

Motivation: Multimodal Large Language Models struggle with complex geometric reasoning because traditional "black box" outcome-based supervision cannot distinguish between lucky guesses and rigorous deductive reasoning, leading to unreliable performance assessment.

Method: 1) Construct GeoGoal benchmark using formal verification data engine to convert abstract proofs into verifiable numeric subgoals; 2) Propose Sub-Goal Verifiable Reward (SGVR) framework that replaces sparse outcome signals with dense rewards based on Skeleton Rate.

Result: SGVR enhances geometric reasoning performance by +9.7%, shows strong generalization with gains in general math (+8.0%) and other general reasoning tasks (+2.8%), demonstrating broad applicability across diverse domains.

Conclusion: The subgoal-level evaluation and learning paradigm effectively addresses MLLMs’ geometric reasoning limitations, with SGVR framework providing a more reliable approach that transfers benefits to other reasoning domains, offering a promising direction for improving complex reasoning in multimodal models.

Abstract: Multimodal Large Language Models (MLLMs) struggle with complex geometric reasoning, largely because “black box” outcome-based supervision fails to distinguish between lucky guesses and rigorous deduction. To address this, we introduce a paradigm shift towards subgoal-level evaluation and learning. We first construct GeoGoal, a benchmark synthesized via a rigorous formal verification data engine, which converts abstract proofs into verifiable numeric subgoals. This structure reveals a critical divergence between reasoning quality and outcome accuracy. Leveraging this, we propose the Sub-Goal Verifiable Reward (SGVR) framework, which replaces sparse signals with dense rewards based on the Skeleton Rate. Experiments demonstrate that SGVR not only enhances geometric performance (+9.7%) but also exhibits strong generalization, transferring gains to general math (+8.0%) and other general reasoning tasks (+2.8%), demonstrating broad applicability across diverse domains.

[489] Exploring Student Expectations and Confidence in Learning Analytics

Hayk Asatryan, Basile Tousside, Janis Mohr, Malte Neugebauer, Hildo Bijl, Paul Spiegelberg, Claudia Frohn-Schauf, Jörg Frochte

Main category: cs.LG

TL;DR: The paper analyzes student expectations and confidence regarding Learning Analytics data processing using SELAQ questionnaire, identifying four student clusters: Enthusiasts, Realists, Cautious, and Indifferents.

Details

Motivation: Learning Analytics is widely used in education but must comply with privacy regulations. Understanding student perspectives on data processing for LA is crucial for ethical implementation and acceptance.

Method: Used the Student Expectation of Learning Analytics Questionnaire (SELAQ) to survey students from different faculties about their expectations and confidence regarding LA data processing. Applied clustering algorithms to analyze the responses.

Result: Identified four distinct student clusters: Enthusiasts (positive about LA), Realists (balanced view), Cautious (concerned about privacy), and Indifferents (neutral/uninterested).

Conclusion: The structured analysis provides valuable insights into student acceptance and criticism of Learning Analytics, highlighting diverse perspectives that can inform ethical implementation strategies.

Abstract: Learning Analytics (LA) is nowadays ubiquitous in many educational systems, providing the ability to collect and analyze student data in order to understand and optimize learning and the environments in which it occurs. On the other hand, the collection of data requires to comply with the growing demand regarding privacy legislation. In this paper, we use the Student Expectation of Learning Analytics Questionnaire (SELAQ) to analyze the expectations and confidence of students from different faculties regarding the processing of their data for Learning Analytics purposes. This allows us to identify four clusters of students through clustering algorithms: Enthusiasts, Realists, Cautious and Indifferents. This structured analysis provides valuable insights into the acceptance and criticism of Learning Analytics among students.

[490] Sequential Subspace Noise Injection Prevents Accuracy Collapse in Certified Unlearning

Polina Dolgova, Sebastian U. Stich

Main category: cs.LG

TL;DR: Sequential noise scheduling improves certified unlearning by distributing noise across parameter subspaces, maintaining privacy guarantees while significantly boosting model accuracy.

Details

Motivation: Current certified unlearning methods based on differential privacy provide strong guarantees but severely degrade model accuracy, making them impractical for real-world use.

Method: Sequential noise scheduling that distributes noise budget across orthogonal subspaces of parameter space instead of injecting all noise at once, preserving certification guarantees while mitigating noise damage.

Result: Substantially improved accuracy after unlearning on image classification benchmarks while remaining robust to membership inference attacks, showing certified unlearning can be both rigorous and practical.

Conclusion: Certified unlearning can achieve both strong privacy guarantees and practical utility through sequential noise scheduling across parameter subspaces.

Abstract: Certified unlearning based on differential privacy offers strong guarantees but remains largely impractical: the noisy fine-tuning approaches proposed so far achieve these guarantees but severely reduce model accuracy. We propose sequential noise scheduling, which distributes the noise budget across orthogonal subspaces of the parameter space, rather than injecting it all at once. This simple modification mitigates the destructive effect of noise while preserving the original certification guarantees. We extend the analysis of noisy fine-tuning to the subspace setting, proving that the same $(\varepsilon,δ)$ privacy budget is retained. Empirical results on image classification benchmarks show that our approach substantially improves accuracy after unlearning while remaining robust to membership inference attacks. These results show that certified unlearning can achieve both rigorous guarantees and practical utility.

[491] Safe Continual Reinforcement Learning Methods for Nonstationary Environments. Towards a Survey of the State of the Art

Timofey Tomashevskiy

Main category: cs.LG

TL;DR: Survey paper on continual safe online reinforcement learning (COSRL) methods, covering theoretical aspects, challenges, taxonomy, safety constraints, and future directions for reliable safe online learning algorithms.

Details

Motivation: To provide a comprehensive overview of the state-of-the-art in continual safe online reinforcement learning, addressing the need for algorithms that can maintain safety while adapting to nonstationary environments and distribution shifts.

Method: The paper presents a survey methodology with taxonomy development based on safe learning mechanisms that account for nonstationarity adaptation. It categorizes safety constraint formulations for online RL algorithms and analyzes various approaches including HM-MDP, NSMDP, POMDP, and safe POMDP frameworks.

Result: A systematic taxonomy and detailed analysis of COSRL methods, categorization of safety constraints for online RL, and identification of theoretical challenges and open questions in the field.

Conclusion: The survey highlights the importance of developing reliable safe online learning algorithms that can handle nonstationarity while maintaining safety, and discusses prospects for future research in creating robust continual safe reinforcement learning systems.

Abstract: This work provides a state-of-the-art survey of continual safe online reinforcement learning (COSRL) methods. We discuss theoretical aspects, challenges, and open questions in building continual online safe reinforcement learning algorithms. We provide the taxonomy and the details of continual online safe reinforcement learning methods based on the type of safe learning mechanism that takes adaptation to nonstationarity into account. We categorize safety constraints formulation for online reinforcement learning algorithms, and finally, we discuss prospects for creating reliable, safe online learning algorithms. Keywords: safe RL in nonstationary environments, safe continual reinforcement learning under nonstationarity, HM-MDP, NSMDP, POMDP, safe POMDP, constraints for continual learning, safe continual reinforcement learning review, safe continual reinforcement learning survey, safe continual reinforcement learning, safe online learning under distribution shift, safe continual online adaptation, safe reinforcement learning, safe exploration, safe adaptation, constrained Markov decision processes, safe reinforcement learning, partially observable Markov decision process, safe reinforcement learning and hidden Markov decision processes, Safe Online Reinforcement Learning, safe online reinforcement learning, safe online reinforcement learning, safe meta-learning, safe meta-reinforcement learning, safe context-based reinforcement learning, formulating safety constraints for continual learning

[492] FaST: Efficient and Effective Long-Horizon Forecasting for Large-Scale Spatial-Temporal Graphs via Mixture-of-Experts

Yiji Zhao, Zihao Zhong, Ao Wang, Haomin Wen, Ming Jin, Yuxuan Liang, Huaiyu Wan, Hao Wu

Main category: cs.LG

TL;DR: FaST is an efficient framework for long-horizon, large-scale spatial-temporal graph forecasting using heterogeneity-aware Mixture-of-Experts to enable week-ahead predictions on thousands of nodes.

Details

Motivation: Existing spatial-temporal graph forecasting models focus on short-horizon predictions and suffer from high computational costs and memory consumption when scaling to long-horizon predictions and large graphs.

Method: FaST uses two key innovations: 1) adaptive graph agent attention to reduce computational burden of graph convolution and self-attention on large graphs, and 2) parallel MoE module with Gated Linear Units replacing traditional feed-forward networks for efficient parallel structure.

Result: FaST achieves superior long-horizon predictive accuracy and remarkable computational efficiency compared to state-of-the-art baselines on real-world datasets, enabling week-ahead predictions (672 steps at 15-minute granularity) with thousands of nodes.

Conclusion: FaST presents an effective and efficient framework for long-horizon, large-scale spatial-temporal graph forecasting that addresses computational limitations of existing approaches while maintaining high predictive accuracy.

Abstract: Spatial-Temporal Graph (STG) forecasting on large-scale networks has garnered significant attention. However, existing models predominantly focus on short-horizon predictions and suffer from notorious computational costs and memory consumption when scaling to long-horizon predictions and large graphs. Targeting the above challenges, we present FaST, an effective and efficient framework based on heterogeneity-aware Mixture-of-Experts (MoEs) for long-horizon and large-scale STG forecasting, which unlocks one-week-ahead (672 steps at a 15-minute granularity) prediction with thousands of nodes. FaST is underpinned by two key innovations. First, an adaptive graph agent attention mechanism is proposed to alleviate the computational burden inherent in conventional graph convolution and self-attention modules when applied to large-scale graphs. Second, we propose a new parallel MoE module that replaces traditional feed-forward networks with Gated Linear Units (GLUs), enabling an efficient and scalable parallel structure. Extensive experiments on real-world datasets demonstrate that FaST not only delivers superior long-horizon predictive accuracy but also achieves remarkable computational efficiency compared to state-of-the-art baselines. Our source code is available at: https://github.com/yijizhao/FaST.

[493] An interpretable data-driven approach to optimizing clinical fall risk assessment

Fardin Ganjkhanloo, Emmett Springer, Erik H. Hoyer, Daniel L. Young, Holley Farley, Kimia Ghobadi

Main category: cs.LG

TL;DR: Data-driven optimization of JHFRAT scoring weights improves fall risk prediction while maintaining clinical interpretability and workflow.

Details

Motivation: To better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with clinically meaningful measures while preserving interpretability and existing clinical workflows.

Method: Retrospective cohort analysis of 54,209 inpatient admissions, using constrained score optimization (CSO) models to reweight JHFRAT scoring weights while preserving additive structure and clinical thresholds. Compared CSO with current JHFRAT and black-box XGBoost models.

Result: CSO significantly improved predictive performance (AUC-ROC=0.91 vs JHFRAT’s 0.86), protecting additional 35 high-risk patients per week. CSO performed similarly with/without EHR variables and showed more robustness to risk labeling variations than XGBoost (AUC-ROC=0.94).

Conclusion: Evidence-based CSO approach provides robust foundation for enhancing inpatient fall prevention protocols and patient safety through data-driven optimization while maintaining clinical interpretability and workflow.

Abstract: In this study, we aim to better align fall risk prediction from the Johns Hopkins Fall Risk Assessment Tool (JHFRAT) with additional clinically meaningful measures via a data-driven modelling approach. We conducted a retrospective cohort analysis of 54,209 inpatient admissions from three Johns Hopkins Health System hospitals between March 2022 and October 2023. A total of 20,208 admissions were included as high fall risk encounters, and 13,941 were included as low fall risk encounters. To incorporate clinical knowledge and maintain interpretability, we employed constrained score optimization (CSO) models to reweight the JHFRAT scoring weights, while preserving its additive structure and clinical thresholds. Recalibration refers to adjusting item weights so that the resulting score can order encounters more consistently by the study’s risk labels, and without changing the tool’s form factor or deployment workflow. The model demonstrated significant improvements in predictive performance over the current JHFRAT (CSO AUC-ROC=0.91, JHFRAT AUC-ROC=0.86). This performance improvement translates to protecting an additional 35 high-risk patients per week across the Johns Hopkins Health System. The constrained score optimization models performed similarly with and without the EHR variables. Although the benchmark black-box model (XGBoost), improves upon the performance metrics of the knowledge-based constrained logistic regression (AUC-ROC=0.94), the CSO demonstrates more robustness to variations in risk labeling. This evidence-based approach provides a robust foundation for health systems to systematically enhance inpatient fall prevention protocols and patient safety using data-driven optimization techniques, contributing to improved risk assessment and resource allocation in healthcare settings.

[494] EARL: Energy-Aware Optimization of Liquid State Machines for Pervasive AI

Zain Iqbal, Lorenzo Valerio

Main category: cs.LG

TL;DR: EARL is an energy-aware reinforcement learning framework that optimizes Liquid State Machines for on-device AI, achieving higher accuracy, lower energy consumption, and faster optimization compared to existing methods.

Details

Motivation: Pervasive AI needs low-latency, energy-efficient on-device learning systems, but Liquid State Machines (LSMs) face deployment challenges due to hyperparameter sensitivity and computational cost of traditional optimization methods that ignore energy constraints.

Method: EARL integrates Bayesian optimization with adaptive reinforcement learning selection policy to jointly optimize accuracy and energy consumption. It uses surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and early termination to eliminate redundant evaluations.

Result: On three benchmark datasets, EARL achieves 6-15% higher accuracy, 60-80% lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks.

Conclusion: EARL demonstrates the effectiveness of energy-aware adaptive search in improving efficiency and scalability of LSMs for resource-constrained on-device AI applications.

Abstract: Pervasive AI increasingly depends on on-device learning systems that deliver low-latency and energy-efficient computation under strict resource constraints. Liquid State Machines (LSMs) offer a promising approach for low-power temporal processing in pervasive and neuromorphic systems, but their deployment remains challenging due to high hyperparameter sensitivity and the computational cost of traditional optimization methods that ignore energy constraints. This work presents EARL, an energy-aware reinforcement learning framework that integrates Bayesian optimization with an adaptive reinforcement learning based selection policy to jointly optimize accuracy and energy consumption. EARL employs surrogate modeling for global exploration, reinforcement learning for dynamic candidate prioritization, and an early termination mechanism to eliminate redundant evaluations, substantially reducing computational overhead. Experiments on three benchmark datasets demonstrate that EARL achieves 6 to 15 percent higher accuracy, 60 to 80 percent lower energy consumption, and up to an order of magnitude reduction in optimization time compared to leading hyperparameter tuning frameworks. These results highlight the effectiveness of energy-aware adaptive search in improving the efficiency and scalability of LSMs for resource-constrained on-device AI applications.

[495] Robust Reasoning as a Symmetry-Protected Topological Phase

Ilmo Sung

Main category: cs.LG

TL;DR: The paper proposes viewing robust logical inference in LLMs as a Symmetry-Protected Topological phase, analogous to non-Abelian anyon braiding, which provides immunity to semantic noise through topological invariants rather than geometric interpolation.

Details

Motivation: Large language models suffer from "hallucinations" - logical inconsistencies caused by semantic noise. Current architectures operate in a "Metric Phase" where causal order is vulnerable to spontaneous symmetry breaking, making them fragile to noise.

Method: The authors propose a Holonomic Network that treats robust inference as a Symmetry-Protected Topological phase, where logical operations are formally isomorphic to non-Abelian anyon braiding. This replaces fragile geometric interpolation with robust topological invariants.

Result: Empirical results show a sharp topological phase transition: while Transformers and RNNs exhibit gapless decay, the Holonomic Network reveals a macroscopic “mass gap” maintaining invariant fidelity below critical noise. In variable-binding tasks on S₁₀ (3.6×10⁶ states), the topological model maintains perfect fidelity extrapolating 100× beyond training (L=50→5000), while Transformers lose logical coherence.

Conclusion: The protection emerges strictly from non-Abelian gauge symmetry, providing strong evidence for a new universality class for logical reasoning that links causal stability to the topology of the semantic manifold, enabling theoretically indefinite causal horizons.

Abstract: Large language models suffer from “hallucinations”-logical inconsistencies induced by semantic noise. We propose that current architectures operate in a “Metric Phase,” where causal order is vulnerable to spontaneous symmetry breaking. Here, we identify robust inference as an effective Symmetry-Protected Topological phase, where logical operations are formally isomorphic to non-Abelian anyon braiding, replacing fragile geometric interpolation with robust topological invariants. Empirically, we demonstrate a sharp topological phase transition: while Transformers and RNNs exhibit gapless decay, our Holonomic Network reveals a macroscopic “mass gap,” maintaining invariant fidelity below a critical noise threshold. Furthermore, in a variable-binding task on $S_{10}$ ($3.6 \times 10^6$ states) representing symbolic manipulation, we demonstrate holonomic generalization: the topological model maintains perfect fidelity extrapolating $100\times$ beyond training ($L=50 \to 5000$), consistent with a theoretically indefinite causal horizon, whereas Transformers lose logical coherence. Ablation studies indicate this protection emerges strictly from non-Abelian gauge symmetry. This provides strong evidence for a new universality class for logical reasoning, linking causal stability to the topology of the semantic manifold.

[496] Optimal Lower Bounds for Online Multicalibration

Natalie Collina, Jiuyao Lu, Georgy Noarov, Aaron Roth

Main category: cs.LG

TL;DR: The paper proves tight Ω(T^{2/3}) lower bounds for online multicalibration, establishing an information-theoretic separation from marginal calibration which has O(T^{2/3-ε}) upper bounds.

Details

Motivation: To understand the fundamental limitations of online multicalibration algorithms and establish whether multicalibration is inherently more difficult than marginal calibration, particularly in terms of achievable error rates in online learning settings.

Method: The authors use information-theoretic techniques to prove lower bounds. For general group functions (depending on both context and predictions), they prove Ω(T^{2/3}) lower bound using just three disjoint binary groups. For context-dependent (but prediction-independent) group functions, they construct a Θ(T)-sized group family using orthogonal function systems to prove Ω̃(T^{2/3}) lower bound.

Result: 1) Ω(T^{2/3}) lower bound for general group functions with three disjoint binary groups, matching Noarov et al. (2025) upper bounds up to log factors. 2) Ω̃(T^{2/3}) lower bound for context-dependent group functions with Θ(T)-sized group family, again matching upper bounds. 3) Separation from marginal calibration which has O(T^{2/3-ε}) upper bounds (Dagan et al., 2025).

Conclusion: Online multicalibration is fundamentally harder than marginal calibration, requiring Ω(T^{2/3}) error rate compared to O(T^{2/3-ε}) for marginal calibration. The lower bounds are tight up to logarithmic factors, establishing the optimal error rate for online multicalibration problems.

Abstract: We prove tight lower bounds for online multicalibration, establishing an information-theoretic separation from marginal calibration. In the general setting where group functions can depend on both context and the learner’s predictions, we prove an $Ω(T^{2/3})$ lower bound on expected multicalibration error using just three disjoint binary groups. This matches the upper bounds of Noarov et al. (2025) up to logarithmic factors and exceeds the $O(T^{2/3-\varepsilon})$ upper bound for marginal calibration (Dagan et al., 2025), thereby separating the two problems. We then turn to lower bounds for the more difficult case of group functions that may depend on context but not on the learner’s predictions. In this case, we establish an $\widetildeΩ(T^{2/3})$ lower bound for online multicalibration via a $Θ(T)$-sized group family constructed using orthogonal function systems, again matching upper bounds up to logarithmic factors.

[497] Convergence of Sign-based Random Reshuffling Algorithms for Nonconvex Optimization

Zhen Qin, Zhishuai Liu, Pan Xu

Main category: cs.LG

TL;DR: This paper provides the first theoretical analysis of signSGD with random reshuffling (SignRR), identifies its limitations, and proposes two improved algorithms (SignRVR and SignRVM) with better convergence rates.

Details

Motivation: There's a gap between theory and practice: existing signSGD analyses assume data sampling with replacement, but practical implementations use random reshuffling. The theoretical understanding of signSGD with random reshuffling (SignRR) remains largely unexplored.

Method: The paper first analyzes SignRR to identify its limitations, then develops two new algorithms: SignRVR (incorporates variance-reduced gradients) and SignRVM (integrates momentum-based updates). Both algorithms are extended to distributed settings.

Result: SignRR has convergence rate O(log(nT)/√(nT) + σ) where σ may not vanish. SignRVR and SignRVM achieve faster rate O(log(nT)/√(nT) + log(nT)√n/√T). Distributed versions achieve O(log(n₀T)/√(n₀T) + log(n₀T)√n₀/√T). Experiments show these methods match or surpass baselines.

Conclusion: This work provides the first theoretical understanding of practical sign-based optimization with random reshuffling, develops improved algorithms with better convergence rates, and validates the findings experimentally, bridging the theory-practice gap.

Abstract: signSGD is popular in nonconvex optimization due to its communication efficiency. Yet, existing analyses typically assume data are sampled with replacement in each iteration, contradicting a common practical implementation where data are randomly reshuffled and sequentially fed into the algorithm. This gap leaves the theoretical understanding of the more practical algorithm, signSGD with random reshuffling (SignRR), largely unexplored. We develop the first analysis of SignRR to identify the core technical challenge that prevents a thorough convergence analysis of this method. In particular, given a dataset of size $n$ and $T$ epochs, we show that the expected gradient norm of SignRR is upper bounded by $O(\log(nT)/\sqrt{nT} + σ)$, where $σ$ is the averaged conditional mean square error that may not vanish. To tackle this limitation, we develop two new sign-based algorithms under random reshuffling: SignRVR, which incorporates variance-reduced gradients, and SignRVM, which integrates momentum-based updates. Both algorithms achieve a faster convergence rate of ${O}(\log(nT)/\sqrt{nT} +\log(nT)\sqrt{n}/\sqrt{T})$. We further extend our algorithms to a distributed setting, with a convergence rate of ${O}(\log(n_0T)/\sqrt{n_0T} +\log (n_0T)\sqrt{n_0}/\sqrt{T})$, where $n_0$ is the size of the dataset of a single machine. These results mark the first step towards the theoretical understanding of practical implementation of sign-based optimization algorithms. Finally, we back up our theoretical findings through experiments on simulated and real-world problems, verifying that randomly reshuffled sign methods match or surpass existing baselines.

[498] Topology-Informed Graph Transformer

Yun Young Choi, Sun Woo Park, Minho Lee, Youngho Woo

Main category: cs.LG

TL;DR: TIGT is a novel graph transformer that enhances discriminative power for graph isomorphism detection and overall performance through topological positional embeddings, dual-path message passing, global attention, and graph information layers.

Details

Motivation: Transformers have shown success in NLP and Vision, but integrating them with GNNs faces challenges in distinguishing graph isomorphisms, which is crucial for predictive performance in graph tasks.

Method: TIGT uses four key components: 1) topological positional embedding layer using non-isomorphic universal covers based on cyclic subgraphs, 2) dual-path message-passing layer for explicit topological encoding, 3) global attention mechanism, and 4) graph information layer for feature recalibration.

Result: TIGT outperforms previous Graph Transformers in classifying synthetic datasets for distinguishing graph isomorphism classes and shows competitive edge over state-of-the-art Graph Transformers across various benchmark datasets.

Conclusion: TIGT successfully addresses the challenge of enhancing discriminative power for graph isomorphism detection while improving overall Graph Transformer performance through topological-informed design.

Abstract: Transformers have revolutionized performance in Natural Language Processing and Vision, paving the way for their integration with Graph Neural Networks (GNNs). One key challenge in enhancing graph transformers is strengthening the discriminative power of distinguishing isomorphisms of graphs, which plays a crucial role in boosting their predictive performances. To address this challenge, we introduce ‘Topology-Informed Graph Transformer (TIGT)’, a novel transformer enhancing both discriminative power in detecting graph isomorphisms and the overall performance of Graph Transformers. TIGT consists of four components: A topological positional embedding layer using non-isomorphic universal covers based on cyclic subgraphs of graphs to ensure unique graph representation: A dual-path message-passing layer to explicitly encode topological characteristics throughout the encoder layers: A global attention mechanism: And a graph information layer to recalibrate channel-wise graph features for better feature representation. TIGT outperforms previous Graph Transformers in classifying synthetic dataset aimed at distinguishing isomorphism classes of graphs. Additionally, mathematical analysis and empirical evaluations highlight our model’s competitive edge over state-of-the-art Graph Transformers across various benchmark datasets.

[499] GRAPHGINI: Fostering Individual and Group Fairness in Graph Neural Networks

Anuj Kumar Sirohi, Anjali Gupta, Sandeep Kumar, Amitabha Bagchi, Sayan Ranu

Main category: cs.LG

TL;DR: GraphGini: A novel GNN fairness approach using Gini coefficient to improve individual and group fairness while maintaining utility.

Details

Motivation: GNNs are increasingly used in high-stakes decision-making systems but can generate unfair decisions for underprivileged groups when lacking fairness constraints, raising concerns about algorithmic bias.

Method: Introduces GraphGini, which incorporates the Gini coefficient into GNN framework to enhance fairness, establishes its superiority over Lipschitz constant methods, and uses Nash social welfare program to ensure Pareto optimal distribution of group fairness.

Result: Extensive experiments on real-world datasets show GraphGini significantly improves individual fairness compared to state-of-the-art methods while maintaining utility and group fairness.

Conclusion: GraphGini provides a robust approach to address fairness concerns in GNNs, offering better individual fairness through Gini coefficient while ensuring Pareto optimal group fairness distribution.

Abstract: Graph Neural Networks (GNNs) have demonstrated impressive performance across various tasks, leading to their increased adoption in high-stakes decision-making systems. However, concerns have arisen about GNNs potentially generating unfair decisions for underprivileged groups or individuals when lacking fairness constraints. This work addresses this issue by introducing GraphGini, a novel approach that incorporates the Gini coefficient to enhance both individual and group fairness within the GNN framework. We rigorously establish that the Gini coefficient offers greater robustness and promotes equal opportunity among GNN outcomes, advantages not afforded by the prevailing Lipschitz constant methodology. Additionally, we employ the Nash social welfare program to ensure our solution yields a Pareto optimal distribution of group fairness. Extensive experimentation on real-world datasets demonstrates GraphGini’s efficacy in significantly improving individual fairness compared to state-of-the-art methods while maintaining utility and group fairness.

[500] A Counterfactual Analysis of the Dishonest Casino

Martin Haugh, Raghav Singal

Main category: cs.LG

TL;DR: This paper develops linear programming bounds for counterfactual causal effects in hidden Markov models, specifically quantifying how much of a gambler’s winnings are caused by a casino’s cheating in the dishonest casino HMM.

Details

Motivation: The dishonest casino HMM is commonly used in education, but traditional methods only recover latent states. The authors want to answer a counterfactual causal question: how much winnings are actually caused by cheating? This bridges HMMs with causal inference.

Method: Define structural causal models consistent with the HMM, introduce Expected Winnings Attributable to Cheating (EWAC), develop linear programming bounds for this partially identifiable quantity, incorporate domain knowledge through linear constraints, and analyze asymptotic identifiability.

Result: EWAC bounds are derived via linear programs. Time homogeneity yields tighter bounds, while relaxing it produces looser bounds with explicit solutions. Domain knowledge constraints improve bounds. Time-averaged EWAC becomes fully identifiable asymptotically.

Conclusion: This is the first work to develop LP bounds for counterfactuals in HMM settings, providing educational tools for teaching counterfactual inference while bridging HMMs and causal inference methodologies.

Abstract: The dishonest casino is a well-known hidden Markov model (HMM) often used in education to introduce HMMs and graphical models. A sequence of die rolls is observed with the casino switching between a fair and a loaded die. Instead of recovering the latent regime through filtering, smoothing, or the Viterbi algorithm, we ask a counterfactual question: how much of the gambler’s winnings are caused by the casino’s cheating? We introduce a class of structural causal models (SCMs) consistent with the HMM and define the expected winnings attributable to cheating (EWAC). Because EWAC is only partially identifiable, we bound it via linear programs (LPs). Numerical experiments help to develop intuition using benchmark SCMs based on independence, comonotonic, and countermonotonic copulas. Imposing a time homogeneity condition on the SCM yields tighter bounds, whereas relaxing it produces looser bounds that admit an explicit LP solution. Domain knowledge such as pathwise monotonicity or counterfactual stability can be incorporated through additional linear constraints. Finally, we show the time-averaged EWAC becomes fully identifiable as the number of time periods tends to infinity. Our work is the first to develop LP bounds for counterfactuals in an HMM setting, benefiting educational contexts where counterfactual inference is taught.

[501] What Should Embeddings Embed? Autoregressive Models Represent Latent Generating Distributions

Liyi Zhang, Michael Y. Li, R. Thomas McCoy, Theodore R. Sumers, Jian-Qiao Zhu, Thomas L. Griffiths

Main category: cs.LG

TL;DR: The paper connects autoregressive prediction to constructing predictive sufficient statistics, identifies three settings where optimal embeddings can be defined, and shows transformers encode these latent distributions.

Details

Motivation: To understand what embeddings from autoregressive language models should represent, connecting the prediction objective to constructing predictive sufficient statistics that summarize sequence information.

Method: Theoretical connection between autoregressive prediction and predictive sufficient statistics, identifying three optimal embedding settings: IID data (sufficient statistics), latent state models (posterior over states), and discrete hypothesis spaces (posterior over hypotheses). Empirical probing studies with transformers.

Result: Transformers encode these three kinds of latent generating distributions, perform well in out-of-distribution cases, and avoid token memorization in these settings.

Conclusion: Autoregressive language models learn embeddings that capture optimal predictive sufficient statistics corresponding to different latent generating distributions, explaining their ability to extract latent structure from text.

Abstract: Autoregressive language models have demonstrated a remarkable ability to extract latent structure from text. The embeddings from large language models have been shown to capture aspects of the syntax and semantics of language. But what should embeddings represent? We connect the autoregressive prediction objective to the idea of constructing predictive sufficient statistics to summarize the information contained in a sequence of observations, and use this connection to identify three settings where the optimal content of embeddings can be identified: independent identically distributed data, where the embedding should capture the sufficient statistics of the data; latent state models, where the embedding should encode the posterior distribution over states given the data; and discrete hypothesis spaces, where the embedding should reflect the posterior distribution over hypotheses given the data. We then conduct empirical probing studies to show that transformers encode these three kinds of latent generating distributions, and that they perform well in out-of-distribution cases and without token memorization in these settings.

[502] Federated Clustering: An Unsupervised Cluster-Wise Training for Decentralized Data Distributions

Mirko Nardi, Lorenzo Valerio, Andrea Passarella

Main category: cs.LG

TL;DR: FedCRef is an unsupervised federated learning method that discovers all underlying data distributions across decentralized clients without labels, handling multi-cluster-per-client scenarios without prior knowledge of cluster counts.

Details

Motivation: Federated Learning has been well-studied for supervised learning but remains underdeveloped for unsupervised contexts. There's a need for methods that can discover all data distributions in decentralized systems without requiring labels, especially when clients have heterogeneous, non-uniform data distributions and lack centralized coordination.

Method: FedCRef combines local clustering, model exchange with reconstruction error analysis, and collaborative refinement within federated groups of similar distributions. Clients iteratively refine data partitions while discovering distinct distributions, without assuming one-cluster-per-client or requiring prior knowledge of cluster numbers.

Result: Extensive evaluations on EMNIST, KMNIST, Fashion-MNIST, and KMNIST49 datasets show FedCRef successfully identifies true global data distributions with average local accuracy up to 95%. The method is robust to noisy conditions, scalable, and lightweight for resource-constrained edge devices.

Conclusion: FedCRef provides an effective solution for unsupervised federated learning that generalizes to multi-cluster-per-client scenarios, enabling discovery of all underlying data distributions across decentralized systems without labels or prior knowledge of cluster counts.

Abstract: Federated Learning (FL) enables decentralized machine learning while preserving data privacy, making it ideal for sensitive applications where data cannot be shared. While FL has been widely studied in supervised contexts, its application to unsupervised learning remains underdeveloped. This work introduces FedCRef, a novel unsupervised federated learning method designed to uncover all underlying data distributions across decentralized clients without requiring labels. This task, known as Federated Clustering, presents challenges due to heterogeneous, non-uniform data distributions and the lack of centralized coordination. Unlike previous methods that assume a one-cluster-per-client setup or require prior knowledge of the number of clusters, FedCRef generalizes to multi-cluster-per-client scenarios. Clients iteratively refine their data partitions while discovering all distinct distributions in the system. The process combines local clustering, model exchange and evaluation via reconstruction error analysis, and collaborative refinement within federated groups of similar distributions to enhance clustering accuracy. Extensive evaluations on four public datasets (EMNIST, KMNIST, Fashion-MNIST and KMNIST49) show that FedCRef successfully identifies true global data distributions, achieving an average local accuracy of up to 95%. The method is also robust to noisy conditions, scalable, and lightweight, making it suitable for resource-constrained edge devices.

[503] Practical Aspects on Solving Differential Equations Using Deep Learning: A Primer

Georgios Is. Detorakis

Main category: cs.LG

TL;DR: A primer introducing deep learning basics and the Deep Galerkin Method for solving differential equations, with practical implementation examples including 1D heat equation, ODE systems, and integral equations.

Details

Motivation: To provide accessible technical and practical insights into applying deep learning methods, specifically the Deep Galerkin Method, for solving differential equations, making the approach available to researchers without specialized hardware.

Method: Introduces the Deep Galerkin Method which uses deep neural networks to solve differential equations. Provides step-by-step implementation guidance for solving the 1D heat equation, systems of ordinary differential equations, and integral equations like Fredholm equations of the second kind.

Result: Provides complete working examples and code snippets that can run on simple computers without GPU requirements, with full source code available on Github for practical implementation.

Conclusion: The paper serves as an accessible primer that demonstrates the practical application of deep learning methods to solve various types of differential equations, lowering the barrier to entry for researchers interested in these computational approaches.

Abstract: Deep learning has become a popular tool across many scientific fields, including the study of differential equations, particularly partial differential equations. This work introduces the basic principles of deep learning and the Deep Galerkin method, which uses deep neural networks to solve differential equations. This primer aims to provide technical and practical insights into the Deep Galerkin method and its implementation. We demonstrate how to solve the one-dimensional heat equation step-by-step. We also show how to apply the Deep Galerkin method to solve systems of ordinary differential equations and integral equations, such as the Fredholm of the second kind. Additionally, we provide code snippets within the text and the complete source code on Github. The examples are designed so that one can run them on a simple computer without needing a GPU.

[504] BraVE: Offline Reinforcement Learning for Discrete Combinatorial Action Spaces

Matthew Landers, Taylor W. Killian, Hugo Barnes, Thomas Hartvigsen, Afsaneh Doryab

Main category: cs.LG

TL;DR: BraVE is a new offline RL method that uses tree-structured action traversal to efficiently handle high-dimensional discrete action spaces, outperforming prior methods by up to 20x in environments with over 4 million actions.

Details

Motivation: Offline RL in high-dimensional discrete action spaces is challenging due to exponential scaling of joint action space with sub-actions and difficulty modeling sub-action dependencies. Existing methods are either computationally infeasible or fail to represent joint sub-action effects.

Method: Branch Value Estimation (BraVE) uses tree-structured action traversal to evaluate a linear number of joint actions while preserving dependency structure between sub-actions.

Result: BraVE outperforms prior offline RL methods by up to 20x in environments with over four million actions, demonstrating superior computational efficiency and performance.

Conclusion: BraVE provides an effective solution for offline RL in high-dimensional discrete action spaces by combining computational efficiency with proper modeling of sub-action dependencies through tree-structured traversal.

Abstract: Offline reinforcement learning in high-dimensional, discrete action spaces is challenging due to the exponential scaling of the joint action space with the number of sub-actions and the complexity of modeling sub-action dependencies. Existing methods either exhaustively evaluate the action space, making them computationally infeasible, or factorize Q-values, failing to represent joint sub-action effects. We propose Branch Value Estimation (BraVE), a value-based method that uses tree-structured action traversal to evaluate a linear number of joint actions while preserving dependency structure. BraVE outperforms prior offline RL methods by up to $20\times$ in environments with over four million actions.

[505] $π_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, Ury Zhilinsky

Main category: cs.LG

TL;DR: The paper proposes a generalist robot policy using flow matching architecture built on pre-trained vision-language models to address data, generalization, and robustness challenges in robot learning.

Details

Motivation: Robot learning faces major obstacles in data, generalization, and robustness for real-world applications. Generalist robot policies (robot foundation models) can address these challenges by leveraging large-scale knowledge.

Method: Novel flow matching architecture built on top of pre-trained vision-language models to inherit Internet-scale semantic knowledge, trained on diverse datasets from multiple dexterous robot platforms (single-arm, dual-arm, mobile manipulators).

Result: Evaluated on zero-shot task performance after pre-training, language instruction following from people and VLM policies, and skill acquisition via fine-tuning across diverse tasks like laundry folding, table cleaning, and box assembly.

Conclusion: Generalist robot policies using foundation model approaches show promise for addressing core challenges in robot learning and enabling flexible, dexterous real-world robot systems.

Abstract: Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

[506] Human-in-the-Loop Feature Selection Using Interpretable Kolmogorov-Arnold Network-based Double Deep Q-Network

Md Abrar Jahin, M. F. Mridha, Nilanjan Dey, Md. Jakir Hossen

Main category: cs.LG

TL;DR: HITL feature selection framework using KAN-DDQN with Beta sampling achieves 93% accuracy on MNIST, 83% on FashionMNIST, outperforming MLP-DDQN by 9% while using 4x fewer neurons and providing interpretable symbolic representations.

Details

Motivation: Feature selection is crucial for performance and interpretability in high-dimensional spaces. Existing static approaches lack adaptability, and dynamic per-instance feature selection with model-specific interpretability in RL remains underexplored.

Method: Human-in-the-loop feature selection framework integrated into Double Deep Q-Network using Kolmogorov-Arnold Network. Leverages simulated human feedback and Beta distribution-based sampling to iteratively refine feature subsets per data instance.

Result: KAN-DDQN achieved 93% test accuracy on MNIST and 83% on FashionMNIST, outperforming MLP-DDQN by up to 9%. Used 4x fewer neurons than MLPs. Without feature selection, models only achieved 58% (MNIST) and 64% (FashionMNIST). Scalable to CIFAR datasets with 30% relative macro F1 improvement on MNIST and 5% on CIFAR-10, reducing calibration error by 25%. Real-time feasible with <1ms latency and <0.02M parameters.

Conclusion: Proposed framework provides scalable, interpretable solution for feature selection suitable for real-time, adaptive decision-making with minimal human oversight, combining dynamic feature selection with model transparency through symbolic representations.

Abstract: Feature selection is critical for improving the performance and interpretability of machine learning models, particularly in high-dimensional spaces where complex feature interactions can reduce accuracy and increase computational demands. Existing approaches often rely on static feature subsets or manual intervention, limiting adaptability and scalability. However, dynamic, per-instance feature selection methods and model-specific interpretability in reinforcement learning remain underexplored. This study proposes a human-in-the-loop (HITL) feature selection framework integrated into a Double Deep Q-Network (DDQN) using a Kolmogorov-Arnold Network (KAN). Our novel approach leverages simulated human feedback and stochastic distribution-based sampling, specifically Beta, to iteratively refine feature subsets per data instance, improving flexibility in feature selection. The KAN-DDQN achieved notable test accuracies of 93% on MNIST and 83% on FashionMNIST, outperforming conventional MLP-DDQN models by up to 9%. The KAN-based model provided high interpretability via symbolic representation while using 4 times fewer neurons in the hidden layer than MLPs did. Comparatively, the models without feature selection achieved test accuracies of only 58% on MNIST and 64% on FashionMNIST, highlighting significant gains with our framework. We further validate scalability on CIFAR-10 and CIFAR-100, achieving up to 30% relative macro F1 improvement on MNIST and 5% on CIFAR-10, while reducing calibration error by 25%. Complexity analysis confirms real-time feasibility with latency below 1 ms and parameter counts under 0.02M. Pruning and visualization further enhanced model transparency by elucidating decision pathways. These findings present a scalable, interpretable solution for feature selection that is suitable for applications requiring real-time, adaptive decision-making with minimal human oversight.

[507] Graph-Dictionary Signal Model for Sparse Representations of Multivariate Data

William Cappelletti, Pascal Frossard

Main category: cs.LG

TL;DR: A novel Graph-Dictionary signal model that represents multivariate data relationships as sparse combinations of elementary graph atoms, with a framework to learn these dictionaries from observed signals.

Details

Motivation: Current methods lack models to infer underlying graph structures from multivariate data. There's a need to capture complex relational information as sparse sums of simpler graph structures, bridging sparse representations with structured decomposition of sample-varying relationships.

Method: Proposes a Graph-Dictionary signal model where relationships are characterized by filters on weighted sums of graph Laplacians. Introduces a framework to infer graph dictionary representations from node signals, incorporating prior knowledge about signal properties and underlying graphs. Uses a bilinear generalization of primal-dual splitting algorithm for optimization.

Result: The method successfully reconstructs graphs from signals in synthetic settings, outperforming popular baselines. In a motor imagery decoding task on brain activity data, it achieves better classification of imagined motion than standard methods using fewer features.

Conclusion: The graph-dictionary model effectively bridges sparse representations of multivariate data with structured decomposition of relationships into elementary graph atoms, providing a powerful framework for analyzing complex relational data.

Abstract: Representing and exploiting multivariate signals requires capturing relations between variables, which we can represent by graphs. Graph dictionaries allow to describe complex relational information as a sparse sum of simpler structures, but no prior model exists to infer such underlying structure elements from data. We define a novel Graph-Dictionary signal model, where a finite set of graphs characterizes relationships in data distribution as filters on the weighted sum of their Laplacians. We propose a framework to infer the graph dictionary representation from observed node signals, which allows to include a priori knowledge about signal properties, and about underlying graphs and their coefficients. We introduce a bilinear generalization of the primal-dual splitting algorithm to solve the learning problem. We show the capability of our method to reconstruct graphs from signals in multiple synthetic settings, where our model outperforms popular baselines. Then, we exploit graph-dictionary representations in an illustrative motor imagery decoding task on brain activity data, where we classify imagined motion better than standard methods relying on many more features. Our graph-dictionary model bridges a gap between sparse representations of multivariate data and a structured decomposition of sample-varying relationships into a sparse combination of elementary graph atoms.

[508] Meta-Learning Objectives for Preference Optimization

Carlo Alfano, Silvia Sapora, Jakob Nicolaus Foerster, Patrick Rebeschini, Yee Whye Teh

Main category: cs.LG

TL;DR: Researchers propose using MuJoCo tasks as a cheaper diagnostic benchmark for preference optimization algorithms, develop Mirror Preference Optimization (MPO) using evolutionary strategies, and show discovered algorithms outperform existing methods in both MuJoCo and LLM alignment tasks.

Details

Motivation: Evaluating preference optimization algorithms on LLM alignment is expensive, noisy, and involves many variables (model size, hyperparameters). The authors want a simpler, more controlled benchmark to gain insights into PO algorithm efficacy.

Method: 1) Design diagnostic suite of MuJoCo tasks and datasets for systematic PO evaluation; 2) Propose Mirror Preference Optimization (MPO) family based on mirror descent; 3) Use evolutionary strategies to discover algorithms specialized for specific dataset properties (mixed-quality, noisy data).

Result: 1) Discovered PO algorithms outperform all known algorithms in targeted MuJoCo settings; 2) Based on MuJoCo insights, designed PO algorithm that significantly outperforms existing baselines in LLM alignment task.

Conclusion: Simpler benchmarks like MuJoCo can provide valuable insights for preference optimization algorithm development, and evolutionary search in the MPO family can discover specialized algorithms that transfer effectively to complex tasks like LLM alignment.

Abstract: Evaluating preference optimization (PO) algorithms on LLM alignment is a challenging task that presents prohibitive costs, noise, and several variables like model size and hyper-parameters. In this work, we show that it is possible to gain insights on the efficacy of PO algorithm on simpler benchmarks. We design a diagnostic suite of MuJoCo tasks and datasets, which we use to systematically evaluate PO algorithms, establishing a more controlled and cheaper benchmark. We then propose a novel family of PO algorithms based on mirror descent, which we call Mirror Preference Optimization (MPO). Through evolutionary strategies, we search this class to discover algorithms specialized to specific properties of preference datasets, such as mixed-quality or noisy data. We demonstrate that our discovered PO algorithms outperform all known algorithms in the targeted MuJoCo settings. Finally, based on the insights gained from our MuJoCo experiments, we design a PO algorithm that significantly outperform existing baselines in an LLM alignment task.

[509] Mirror Descent Actor Critic via Bounded Advantage Learning

Ryo Iwaki

Main category: cs.LG

TL;DR: MDAC is an actor-critic version of Mirror Descent Value Iteration for continuous action domains that boosts performance by bounding actor’s log-density terms in critic’s loss, outperforming entropy-only regularization methods.

Details

Motivation: KL-entropy-regularized methods like MDVI work well in discrete action domains but don't surpass entropy-only regularization in continuous action domains, creating a performance gap that needs addressing.

Method: Propose Mirror Descent Actor Critic (MDAC) - actor-critic instantiation of MDVI for continuous actions. Key innovation: bounding actor’s log-density terms in critic’s loss function. Also explore effective bounding functions and relate MDAC to Advantage Learning.

Result: MDAC with bounded log-density terms significantly boosts empirical performance compared to naive unbounded instantiation. With appropriate bounding functions, MDAC outperforms strong non-regularized and entropy-only-regularized methods in continuous action domains.

Conclusion: Bounding advantage terms (actor’s log-probability) is validated and beneficial for continuous action RL, making KL-entropy regularization competitive with entropy-only methods through proper implementation in actor-critic framework.

Abstract: Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical guarantees, the performance of KL-entropy-regularized methods does not surpass that of a strong entropy-only-regularized method in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the actor’s log-density terms in the critic’s loss function, compared to a non-bounded naive instantiation. Further, we relate MDAC to Advantage Learning by recalling that the actor’s log-probability is equal to the regularized advantage function in tabular cases, and theoretically discuss when and why bounding the advantage terms is validated and beneficial. We also empirically explore effective choices for the bounding functions, and show that MDAC performs better than strong non-regularized and entropy-only-regularized methods with an appropriate choice of the bounding functions.

[510] Reward Shaping to Mitigate Reward Hacking in RLHF

Jiayi Fu, Xuandong Zhao, Chengyuan Yao, Heng Wang, Qi Han, Yanghua Xiao

Main category: cs.LG

TL;DR: PAR (Preference As Reward) is a novel reward shaping method for RLHF that uses latent preferences from reward models to prevent reward hacking and improve alignment in LLMs.

Details

Motivation: RLHF is vulnerable to reward hacking where models exploit flaws in reward functions instead of learning intended behaviors. Existing reward shaping methods lack systematic investigation and design principles.

Method: Proposes PAR which leverages latent preferences embedded within reward models as RL signals. Based on two design principles: RL rewards should be bounded, and should have rapid initial growth followed by gradual convergence.

Result: PAR outperforms other reward shaping methods on Gemma2-2B with Ultrafeedback-Binarized and HH-RLHF datasets. Achieves at least 5% higher win rate on AlpacaEval 2.0, shows remarkable data efficiency (single reference reward needed), and maintains robustness against reward hacking even after two full training epochs.

Conclusion: PAR effectively addresses reward hacking in RLHF through principled reward shaping, offering superior performance, data efficiency, and robustness compared to existing methods.

Abstract: Reinforcement Learning from Human Feedback (RLHF) is essential for aligning large language models (LLMs) with human values. However, RLHF is susceptible to \emph{reward hacking}, where the agent exploits flaws in the reward function rather than learning the intended behavior, thus degrading alignment. Although reward shaping helps stabilize RLHF and partially mitigate reward hacking, a systematic investigation into shaping techniques and their underlying principles remains lacking. To bridge this gap, we present a comprehensive study of the prevalent reward shaping methods. Our analysis suggests two key design principles: (1) the RL reward should be bounded, and (2) the RL reward benefits from rapid initial growth followed by gradual convergence. Guided by these insights, we propose Preference As Reward (PAR), a novel approach that leverages the latent preferences embedded within the reward model as the signal for reinforcement learning. Moreover, PAR exhibits two critical variance-reduction properties that contribute to stabilizing the RLHF training process and effectively extending the tolerance window for early stopping. We evaluated PAR on the base model Gemma2-2B using two datasets, Ultrafeedback-Binarized and HH-RLHF. Experimental results demonstrate PAR’s superior performance over other reward shaping methods. On the AlpacaEval 2.0 benchmark, PAR achieves a win rate of at least 5 percentage points higher than competing approaches. Furthermore, PAR exhibits remarkable data efficiency, requiring only a single reference reward for optimal performance, and maintains robustness against reward hacking even after two full epochs of training. The code is available at https://github.com/PorUna-byte/PAR.

[511] From Actions to Words: Towards Abstractive-Textual Policy Summarization in RL

Sahar Admoni, Assaf Hallak, Yftah Ziser, Omer Ben-Porat, Ofra Amir

Main category: cs.LG

TL;DR: SySLLM uses LLMs to generate textual summaries of RL agent behavior from trajectory data, providing better explanations than demonstration-based methods.

Details

Motivation: Current RL explanation methods rely on curated demonstrations that only show local behaviors and don't reveal global strategies, making it hard for humans to understand agent intent and decision-making patterns.

Method: SySLLM converts spatiotemporal trajectories into structured text and prompts LLMs to generate coherent summaries describing agent goals, exploration style, and decision patterns without task-specific fine-tuning.

Result: Expert evaluations show strong alignment with human analyses, and 75.5% of participants in a large-scale user study preferred SySLLM summaries over state-of-the-art demonstration-based explanations.

Conclusion: Abstractive textual summarization using LLMs represents a promising new paradigm for interpreting complex reinforcement learning behavior.

Abstract: Explaining reinforcement learning agents is challenging because policies emerge from complex reward structures and neural representations that are difficult for humans to interpret. Existing approaches often rely on curated demonstrations that expose local behaviors but provide limited insight into an agent’s global strategy, leaving users to infer intent from raw observations. We propose SySLLM (Synthesized Summary using Large Language Models), a framework that reframes policy interpretation as a language-generation problem. Instead of visual demonstrations, SySLLM converts spatiotemporal trajectories into structured text and prompts an LLM to generate coherent summaries describing the agent’s goals, exploration style, and decision patterns. SySLLM scales to long-horizon, semantically rich environments without task-specific fine-tuning, leveraging LLM world knowledge and compositional reasoning to capture latent behavioral structure across policies. Expert evaluations show strong alignment with human analyses, and a large-scale user study found that 75.5% of participants preferred SySLLM summaries over state-of-the-art demonstration-based explanations. Together, these results position abstractive textual summarization as a paradigm for interpreting complex RL behavior.

[512] Enabling Weak Client Participation via On-device Knowledge Distillation in Heterogeneous Federated Learning

Jihyun Lim, Junhyuk Jo, Tuo Zhang, Sunwoo Lee

Main category: cs.LG

TL;DR: On-device KD-based heterogeneous FL method that uses auxiliary models and selective knowledge transfer to address limitations of server-side logit ensemble approaches in non-IID data scenarios.

Details

Motivation: Existing online KD methods in FL assume centralized unlabeled data and use logit ensemble approaches that degrade soft target quality with non-IID data, plus they don't effectively utilize heterogeneous edge device resources.

Method: Proposes on-device KD-based heterogeneous FL: 1) small auxiliary models learn from labeled local data, 2) subset of resource-rich clients transfer knowledge to large models via on-device KD using unlabeled data, enabling efficient resource utilization.

Result: Extensive experiments show the method effectively utilizes all edge device system resources and unlabeled data, achieving higher accuracy than state-of-the-art KD-based FL methods.

Conclusion: The proposed on-device KD-based heterogeneous FL method overcomes critical limitations of existing approaches by enabling efficient knowledge transfer without centralized data assumptions and better handling non-IID data distributions.

Abstract: Online Knowledge Distillation (KD) is recently highlighted to train large models in Federated Learning (FL) environments. Many existing studies adopt the logit ensemble method to perform KD on the server side. However, they often assume that unlabeled data collected at the edge is centralized on the server. Moreover, the logit ensemble method personalizes local models, which can degrade the quality of soft targets, especially when data is highly non-IID. To address these critical limitations,we propose a novel on-device KD-based heterogeneous FL method. Our approach leverages a small auxiliary model to learn from labeled local data. Subsequently, a subset of clients with strong system resources transfers knowledge to a large model through on-device KD using their unlabeled data. Our extensive experiments demonstrate that our on-device KD-based heterogeneous FL method effectively utilizes the system resources of all edge devices as well as the unlabeled data, resulting in higher accuracy compared to SOTA KD-based FL methods.

[513] Improving Bayesian Optimization for Portfolio Management with an Adaptive Scheduling

Zinuo You, John Cartlidge, Karen Elliott, Menghan Ge, Daniel Gold

Main category: cs.LG

TL;DR: TPE-AS: A Bayesian optimization framework for stable and sample-efficient optimization of black-box portfolio management systems under limited evaluation budgets.

Details

Motivation: Black-box portfolio management systems are widely used but their performance fluctuates with market regimes. Evaluating them is computationally expensive due to limited observation budgets, creating a need for stable and sample-efficient optimization methods.

Method: Proposes TPE-AS framework with a weighted Lagrangian estimator using adaptive schedule and importance sampling. It dynamically balances exploration (maximizing model performance) and exploitation (minimizing variance of observations) to guide search from broad exploration to stable regions.

Result: Extensive experiments across four backtest settings with three distinct black-box portfolio management models demonstrate the effectiveness of the proposed method compared to baseline configurations.

Conclusion: TPE-AS provides a stable and efficient Bayesian optimization framework for black-box portfolio systems under limited evaluation budgets, addressing the critical challenge of sample-efficient optimization in financial applications.

Abstract: Existing black-box portfolio management systems are prevalent in the financial industry due to commercial and safety constraints, though their performance can fluctuate dramatically with changing market regimes. Evaluating these non-transparent systems is computationally expensive, as fixed budgets limit the number of possible observations. Therefore, achieving stable and sample-efficient optimization for these systems has become a critical challenge. This work presents a novel Bayesian optimization framework (TPE-AS) that improves search stability and efficiency for black-box portfolio models under these limited observation budgets. Standard Bayesian optimization, which solely maximizes expected return, can yield erratic search trajectories and misalign the surrogate model with the true objective, thereby wasting the limited evaluation budget. To mitigate these issues, we propose a weighted Lagrangian estimator that leverages an adaptive schedule and importance sampling. This estimator dynamically balances exploration and exploitation by incorporating both the maximization of model performance and the minimization of the variance of model observations. It guides the search from broad, performance-seeking exploration towards stable and desirable regions as the optimization progresses. Extensive experiments and ablation studies, which establish our proposed method as the primary approach and other configurations as baselines, demonstrate its effectiveness across four backtest settings with three distinct black-box portfolio management models.

[514] SAINT: Attention-Based Policies for Discrete Combinatorial Action Spaces

Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab

Main category: cs.LG

TL;DR: SAINT introduces a Transformer-based policy architecture for combinatorial action spaces that models sub-actions as unordered sets with self-attention, outperforming baselines in 20 environments with up to 17 million joint actions.

Details

Motivation: Combinatorial action spaces in real-world problems lead to exponential growth in possible actions, limiting conventional RL algorithms. Existing approaches impose restrictive factorized or sequential structures that fail to capture complex joint behavior between sub-actions.

Method: SAINT (Sub-Action Interaction Network using Transformers) represents multi-component actions as unordered sets and models their dependencies via self-attention conditioned on the global state. The architecture is permutation-invariant and compatible with standard policy optimization algorithms.

Result: SAINT consistently outperforms strong baselines across 20 distinct combinatorial environments spanning three task domains, including environments with nearly 17 million joint actions.

Conclusion: SAINT provides an effective, sample-efficient solution for combinatorial action spaces by capturing complex joint behavior through Transformer-based modeling of sub-action interactions, overcoming limitations of previous approaches.

Abstract: The combinatorial structure of many real-world action spaces leads to exponential growth in the number of possible actions, limiting the effectiveness of conventional reinforcement learning algorithms. Recent approaches for combinatorial action spaces impose factorized or sequential structures over sub-actions, failing to capture complex joint behavior. We introduce the Sub-Action Interaction Network using Transformers (SAINT), a novel policy architecture that represents multi-component actions as unordered sets and models their dependencies via self-attention conditioned on the global state. SAINT is permutation-invariant, sample-efficient, and compatible with standard policy optimization algorithms. In 20 distinct combinatorial environments across three task domains, including environments with nearly 17 million joint actions, SAINT consistently outperforms strong baselines.

[515] Mining Intrinsic Rewards from LLM Hidden States for Efficient Best-of-N Sampling

Jizhou Guo, Zhaomin Wu, Hanchen Yang, Philip S. Yu

Main category: cs.LG

TL;DR: SWIFT introduces a lightweight method that learns reward functions from LLM hidden states instead of text-based models, achieving better performance with far fewer parameters.

Details

Motivation: Best-of-N sampling depends on massive, text-based reward models that are computationally expensive and data-hungry, overlooking the rich information available in LLM internal hidden states.

Method: SWIFT learns reward functions directly from LLM hidden states at token embedding level using simple linear layers to distinguish between preferred and dispreferred generations, eliminating text-based modeling.

Result: SWIFT outperforms existing baselines (12.7% higher accuracy than EurusRM-7B on MATH) while using less than 0.005% of parameters, shows robust scalability, works with closed-source models via logit access, and combines with traditional reward models.

Conclusion: SWIFT offers a practical, efficient alternative to text-based reward models for LLM post-training by leveraging internal hidden states, significantly reducing computational and data requirements while maintaining or improving performance.

Abstract: Best-of-N sampling is a powerful method for improving Large Language Model (LLM) performance, but it is often limited by its dependence on massive, text-based reward models. These models are not only computationally expensive but also data-hungry, requiring extensive labeled datasets for training. This creates a significant data challenge, as they overlook a rich, readily available data source: the LLM’s own internal hidden states. To address this data and efficiency gap, we introduce SWIFT (Simple Weighted Intrinsic Feedback Technique), a novel and lightweight method that learns a reward function directly from the rich information embedded in LLM hidden states. Operating at the token embedding level, SWIFT employs simple linear layers to effectively distinguish between preferred and dispreferred generations, eliminating the need for computationally intensive text-based modeling. Extensive experiments on standard benchmarks show that SWIFT outperforms existing baselines (12.7% higher accuracy than EurusRM-7B on MATH dataset) while using less than 0.005% of their parameters. Its robust scalability, compatibility with certain closed-source models via logit access, and ability to combine with traditional reward models for additional performance highlight SWIFT’s practical value and contribution to more efficient data-driven LLM post-training. Our code is available at https://github.com/aster2024/SWIFT .

[516] Inverse Q-Learning Done Right: Offline Imitation Learning in $Q^π$-Realizable MDPs

Antoine Moulin, Gergely Neu, Luca Viano

Main category: cs.LG

TL;DR: SPOIL algorithm for offline imitation learning in MDPs with linear Qπ-realizability achieves ε-optimal performance with O(ε⁻²) samples, extended to nonlinear cases with O(ε⁻⁴) complexity, and provides a new critic loss for deep imitation learning.

Details

Motivation: Existing offline imitation learning approaches assume the expert belongs to a tractable class of known policies. This work approaches the problem from a different angle by leveraging structural assumptions about the environment (Qπ-realizability) rather than assumptions about the expert policy class.

Method: Introduces SPOIL (saddle-point offline imitation learning) algorithm for linear Qπ-realizable MDPs, which uses a saddle-point formulation. Extends to nonlinear Qπ-realizable MDPs. Also proposes a new loss function for training critic networks from expert data in deep imitation learning.

Result: SPOIL guarantees matching expert performance up to additive error ε with O(ε⁻²) samples for linear Qπ-realizable MDPs, and O(ε⁻⁴) for nonlinear cases. Neural net implementation outperforms behavior cloning and is competitive with state-of-the-art algorithms on standard benchmarks.

Conclusion: The SPOIL algorithm provides a new approach to offline imitation learning with strong theoretical guarantees and practical performance, leveraging environmental structure rather than expert policy assumptions, with promising empirical results in deep learning settings.

Abstract: We study the problem of offline imitation learning in Markov decision processes (MDPs), where the goal is to learn a well-performing policy given a dataset of state-action pairs generated by an expert policy. Complementing a recent line of work on this topic that assumes the expert belongs to a tractable class of known policies, we approach this problem from a new angle and leverage a different type of structural assumption about the environment. Specifically, for the class of linear $Q^π$-realizable MDPs, we introduce a new algorithm called saddle-point offline imitation learning (\SPOIL), which is guaranteed to match the performance of any expert up to an additive error $\varepsilon$ with access to $\mathcal{O}(\varepsilon^{-2})$ samples. Moreover, we extend this result to possibly nonlinear $Q^π$-realizable MDPs at the cost of a worse sample complexity of order $\mathcal{O}(\varepsilon^{-4})$. Finally, our analysis suggests a new loss function for training critic networks from expert data in deep imitation learning. Empirical evaluations on standard benchmarks demonstrate that the neural net implementation of \SPOIL is superior to behavior cloning and competitive with state-of-the-art algorithms.

[517] Breaking AR’s Sampling Bottleneck: Provable Acceleration via Diffusion Language Models

Gen Li, Changxiao Cai

Main category: cs.LG

TL;DR: The paper develops convergence guarantees for diffusion language models from an information-theoretic perspective, showing that sampling error decays inversely with iteration count and scales linearly with token mutual information, enabling high-quality generation with fewer iterations than sequence length.

Details

Motivation: Diffusion models show strong potential for language generation with parallel sampling advantages over autoregressive models, but theoretical understanding remains underdeveloped. The authors aim to provide rigorous convergence analysis to explain the practical effectiveness of diffusion language models.

Method: The authors develop information-theoretic convergence guarantees for diffusion language models, analyzing sampling error measured by KL divergence. They establish matching upper and lower bounds to show tightness, covering the regime where number of iterations T is less than sequence length L.

Result: The analysis shows sampling error decays inversely with iteration count T and scales linearly with mutual information between tokens. Crucially, high-quality samples can be generated with T < L, breaking the fundamental L-step bottleneck of autoregressive models.

Conclusion: The theoretical results provide novel insights into why diffusion language models work effectively in practice, justifying their ability to generate high-quality text with fewer iterations than sequence length, overcoming limitations of autoregressive generation.

Abstract: Diffusion models have emerged as a powerful paradigm for modern generative modeling, demonstrating strong potential for large language models (LLMs). Unlike conventional autoregressive (AR) models that generate tokens sequentially, diffusion models allow for parallel sampling, offering a promising path to accelerate generation and eliminate the left-to-right generation constraints. Despite their empirical success, theoretical understandings of diffusion language models remain underdeveloped. In this work, we develop convergence guarantees for diffusion language models from an information-theoretic perspective. Our analysis demonstrates that the sampling error, measured by the Kullback-Leibler (KL) divergence, decays inversely with the number of iterations $T$ and scales linearly with the mutual information between tokens in the target text sequence. Crucially, our theory covers the regime $T<L$, where $L$ is the text sequence length. This justifies that high-quality samples can be generated with fewer iterations than $L$, thereby breaking the fundamental sampling bottleneck of $L$ steps required by AR models. We further establish matching upper and lower bounds, up to some constant factor, that shows the tightness of our convergence analysis. These results offer novel theoretical insights into the practical effectiveness of diffusion language models.

[518] Blockchain-Enabled Privacy-Preserving Second-Order Federated Edge Learning in Personalized Healthcare

Anum Nawaz, Muhammad Irfan, Xianjia Yu, Hamad Aldawsari, Rayan Hamza Alsisi, Zhuo Zou, Tomi Westerlund

Main category: cs.LG

TL;DR: Proposes BFEL: a blockchain-enhanced second-order FL framework using optimized FedCurv for personalized healthcare wearables, addressing non-iid data challenges with verifiable aggregation and reduced communication costs.

Details

Motivation: FL addresses security/privacy in health monitoring wearables, but conventional FL struggles with personalized training due to heterogeneous non-iid data from individual physiology and usage patterns. Need for stable, efficient personalized FL with trust/auditability.

Method: Develops BFEL framework based on optimized FedCurv (second-order FL). FedCurv uses Fisher Information Matrix to preserve client-specific knowledge and reduce model drift. Incorporates Ethereum-based model aggregation for trust/verifiability and public key encryption for privacy. Tests with federated CNNs/MLPs on MNIST, CIFAR-10, PathMNIST datasets.

Result: Demonstrates high efficiency, scalability, suitability for edge deployment on wearables, and significant reduction in communication costs. Framework effectively manages personalized training on non-iid heterogeneous data while minimizing communication rounds for target precision convergence.

Conclusion: BFEL provides a verifiable, auditable, optimized second-order FL framework for personalized healthcare systems that addresses non-iid data challenges, reduces communication overhead, and ensures trust/privacy through blockchain integration.

Abstract: Federated learning (FL) is increasingly recognised for addressing security and privacy concerns in traditional cloud-centric machine learning (ML), particularly within personalised health monitoring such as wearable devices. By enabling global model training through localised policies, FL allows resource-constrained wearables to operate independently. However, conventional first-order FL approaches face several challenges in personalised model training due to the heterogeneous non-independent and identically distributed (non-iid) data by each individual’s unique physiology and usage patterns. Recently, second-order FL approaches maintain the stability and consistency of non-iid datasets while improving personalised model training. This study proposes and develops a verifiable and auditable optimised second-order FL framework BFEL (blockchain enhanced federated edge learning) based on optimised FedCurv for personalised healthcare systems. FedCurv incorporates information about the importance of each parameter to each client’s task (through fisher information matrix) which helps to preserve client-specific knowledge and reduce model drift during aggregation. Moreover, it minimizes communication rounds required to achieve a target precision convergence for each client device while effectively managing personalised training on non-iid and heterogeneous data. The incorporation of ethereum-based model aggregation ensures trust, verifiability, and auditability while public key encryption enhances privacy and security. Experimental results of federated CNNs and MLPs utilizing mnist, cifar-10, and PathMnist demonstrate framework’s high efficiency, scalability, suitability for edge deployment on wearables, and significant reduction in communication cost.

[519] When Lower-Order Terms Dominate: Adaptive Expert Algorithms for Heavy-Tailed Losses

Antoine Moulin, Emmanuel Esposito, Dirk van der Hoeven

Main category: cs.LG

TL;DR: Adaptive algorithms for prediction with expert advice under heavy-tailed losses (only second moment bounded by θ) that avoid problematic lower-order terms in existing regret bounds.

Details

Motivation: Existing adaptive algorithms for heavy-tailed losses have lower-order terms (often maximum losses) that can actually dominate regret bounds, scaling as √(KT) even with small θ. This is problematic because these terms can overshadow the main regret term.

Method: Develop adaptive algorithms that don’t require prior knowledge about loss range or second moments. The algorithms avoid dependence on problematic lower-order terms like maximum losses.

Result: Achieve improved regret bounds: O(√(θT log K)) worst-case regret, and O(θ log(KT)/Δ_min) when losses are i.i.d. from fixed distribution (Δ_min is gap between second best and best expert). Also improved bounds for squared loss.

Conclusion: The paper addresses a significant limitation in existing adaptive algorithms for heavy-tailed losses by eliminating problematic lower-order terms that can dominate regret, providing more robust and practical regret guarantees.

Abstract: We consider the problem setting of prediction with expert advice with possibly heavy-tailed losses, i.e. the only assumption on the losses is an upper bound on their second moments, denoted by $θ$. We develop adaptive algorithms that do not require any prior knowledge about the range or the second moment of the losses. Existing adaptive algorithms have what is typically considered a lower-order term in their regret guarantees. We show that this lower-order term, which is often the maximum of the losses, can actually dominate the regret bound in our setting. Specifically, we show that even with small constant $θ$, this lower-order term can scale as $\sqrt{KT}$, where $K$ is the number of experts and $T$ is the time horizon. We propose adaptive algorithms with improved regret bounds that avoid the dependence on such a lower-order term and guarantee $\mathcal{O}(\sqrt{θT\log(K)})$ regret in the worst case, and $\mathcal{O}(θ\log(KT)/Δ_{\min})$ regret when the losses are sampled i.i.d. from some fixed distribution, where $Δ_{\min}$ is the difference between the mean losses of the second best expert and the best expert. Additionally, when the loss function is the squared loss, our algorithm also guarantees improved regret bounds over prior results.

[520] KVmix: Gradient-Based Layer Importance-Aware Mixed-Precision Quantization for KV Cache

Fei Li, Song Liu, Weiguo Wu, Shiqiang Nie, Jinyu Wang

Main category: cs.LG

TL;DR: KVmix: A mixed-precision quantization method for KV Cache that uses gradient-based importance analysis to allocate layer-specific bit-widths and dynamically prioritizes recent pivotal tokens in long-context tasks, achieving near-lossless performance with extreme compression (2.19bit Key, 2.38bit Value) and 4.9x memory reduction.

Details

Motivation: The high memory demands of KV Cache during LLM inference severely restrict deployment on resource-constrained platforms. Existing quantization methods either use static precision allocation or fail to dynamically prioritize critical KV in long-context tasks, forcing tradeoffs between memory, accuracy, and throughput.

Method: KVmix uses gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect model loss, enabling layer-specific bit-width allocation for mixed-precision quantization. It dynamically prioritizes higher precision for important layers while aggressively quantizing less influential ones. Also includes a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, plus efficient low-bit quantization and CUDA kernels.

Result: On LLMs like Llama and Mistral, KVmix achieves near-lossless inference performance with extremely low quantization configuration (Key 2.19bit, Value 2.38bit), while delivering 4.9x memory compression and 5.3x speedup in inference throughput.

Conclusion: KVmix effectively addresses KV Cache memory bottlenecks through intelligent mixed-precision quantization with dynamic importance-based allocation, achieving excellent compression and speedup while maintaining near-lossless performance, making LLMs more deployable on resource-constrained platforms.

Abstract: The high memory demands of the Key-Value (KV) Cache during the inference of Large Language Models (LLMs) severely restrict their deployment in resource-constrained platforms. Quantization can effectively alleviate the memory pressure caused by KV Cache. However, existing methods either rely on static one-size-fits-all precision allocation or fail to dynamically prioritize critical KV in long-context tasks, forcing memory-accuracy-throughput tradeoffs. In this work, we propose a novel mixed-precision quantization method for KV Cache named KVmix. KVmix leverages gradient-based importance analysis to evaluate how individual Key and Value projection matrices affect the model loss, enabling layer-specific bit-width allocation for mix-precision quantization. It dynamically prioritizes higher precision for important layers while aggressively quantizing less influential ones, achieving a tunable balance between accuracy and efficiency. KVmix also introduces a dynamic long-context optimization strategy that adaptively keeps full-precision KV pairs for recent pivotal tokens and compresses older ones, achieving high-quality sequence generation with low memory usage. Additionally, KVmix provides efficient low-bit quantization and CUDA kernels to optimize computational overhead. On LLMs such as Llama and Mistral, KVmix achieves near-lossless inference performance with extremely low quantization configuration (Key 2.19bit Value 2.38bit), while delivering a remarkable 4.9x memory compression and a 5.3x speedup in inference throughput.

[521] Low-rank variational dropout: Rank selection and uncertainty in adapters

Cooper Doyle, Rebecca Chan, Andy Hu, Anna Leontjeva

Main category: cs.LG

TL;DR: BayesLoRA: A Bayesian framework for low-rank adaptation that learns predictive uncertainty and effective adapter rank with minimal overhead, outperforming existing methods in accuracy and calibration.

Details

Motivation: Low-rank adaptation methods enable efficient task-specific updates but lack principled mechanisms for uncertainty estimation and capacity control, limiting their reliability and interpretability.

Method: Introduces Low-Rank Variational Dropout (LRVD), a Bayesian framework with scale-invariant sparsity-inducing priors and structured variational families that tie uncertainty at latent rank components, creating rank-wise noise-to-signal ratios for automatic capacity selection.

Result: BayesLoRA induces stable, non-arbitrary rank structure aligned with intrinsic singular directions, outperforms existing low-rank sparsification methods in accuracy at comparable training cost, and delivers substantially improved predictive calibration with only O(r) additional parameters.

Conclusion: LRVD provides a principled Bayesian approach to low-rank adaptation that simultaneously learns uncertainty and effective rank, offering improved accuracy, calibration, and interpretability with minimal computational overhead.

Abstract: Low-rank adaptation methods enable efficient task-specific updates in large neural networks, but provide no principled mechanism for uncertainty estimation or capacity control. We introduce Low-Rank Variational Dropout (LRVD), a Bayesian framework that operates directly in the space of low-rank adaptation. LRVD employs a scale-invariant, sparsity-inducing prior together with a structured variational family that ties uncertainty at the level of latent rank components, inducing rank-wise noise-to-signal ratios for automatic capacity selection. As a concrete instantiation, we apply LRVD to low-rank adaptation and obtain BayesLoRA, which jointly learns predictive uncertainty and the effective adapter rank with only O(r) additional parameters, where r is the adapter rank. We empirically show that BayesLoRA induces stable, non-arbitrary rank structure aligned with the intrinsic singular directions of the learned updates, and outperforms existing low-rank sparsification methods in accuracy at comparable training cost while delivering substantially improved predictive calibration at negligible additional overhead.

[522] Structure-preserving Lift & Learn: Scientific machine learning for nonlinear conservative partial differential equations

Harsh Sharma, Juan Diego Draxl Giannoni, Boris Kramer

Main category: cs.LG

TL;DR: Structure-preserving Lift & Learn method learns physics-preserving reduced-order models for nonlinear PDEs with conservation laws using energy-quadratization lifting transformations.

Details

Motivation: To develop efficient reduced-order models for nonlinear PDEs with conservation laws that preserve underlying physics while being computationally efficient and generalizable.

Method: Hybrid learning approach using energy-quadratization to transform nonlinear PDEs into equivalent quadratic lifted systems, then analytically deriving quadratic reduced terms and learning linear operators via constrained optimization.

Result: Method yields computationally efficient quadratic reduced-order models that respect physics, demonstrated on three PDE examples with competitive accuracy and efficiency compared to state-of-the-art methods.

Conclusion: Structure-preserving Lift & Learn provides an effective hybrid approach for learning physics-preserving reduced-order models for nonlinear PDEs with conservation laws, offering both accuracy and computational efficiency.

Abstract: This work presents structure-preserving Lift & Learn, a scientific machine learning method that employs lifting variable transformations to learn structure-preserving reduced-order models for nonlinear partial differential equations (PDEs) with conservation laws. We propose a hybrid learning approach based on a recently developed energy-quadratization strategy that uses knowledge of the nonlinearity at the PDE level to derive an equivalent quadratic lifted system with quadratic system energy. The lifted dynamics obtained via energy quadratization are linear in the old variables, making model learning very effective in the lifted setting. Based on the lifted quadratic PDE model form, the proposed method derives quadratic reduced terms analytically and then uses those derived terms to formulate a constrained optimization problem to learn the remaining linear reduced operators in a structure-preserving way. The proposed hybrid learning approach yields computationally efficient quadratic reduced-order models that respect the underlying physics of the high-dimensional problem. We demonstrate the generalizability of quadratic models learned via the proposed structure-preserving Lift & Learn method through three numerical examples: the one-dimensional wave equation with exponential nonlinearity, the two-dimensional sine-Gordon equation, and the two-dimensional Klein-Gordon-Zakharov equations. The numerical results show that the proposed learning approach is competitive with the state-of-the-art structure-preserving data-driven model reduction method in terms of both accuracy and computational efficiency.

[523] Pruning the Unsurprising: Efficient LLM Reasoning via First-Token Surprisal

Wenhao Zeng, Yaoning Wang, Chao Hu, Yuling Shi, Chengcheng Wan, Hongyu Zhang, Xiaodong Gu

Main category: cs.LG

TL;DR: ASAP is a novel coarse-to-fine framework for compressing Chain-of-Thought reasoning traces that preserves logical coherence while reducing training and inference costs.

Details

Motivation: Large Reasoning Models face challenges with excessively long reasoning traces that increase training costs and inference latency. Existing compression methods have trade-offs: token-level methods disrupt coherence, while step-level methods based on perplexity fail to capture logically critical steps due to logical information dilution.

Method: ASAP uses a three-stage approach: 1) Anchor-guided pruning to preserve core reasoning structure and reduce search space, 2) Logic-aware pruning using a novel first-token surprisal metric to select essential reasoning steps (based on insight that logical branching choices concentrate at step onset), and 3) Distillation to enable models to autonomously generate and use concise CoTs at inference.

Result: ASAP achieves state-of-the-art accuracy across multiple benchmarks while substantially reducing both training and inference costs compared to existing methods.

Conclusion: The proposed ASAP framework effectively addresses the limitations of existing CoT compression methods by preserving logical coherence while achieving significant efficiency gains, making it a promising approach for scalable reasoning in large language models.

Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities by scaling up the length of Chain-of-Thought (CoT). However, excessively long reasoning traces pose substantial challenges for training cost and inference latency. While various CoT compression approaches have emerged to address this challenge, they face inherent trade-offs: token-level methods often disrupt syntactic and logical coherence, while step-level methods based on perplexity fail to reliably capture the logically critical reasoning steps because of the dilution of logical information. In this paper, we propose ASAP (Anchor-guided, SurprisAl-based Pruning), a novel coarse-to-fine framework for CoT compression. ASAP first performs anchor-guided pruning to preserve the core reasoning structure, which efficiently reduces the search space for subsequent processing. Leveraging the insight that logical branching choices are concentrated at the onset of reasoning steps, it then enables logic-aware pruning by selecting logically essential reasoning steps based on a novel first-token surprisal metric. Finally, ASAP distills the models to autonomously generate and leverage these concise CoTs at inference time, enabling efficient reasoning. Experiments show that ASAP achieves state-of-the-art accuracy across multiple benchmarks while substantially reducing training and inference costs.

[524] Why Does Stochastic Gradient Descent Slow Down in Low-Precision Training?

Vincent-Daniel Yun

Main category: cs.LG

TL;DR: Low-precision training causes gradient shrinkage that slows SGD convergence by reducing effective stepsize and increasing steady-state error.

Details

Motivation: Low-precision training reduces computational/memory costs but introduces gradient quantization effects that change SGD convergence behavior. Need to understand how gradient shrinkage from quantization affects convergence properties.

Method: Model gradient quantization as shrinkage factor q_k ∈ (0,1] scaling each stochastic gradient. Analyze SGD convergence under this shrinkage model with smoothness and bounded-variance assumptions. Compare effective stepsize μ_k q_k vs usual stepsize μ_k.

Result: Low-precision SGD still converges but slower (pace set by q_min) with higher steady error due to quantization. Gradient shrinkage reduces effective stepsize, slowing convergence when q_min < 1.

Conclusion: Quantization-induced gradient shrinkage fundamentally changes SGD convergence by reducing effective stepsize, leading to slower convergence and higher steady-state error, though convergence is still guaranteed under standard assumptions.

Abstract: Low-precision training has become crucial for reducing the computational and memory costs of large-scale deep learning. However, quantizing gradients introduces magnitude shrinkage, which can change how stochastic gradient descent (SGD) converges. In this study, we explore SGD convergence under a gradient shrinkage model, where each stochastic gradient is scaled by a factor ( q_k \in (0,1] ). We show that this shrinkage affect the usual stepsize ( μ_k ) with an effective stepsize ( μ_k q_k ), slowing convergence when ( q_{\min} < 1 ). With typical smoothness and bounded-variance assumptions, we prove that low-precision SGD still converges, but at a slower pace set by ( q_{\min} ), and with a higher steady error level due to quantization effects. We analyze theoretically how lower numerical precision slows training by treating it as gradient shrinkage within the standard SGD convergence setup.

[525] ASTGI: Adaptive Spatio-Temporal Graph Interactions for Irregular Multivariate Time Series Forecasting

Xvyuan Liu, Xiangfei Qiu, Hanyin Cheng, Xingjian Wu, Chenjuan Guo, Bin Yang, Jilin Hu

Main category: cs.LG

TL;DR: ASTGI is a novel framework for irregular multivariate time series forecasting that addresses data distortion and complex dependency challenges through adaptive spatio-temporal graph construction and propagation.

Details

Motivation: Irregular multivariate time series (IMTS) with asynchronous sampling and irregular intervals pose challenges for existing methods: (1) accurate representation without data distortion, and (2) capturing complex dynamic dependencies between observation points.

Method: ASTGI framework includes: 1) Spatio-Temporal Point Representation module encoding observations as points in learnable embedding space, 2) Neighborhood-Adaptive Graph Construction building causal graphs via nearest neighbor search, 3) Spatio-Temporal Dynamic Propagation iteratively updating information on adaptive graphs, and 4) Query Point-based Prediction aggregating neighborhood information for forecasting.

Result: Extensive experiments on multiple benchmark datasets demonstrate that ASTGI outperforms various state-of-the-art methods.

Conclusion: ASTGI effectively addresses the core challenges of IMTS forecasting by providing accurate representation without distortion and capturing complex spatio-temporal dependencies through adaptive graph interactions.

Abstract: Irregular multivariate time series (IMTS) are prevalent in critical domains like healthcare and finance, where accurate forecasting is vital for proactive decision-making. However, the asynchronous sampling and irregular intervals inherent to IMTS pose two core challenges for existing methods: (1) how to accurately represent the raw information of irregular time series without introducing data distortion, and (2) how to effectively capture the complex dynamic dependencies between observation points. To address these challenges, we propose the Adaptive Spatio-Temporal Graph Interaction (ASTGI) framework. Specifically, the framework first employs a Spatio-Temporal Point Representation module to encode each discrete observation as a point within a learnable spatio-temporal embedding space. Second, a Neighborhood-Adaptive Graph Construction module adaptively builds a causal graph for each point in the embedding space via nearest neighbor search. Subsequently, a Spatio-Temporal Dynamic Propagation module iteratively updates information on these adaptive causal graphs by generating messages and computing interaction weights based on the relative spatio-temporal positions between points. Finally, a Query Point-based Prediction module generates the final forecast by aggregating neighborhood information for a new query point and performing regression. Extensive experiments on multiple benchmark datasets demonstrate that ASTGI outperforms various state-of-the-art methods.

[526] Towards generalizable deep ptychography neural networks

Albert Vong, Steven Henke, Oliver Hoidn, Hanna Ruth, Junjing Deng, Alexander Hexemer, David Shapiro, Apurva Mehta, Arianna Gleason, Levi Hancock, Nicholas Schwarz

Main category: cs.LG

TL;DR: Unsupervised training workflow for X-ray ptychography using probe learning with synthetic objects enables multi-beamline generalization and real-time reconstruction.

Details

Motivation: X-ray ptychography needs real-time feedback at next-gen light sources, but existing deep learning approaches lack robustness across diverse experimental conditions.

Method: Probe-centric unsupervised training combining experimentally-measured probes with procedurally generated synthetic objects, using physics-informed neural networks.

Result: Single network can reconstruct unseen experiments across multiple beamlines; probe learning is as important as in-distribution learning; achieves comparable fidelity to experimental-data-trained models.

Conclusion: The approach enables training of experiment-steering models for real-time feedback under dynamic experimental conditions at next-generation light sources.

Abstract: X-ray ptychography is a data-intensive imaging technique expected to become ubiquitous at next-generation light sources delivering many-fold increases in coherent flux. The need for real-time feedback under accelerated acquisition rates motivates surrogate reconstruction models like deep neural networks, which offer orders-of-magnitude speedup over conventional methods. However, existing deep learning approaches lack robustness across diverse experimental conditions. We propose an unsupervised training workflow emphasizing probe learning by combining experimentally-measured probes with synthetic, procedurally generated objects. This probe-centric approach enables a single physics-informed neural network to reconstruct unseen experiments across multiple beamlines; among the first demonstrations of multi-probe generalization. We find probe learning is equally important as in-distribution learning; models trained using this synthetic workflow achieve reconstruction fidelity comparable to those trained exclusively on experimental data, even when changing the type of synthetic training object. The proposed approach enables training of experiment-steering models that provide real-time feedback under dynamic experimental conditions.

[527] Forking-Sequences

Willa Potosnak, Malcolm Wolff, Mengfei Cao, Ruijun Ma, Tatiana Konstantinova, Dmitry Efimov, Michael W. Mahoney, Boris Oreshkin, Kin G. Olivares

Main category: cs.LG

TL;DR: Forking-sequences neural architecture improves forecast stability across creation dates while maintaining accuracy, outperforming conventional independent processing methods.

Details

Motivation: Forecast stability across different forecast creation dates is crucial for reliable decision-making, as erratic revisions between dates can disrupt downstream processes even with accurate models.

Method: Introduces forking-sequences architecture that jointly encodes and decodes entire time series across all forecast creation dates, producing complete multi-horizon forecast grid in single forward pass, contrasting with conventional independent processing methods.

Result: Significant accuracy improvements across datasets: 29.7% (MLP), 46.2% (RNN), 49.3% (LSTM), 28.6% (CNN), 24.7% (Transformer), 6.4% (StateSpace). Forecast ensembling improved stability by 10.8-13.2% while maintaining accuracy.

Conclusion: Forking-sequences architecture provides substantial benefits in forecast stability, training consistency, and computational efficiency, making it a superior approach for time series forecasting compared to conventional window-sampling methods.

Abstract: While accuracy is a critical requirement for time series forecasting, an equally important desideratum is forecast stability across forecast creation dates (FCDs). Even highly accurate models can produce erratic revisions between FCDs, disrupting downstream decision-making. To improve forecast stability of such revisions, several state-of-the-art models including MQCNN, MQT, and SPADE employ a powerful yet underexplored neural network architectural design known as forking-sequences. This architectural design jointly encodes and decodes the entire time series across all FCDs, producing an entire multi-horizon forecast grid in a single forward pass. This approach contrasts with conventional neural forecasting methods that process FCDs independently, generating only a single multi-horizon forecast per forward pass. In this work, we formalize the forking-sequences design and motivate its broader adoption by introducing a metric for quantifying excess volatility in forecast revisions and by providing theoretical and empirical analysis. We theoretically motivate three key benefits of forking-sequences: (i) increased forecast stability through ensembling; (ii) gradient variance reduction, leading to more stable and consistent training steps; and (iii) improved computational efficiency during inference. We validate the benefits of forking-sequences compared to baseline window-sampling on the M-series benchmark, using 16 datasets from the M1, M3, M4, and Tourism competitions. We observe median accuracy improvements across datasets of 29.7%, 46.2%, 49.3%, 28.6%, 24.7%, and 6.4% for MLP, RNN, LSTM, CNN, Transformer, and StateSpace-based architectures, respectively. We then show that forecast ensembling during inference can improve median forecast stability by 10.8%, 13.2%, 13.0%, 10.9%, 10.2%, and 11.2% for these respective models trained with forking-sequences, while maintaining accuracy.

[528] PEAR: Planner-Executor Agent Robustness Benchmark

Shen Dong, Mingxuan Zhang, Pengfei He, Li Ma, Bhavani Thuraisingham, Hui Liu, Yue Xing

Main category: cs.LG

TL;DR: PEAR is a benchmark for evaluating utility and vulnerability of planner-executor multi-agent systems, revealing key insights about performance-robustness trade-offs and planner vulnerabilities.

Details

Motivation: LLM-based Multi-Agent Systems are powerful but vulnerable to adversarial attacks. Existing research lacks holistic understanding of MAS vulnerabilities, focusing only on isolated attack surfaces or specific scenarios.

Method: Introduces PEAR benchmark for systematic evaluation of planner-executor MAS architectures. Focuses on practical planner-executor design while being compatible with various MAS architectures. Conducts extensive experiments to analyze vulnerabilities.

Result: Four key findings: (1) Weak planner degrades clean task performance more than weak executor; (2) Memory module essential for planner but not for executor; (3) Trade-off between task performance and robustness; (4) Attacks targeting planner are particularly effective.

Conclusion: The findings provide actionable insights for enhancing MAS robustness and lay groundwork for principled defenses in multi-agent settings, addressing the gap in holistic vulnerability understanding.

Abstract: Large Language Model (LLM)-based Multi-Agent Systems (MAS) have emerged as a powerful paradigm for tackling complex, multi-step tasks across diverse domains. However, despite their impressive capabilities, MAS remain susceptible to adversarial manipulation. Existing studies typically examine isolated attack surfaces or specific scenarios, leaving a lack of holistic understanding of MAS vulnerabilities. To bridge this gap, we introduce PEAR, a benchmark for systematically evaluating both the utility and vulnerability of planner-executor MAS. While compatible with various MAS architectures, our benchmark focuses on the planner-executor structure, which is a practical and widely adopted design. Through extensive experiments, we find that (1) a weak planner degrades overall clean task performance more severely than a weak executor; (2) while a memory module is essential for the planner, having a memory module for the executor does not impact the clean task performance; (3) there exists a trade-off between task performance and robustness; and (4) attacks targeting the planner are particularly effective at misleading the system. These findings offer actionable insights for enhancing the robustness of MAS and lay the groundwork for principled defenses in multi-agent settings.

[529] GAPO: Robust Advantage Estimation for Real-World Code LLMs

Jianqing Zhang, Zhezheng Hao, Wei Xia, Hande Dong, Hong Wang, Chenxing Wei, Yuyan Zhou, Yubin Qi, Qiang Lin, Jian Cao

Main category: cs.LG

TL;DR: GAPO improves RL for code editing by adaptively finding high-SNR intervals and using median Q values to handle skewed reward distributions and noise, outperforming GRPO/DAPO on real-world tasks.

Details

Motivation: Real-world code editing scenarios often have skewed reward distributions with unpredictable noise, which distorts advantage computation and causes rollout outliers in existing group-relative RL methods like GRPO.

Method: Group Adaptive Policy Optimization (GAPO) adaptively finds intervals with highest Signal-to-Noise Ratio per prompt, then uses the median of that interval as an adaptive Q to replace group mean in advantage calculation, reducing noise while remaining plug-and-play.

Result: GAPO achieves up to 4.35 in-domain and 5.30 out-of-domain exact-match improvements over GRPO/DAPO on 51,844 real-world code-editing tasks across 10 languages, with lower clipping ratios and higher GPU throughput.

Conclusion: GAPO effectively handles noisy reward distributions in code editing RL, providing robust performance improvements while maintaining efficiency and plug-and-play compatibility with existing methods.

Abstract: Reinforcement learning (RL) is widely used for post-training large language models (LLMs) in code editing, where group-relative methods, such as GRPO, are popular due to their critic-free and normalized advantage estimation. However, in real-world code-editing scenarios, reward distributions are often skewed with unpredictable noise, leading to distorted advantage computation and increased rollout outliers. To address this issue, we propose Group Adaptive Policy Optimization (GAPO), which adaptively finds an interval with the highest SNR (Signal to Noise Ratio) per prompt and uses the median of that interval as an adaptive Q to replace the group mean in advantage calculation to reduce noise further. This adaptive Q robustly handles rollout noise while remaining plug-and-play and efficient. We evaluate GAPO on nine instruction-tuned LLMs (3B-14B) using a collected large dataset of 51,844 real-world, history-aware code-editing tasks spanning 10 programming languages. GAPO yields up to 4.35 in-domain (ID) and 5.30 out-of-domain (OOD) exact-match improvements over GRPO and its variant DAPO, while achieving lower clipping ratios and higher GPU throughput. Code: https://github.com/TsingZ0/verl-GAPO.

[530] OpenEM: Large-scale multi-structural 3D datasets for electromagnetic methods

Shuang Wang, Xuben Wang, Fei Deng, Peifan Jiang, Jian Chen, Gianluca Fiandaca

Main category: cs.LG

TL;DR: OpenEM is a large-scale, multi-structural 3D geoelectric dataset with nine categories of geologically plausible subsurface structures, created to address the lack of standardized datasets for deep learning in electromagnetic exploration.

Details

Motivation: Existing deep learning applications in electromagnetic methods rely on datasets from random 1D or simple 3D models that don't represent real geological environments, and there's a lack of standardized, publicly available 3D geoelectric datasets hindering progress.

Method: Created OpenEM dataset with nine categories of geologically plausible subsurface structures (anomalous bodies in half-space, flat layers, folded layers, flat faults, curved faults, and variants with anomalous bodies). Developed a deep learning-based fast forward modeling approach to enable efficient forward modeling across the entire dataset since 3D forward modeling in electromagnetics is extremely time-consuming.

Result: OpenEM provides a unified, comprehensive, and large-scale dataset for common EM exploration systems, publicly available at https://doi.org/10.5281/zenodo.17141981, enabling rapid deployment for a wide range of tasks.

Conclusion: OpenEM addresses the limitations of existing datasets and accelerates the application of deep learning in electromagnetic methods by providing a standardized, geologically realistic 3D dataset with efficient forward modeling capabilities.

Abstract: Electromagnetic methods have become one of the most widely used techniques in geological exploration. With the remarkable success of deep learning, applying such techniques to EM methods has emerged as a promising research direction to overcome the limitations of conventional approaches. The effectiveness of deep learning methods depends heavily on the quality of datasets, which directly influences model performance and generalization ability. Existing application studies often construct datasets from random one-dimensional or structurally simple three-dimensional models, which fail to represent the real geological environments. Furthermore, the absence of standardized, publicly 3D geoelectric datasets continues to hinder progress in deep learning based EM exploration. To address these limitations, we present OpenEM, a large-scale, multi-structural three-dimensional geoelectric dataset that encompasses a broad range of geologically plausible subsurface structures. OpenEM consists of nine categories of geoelectric models, spanning from simple configurations with anomalous bodies in half-space to more complex structures such as flat layers, folded layers, flat faults, curved faults, and their corresponding variants with anomalous bodies. Since three-dimensional forward modeling in electromagnetics is extremely time-consuming, we further developed a deep learning based fast forward modeling approach for OpenEM, enabling efficient and reliable forward modeling across the entire dataset. This capability allows OpenEM to be rapidly deployed for a wide range of tasks. OpenEM provides a unified, comprehensive, and large-scale dataset for common EM exploration systems to accelerate the application of deep learning in electromagnetic methods.The complete dataset is publicly available at https://doi.org/10.5281/zenodo.17141981.

[531] MoEMeta: Mixture-of-Experts Meta Learning for Few-Shot Relational Learning

Han Wu, Jie Yin

Main category: cs.LG

TL;DR: MoEMeta: A meta-learning framework for few-shot KG relational learning that disentangles globally shared knowledge from task-specific contexts using mixture-of-experts and task-tailored adaptation.

Details

Motivation: Existing meta-learning approaches for few-shot KG relational learning have two key limitations: 1) they learn relation meta-knowledge in isolation without capturing common relational patterns across tasks, and 2) they struggle to incorporate local, task-specific contexts crucial for rapid adaptation.

Method: Proposes MoEMeta with two key innovations: 1) a mixture-of-experts (MoE) model that learns globally shared relational prototypes to enhance generalization, and 2) a task-tailored adaptation mechanism that captures local contexts for fast task-specific adaptation.

Result: Extensive experiments on three KG benchmarks show MoEMeta consistently outperforms existing baselines and achieves state-of-the-art performance.

Conclusion: By balancing global generalization with local adaptability, MoEMeta significantly advances few-shot relational learning in knowledge graphs.

Abstract: Few-shot knowledge graph relational learning seeks to perform reasoning over relations given only a limited number of training examples. While existing approaches largely adopt a meta-learning framework for enabling fast adaptation to new relations, they suffer from two key pitfalls. First, they learn relation meta-knowledge in isolation, failing to capture common relational patterns shared across tasks. Second, they struggle to effectively incorporate local, task-specific contexts crucial for rapid adaptation. To address these limitations, we propose MoEMeta, a novel meta-learning framework that disentangles globally shared knowledge from task-specific contexts to enable both effective model generalization and rapid adaptation. MoEMeta introduces two key innovations: (i) a mixture-of-experts (MoE) model that learns globally shared relational prototypes to enhance generalization, and (ii) a task-tailored adaptation mechanism that captures local contexts for fast task-specific adaptation. By balancing global generalization with local adaptability, MoEMeta significantly advances few-shot relational learning. Extensive experiments and analyses on three KG benchmarks show that MoEMeta consistently outperforms existing baselines, achieving state-of-the-art performance.

[532] Simplex-FEM Networks (SiFEN): Learning A Triangulated Function Approximator

Chaymae Yahyati, Ismail Lamaakal, Khalid El Makkaoui, Ibrahim Ouahbi, Yassine Maleh

Main category: cs.LG

TL;DR: SiFEN is a learned piecewise-polynomial predictor using finite-element fields on learned simplicial meshes, offering explicit locality, controllable smoothness, and cache-friendly sparsity with theoretical approximation guarantees.

Details

Motivation: To create a compact, interpretable, and theoretically grounded alternative to dense MLPs and edge-spline networks that provides explicit locality, controllable smoothness, and better computational efficiency.

Method: Represents functions as globally C^r finite-element fields on learned simplicial meshes with optional input space warping. Uses degree-m Bernstein-Bezier polynomials with barycentric coordinates, training end-to-end with shape regularization, semi-discrete OT coverage, and differentiable edge flips.

Result: Achieves classic FEM approximation rate M^(-m/d) with M mesh vertices. Empirically matches or surpasses MLPs and KANs at matched parameter budgets, improves calibration (lower ECE/Brier), and reduces inference latency due to geometric locality.

Conclusion: SiFEN provides a compact, interpretable, theoretically grounded alternative to dense MLPs with better computational efficiency, calibration, and explicit locality properties.

Abstract: We introduce Simplex-FEM Networks (SiFEN), a learned piecewise-polynomial predictor that represents f: R^d -> R^k as a globally C^r finite-element field on a learned simplicial mesh in an optionally warped input space. Each query activates exactly one simplex and at most d+1 basis functions via barycentric coordinates, yielding explicit locality, controllable smoothness, and cache-friendly sparsity. SiFEN pairs degree-m Bernstein-Bezier polynomials with a light invertible warp and trains end-to-end with shape regularization, semi-discrete OT coverage, and differentiable edge flips. Under standard shape-regularity and bi-Lipschitz warp assumptions, SiFEN achieves the classic FEM approximation rate M^(-m/d) with M mesh vertices. Empirically, on synthetic approximation tasks, tabular regression/classification, and as a drop-in head on compact CNNs, SiFEN matches or surpasses MLPs and KANs at matched parameter budgets, improves calibration (lower ECE/Brier), and reduces inference latency due to geometric locality. These properties make SiFEN a compact, interpretable, and theoretically grounded alternative to dense MLPs and edge-spline networks.

[533] FAQNAS: FLOPs-aware Hybrid Quantum Neural Architecture Search using Genetic Algorithm

Muhammad Kashif, Shaf Khalid, Alberto Marchisio, Nouhaila Innan, Muhammad Shafique

Main category: cs.LG

TL;DR: FAQNAS is a FLOPs-aware neural architecture search framework for Hybrid Quantum Neural Networks that optimizes both accuracy and computational complexity, showing quantum FLOPs dominate accuracy improvements while classical FLOPs remain stable.

Details

Motivation: In the NISQ era, HQNNs are trained on classical simulators where FLOPs directly impact runtime and scalability, making FLOPs a practical metric for computational complexity that should be explicitly considered in HQNN design.

Method: FAQNAS formulates HQNN design as a multi-objective optimization problem balancing accuracy and FLOPs, explicitly incorporating FLOPs into the optimization objective to discover architectures that minimize computational cost while maintaining performance.

Result: Experiments on five benchmark datasets show quantum FLOPs dominate accuracy improvements while classical FLOPs remain largely fixed. Pareto-optimal solutions achieve competitive accuracy with significantly reduced computational cost compared to FLOPs-agnostic baselines.

Conclusion: FLOPs-awareness is established as a practical criterion for HQNN design in the NISQ era and as a scalable principle for future HQNN systems, enabling efficient architecture discovery that balances performance and computational cost.

Abstract: Hybrid Quantum Neural Networks (HQNNs), which combine parameterized quantum circuits with classical neural layers, are emerging as promising models in the noisy intermediate-scale quantum (NISQ) era. While quantum circuits are not naturally measured in floating point operations (FLOPs), most HQNNs (in NISQ era) are still trained on classical simulators where FLOPs directly dictate runtime and scalability. Hence, FLOPs represent a practical and viable metric to measure the computational complexity of HQNNs. In this work, we introduce FAQNAS, a FLOPs-aware neural architecture search (NAS) framework that formulates HQNN design as a multi-objective optimization problem balancing accuracy and FLOPs. Unlike traditional approaches, FAQNAS explicitly incorporates FLOPs into the optimization objective, enabling the discovery of architectures that achieve strong performance while minimizing computational cost. Experiments on five benchmark datasets (MNIST, Digits, Wine, Breast Cancer, and Iris) show that quantum FLOPs dominate accuracy improvements, while classical FLOPs remain largely fixed. Pareto-optimal solutions reveal that competitive accuracy can often be achieved with significantly reduced computational cost compared to FLOPs-agnostic baselines. Our results establish FLOPs-awareness as a practical criterion for HQNN design in the NISQ era and as a scalable principle for future HQNN systems.

[534] Enhancing Graph Representations with Neighborhood-Contextualized Message-Passing

Brian Godwin Lim, Galvin Brice Lim, Renzo Roel Tan, Irwin King, Kazushi Ikeda

Main category: cs.LG

TL;DR: Proposes neighborhood-contextualized message-passing (NCMP) framework to enhance GNN expressivity by incorporating broader local neighborhood context, implemented as SINC-GCN model with competitive performance.

Details

Motivation: Standard message-passing GNNs only consider pair-wise messages between center node and individual neighbors, failing to capture contextual information from the broader local neighborhood, which limits their ability to learn complex relationships among neighboring nodes.

Method: Formalizes neighborhood-contextualization concept inspired by attentional GNNs, generalizes message-passing to NCMP framework, and implements SINC-GCN as a practical parametrization that efficiently operationalizes NCMP.

Result: SINC-GCN demonstrates competitive performance against baseline GNN models across diverse synthetic and benchmark datasets, with substantial and statistically significant performance gains in graph property prediction tasks.

Conclusion: The NCMP framework provides a practical path to enhance graph representational power of classical GNNs by incorporating neighborhood context, with SINC-GCN showing promising expressivity and efficiency.

Abstract: Graph neural networks (GNNs) have become an indispensable tool for analyzing relational data. Classical GNNs are broadly classified into three variants: convolutional, attentional, and message-passing. While the standard message-passing variant is expressive, its typical pair-wise messages only consider the features of the center node and each neighboring node individually. This design fails to incorporate contextual information contained within the broader local neighborhood, potentially hindering its ability to learn complex relationships within the entire set of neighboring nodes. To address this limitation, this work first formalizes the concept of neighborhood-contextualization, rooted in a key property of the attentional variant. This then serves as the foundation for generalizing the message-passing variant to the proposed neighborhood-contextualized message-passing (NCMP) framework. To demonstrate its utility, a simple, practical, and efficient method to parametrize and operationalize NCMP is presented, leading to the development of the proposed Soft-Isomorphic Neighborhood-Contextualized Graph Convolution Network (SINC-GCN). Across a diverse set of synthetic and benchmark GNN datasets, SINC-GCN demonstrates competitive performance against baseline GNN models, highlighting its expressivity and efficiency. Notably, it also delivers substantial and statistically significant performance gains in graph property prediction tasks, further underscoring the distinctive utility of neighborhood-contextualization. Overall, the paper lays the foundation for the NCMP framework as a practical path toward enhancing the graph representational power of classical GNNs.

[535] N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator

Zheyu Lin, Jirui Yang, Yukui Qiu, Hengqi Guo, Yubing Bao, Yao Guan

Main category: cs.LG

TL;DR: N-GLARE is a non-generative, latent representation-efficient LLM safety evaluator that analyzes hidden layer dynamics without full text generation, achieving 99% cost reduction compared to traditional red teaming.

Details

Motivation: Current red teaming methods for LLM safety evaluation are costly, rely on online generation and black-box analysis, and suffer from feedback latency, making them unsuitable for agile diagnostics after training new models.

Method: N-GLARE operates on latent representations, analyzing Angular-Probabilistic Trajectory (APT) of hidden layers and introducing Jensen-Shannon Separability (JSS) metric to characterize safety dynamics without text generation.

Result: Experiments on 40+ models and 20+ red teaming strategies show JSS metric has high consistency with safety rankings from red teaming, reproducing discriminative trends at <1% token and runtime cost.

Conclusion: N-GLARE provides an efficient, output-free evaluation proxy for real-time LLM safety diagnostics, enabling cost-effective safety robustness assessment without generation overhead.

Abstract: Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream Red Teaming methods rely on online generation and black-box output analysis. These approaches are not only costly but also suffer from feedback latency, making them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model’s latent representations, bypassing the need for full text generation. It characterizes hidden layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations and introducing the JSS (Jensen-Shannon Separability) metric. Experiments on over 40 models and 20 red teaming strategies demonstrate that the JSS metric exhibits high consistency with the safety rankings derived from Red Teaming. N-GLARE reproduces the discriminative trends of large-scale red-teaming tests at less than 1% of the token cost and the runtime cost, providing an efficient output-free evaluation proxy for real-time diagnostics.

Yansong Liu, Ronnie Stafford, Pramit Khetrapal, Huriye Kocadag, Graça Carvalho, Patricia de Winter, Maryam Imran, Amelia Snook, Adamos Hadjivasiliou, D. Vijay Anand, Weining Lin, John Kelly, Yukun Zhou, Ivana Drobnjak

Main category: cs.LG

TL;DR: Multi-modal AI framework for remote patient monitoring in cancer care achieves 83.9% accuracy in forecasting adverse events using wearable sensors, surveys, and clinical data.

Details

Motivation: Cancer patients face uncertainties and unmonitored side effects between clinic visits, creating a care gap that needs bridging through remote monitoring.

Method: Developed multi-modal AI framework integrating HALO-X platform data (demographics, wearables, daily surveys, clinical events) with adapted models to handle asynchronous, incomplete real-world RPM data for continuous risk forecasting.

Result: Achieved 83.9% accuracy (AUROC=0.70) in observational trial with 84 patients and 2.1M data points; identified key predictive features: previous treatments, wellness check-ins, and daily maximum heart rate; demonstrated early warning capability in case study.

Conclusion: Establishes feasibility of multi-modal AI RPM for cancer care and provides path toward proactive patient support, earning Best Paper Poster Award at NeurIPS 2025 workshop.

Abstract: For patients undergoing systemic cancer therapy, the time between clinic visits is full of uncertainties and risks of unmonitored side effects. To bridge this gap in care, we developed and prospectively trialed a multi-modal AI framework for remote patient monitoring (RPM). This system integrates multi-modal data from the HALO-X platform, such as demographics, wearable sensors, daily surveys, and clinical events. Our observational trial is one of the largest of its kind and has collected over 2.1 million data points (6,080 patient-days) of monitoring from 84 patients. We developed and adapted a multi-modal AI model to handle the asynchronous and incomplete nature of real-world RPM data, forecasting a continuous risk of future adverse events. The model achieved an accuracy of 83.9% (AUROC=0.70). Notably, the model identified previous treatments, wellness check-ins, and daily maximum heart rate as key predictive features. A case study demonstrated the model’s ability to provide early warnings by outputting escalating risk profiles prior to the event. This work establishes the feasibility of multi-modal AI RPM for cancer care and offers a path toward more proactive patient support.(Accepted at Europe NeurIPS 2025 Multimodal Representation Learning for Healthcare Workshop. Best Paper Poster Award.)

[537] Renormalizable Spectral-Shell Dynamics as the Origin of Neural Scaling Laws

Yizhou Zhang

Main category: cs.LG

TL;DR: The paper derives macroscopic structure of deep network training from gradient descent in function space, showing training error evolves via a time-dependent operator. Using spectral analysis and shell coarse-graining, it explains neural scaling laws and double descent as self-similar solutions of shell dynamics.

Details

Motivation: To understand the macroscopic structure underlying deep network training despite highly nonlinear optimization dynamics, and to unify phenomena like neural scaling laws and double descent within a single theoretical framework.

Method: Derive training dynamics from gradient descent in function space, analyze via Kato perturbation theory to get modewise ODEs, introduce logarithmic spectral-shell coarse-graining to track error energy, and assume renormalizable shell-dynamics with power-law spectral transport.

Result: Obtains exact coupled ODEs in eigenbasis of time-dependent operator, shows microscopic interactions cancel within shells, derives self-similar solutions with moving resolution frontier, and explains neural scaling laws and double descent as limits of spectral-shell dynamics.

Conclusion: The framework unifies lazy (NTK-like) training and feature learning as two limits of the same spectral-shell dynamics, providing a comprehensive theoretical explanation for observed macroscopic training phenomena in deep networks.

Abstract: Neural scaling laws and double-descent phenomena suggest that deep-network training obeys a simple macroscopic structure despite highly nonlinear optimization dynamics. We derive such structure directly from gradient descent in function space. For mean-squared error loss, the training error evolves as $\dot e_t=-M(t)e_t$ with $M(t)=J_{θ(t)}J_{θ(t)}^{!*}$, a time-dependent self-adjoint operator induced by the network Jacobian. Using Kato perturbation theory, we obtain an exact system of coupled modewise ODEs in the instantaneous eigenbasis of $M(t)$. To extract macroscopic behavior, we introduce a logarithmic spectral-shell coarse-graining and track quadratic error energy across shells. Microscopic interactions within each shell cancel identically at the energy level, so shell energies evolve only through dissipation and external inter-shell interactions. We formalize this via a \emph{renormalizable shell-dynamics} assumption, under which cumulative microscopic effects reduce to a controlled net flux across shell boundaries. Assuming an effective power-law spectral transport in a relevant resolution range, the shell dynamics admits a self-similar solution with a moving resolution frontier and explicit scaling exponents. This framework explains neural scaling laws and double descent, and unifies lazy (NTK-like) training and feature learning as two limits of the same spectral-shell dynamics.

[538] Tiny Recursive Models on ARC-AGI-1: Inductive Biases, Identity Conditioning, and Test-Time Compute

Antonio Roye-Azar, Santiago Vargas-Naranjo, Dhruv Ghai, Nithin Balamurugan, Rayan Amir

Main category: cs.LG

TL;DR: TRM’s strong ARC performance comes more from test-time compute (augmentation/ensembling), task-specific conditioning, and efficiency than deep recursive reasoning.

Details

Motivation: To understand what drives TRM's performance on ARC tasks - whether it's truly recursive reasoning or other factors like test-time compute, task conditioning, or efficiency.

Method: Empirical analysis of ARC Prize TRM checkpoint on ARC-AGI-1 through: 1) test-time augmentation/ensembling ablation, 2) puzzle-identity ablation, 3) recursion trajectory analysis, 4) early training experiments with different augmentation regimes, and 5) efficiency comparison with Llama 3 8B QLoRA.

Result: 1) 1000-sample voting improves Pass@1 by ~11pp over single inference; 2) Zero accuracy without correct puzzle ID; 3) Most accuracy achieved at first recursion step, shallow effective recursion; 4) Heavy augmentation broadens solution distribution; 5) TRM has much higher throughput and lower memory than Llama 3 8B QLoRA.

Conclusion: TRM’s ARC-AGI-1 performance stems from efficiency, task-specific conditioning, and aggressive test-time compute rather than deep internal recursive reasoning.

Abstract: Tiny Recursive Models (TRM) were proposed as a parameter-efficient alternative to large language models for solving Abstraction and Reasoning Corpus (ARC) style tasks. The original work reports strong performance and suggests that recursive latent updates enable non-trivial reasoning, but it remains unclear how much of this performance stems from architecture, test-time compute, or task-specific priors. In this technical note, we empirically analyze the ARC Prize TRM checkpoint on ARC-AGI-1 and report four behavioral findings and an efficiency comparison. First, we show that test-time augmentation and majority-vote ensembling account for a substantial fraction of reported performance: the 1000-sample voting pipeline improves Pass@1 by about 11 percentage points over single-pass canonical inference. Second, a puzzle-identity ablation reveals strict dependence on task identifiers: replacing the correct puzzle ID with a blank or random token yields zero accuracy. Third, a recursion trajectory analysis shows that most of the final accuracy is achieved at the first recursion step and that performance saturates after few latent updates, indicating shallow effective recursion. Fourth, early-stage training experiments under canonical versus heavy augmentation regimes suggest that heavy augmentation broadens the distribution of candidate solutions and improves multi-sample success. Finally, we compare TRM with a naive QLoRA fine-tune of Llama 3 8B on canonical ARC-AGI-1, finding that TRM’s non-autoregressive design achieves much higher throughput and substantially lower memory usage in this setting. Overall, TRM’s ARC-AGI-1 performance appears to arise from an interaction between efficiency, task-specific conditioning, and aggressive test-time compute rather than deep internal reasoning.

[539] CALM: A CKA-Guided Adaptive Layer-Wise Modularization Framework for LLM Quantization

Jinhao Zhang, Yunquan Zhang, Daning Chen, JunSun, Zicheng Yan

Main category: cs.LG

TL;DR: CALM is a fine-tuning-free framework that uses CKA metric to automatically select optimal quantization strategies per layer for LLMs, outperforming uniform quantization and mixed-precision methods.

Details

Motivation: Current post-training quantization methods use uniform strategies across all layers, ignoring significant differences in algorithmic suitability between layers, leading to suboptimal quantization performance.

Method: CALM independently evaluates multiple PTQ algorithms per layer, uses Linear Centered Kernel Alignment (CKA) as a metric to automatically select optimal quantization strategy for each layer, then integrates these strategies to build a hybrid quantized model.

Result: CALM consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMs (LLaMA and Qwen) in terms of perplexity (PPL) and downstream task performance.

Conclusion: Layer-wise adaptive quantization using CKA-guided strategy selection provides superior performance over uniform approaches, offering a fine-tuning-free, plug-and-play solution for algorithmic heterogeneous quantization of large language models.

Abstract: Current mainstream post-training quantization methods for large language models typically apply a uniform quantization strategy across all network layers, overlooking the substantial differences in algorithmic suitability among layers. To address this limitation, we propose CALM (A CKA-guided Adaptive Layer-wise Modularization)a fine-tuning-free, plug-and-play framework for algorithmic heterogeneous quantization. CALM independently evaluates multiple PTQ algorithms on each layer and employs Linear Centered Kernel Alignment (CKA) as a metric to automatically select the optimal quantization strategy per layer. The individually optimized strategies are then integrated to construct a hybrid quantized model. Experiments demonstrate that our approach consistently outperforms both uniform quantization baselines and state-of-the-art mixed-precision methods across mainstream LLMsincluding LLaMA and Qwenin terms of perplexity (PPL) and downstream task performance.

[540] PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation

Yuma Ichikawa, Naoya Takagi, Takumi Nakagawa, Yuzi Kanazawa, Akira Sakai

Main category: cs.LG

TL;DR: PHOTON replaces Transformer’s horizontal token-by-token scanning with vertical multi-resolution context scanning using hierarchical latent streams, achieving significantly higher throughput per memory unit.

Details

Motivation: Transformers suffer from increasing prefill latency and memory-bound long-context decoding due to KV-cache reads/writes dominating inference time over arithmetic operations.

Method: Hierarchical autoregressive model with bottom-up encoder compressing tokens into low-rate contextual states and lightweight top-down decoders reconstructing token representations in parallel, plus recursive generation updating only coarsest latent stream.

Result: PHOTON achieves superior throughput-quality trade-off compared to Transformer-based models, with up to 1000× higher throughput per unit memory, especially beneficial for long-context and multi-query tasks.

Conclusion: PHOTON’s vertical hierarchical approach effectively addresses Transformer’s memory-bound inference limitations, offering substantial performance improvements for long-context language modeling.

Abstract: Transformers operate as horizontal token-by-token scanners; at each generation step, attending to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding more memory-bound, as KV-cache reads and writes dominate inference time over arithmetic operations. We propose Parallel Hierarchical Operation for TOp-down Networks (PHOTON), a hierarchical autoregressive model that replaces horizontal scanning with vertical, multi-resolution context scanning. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations in parallel. We further introduce recursive generation that updates only the coarsest latent stream and eliminates bottom-up re-encoding. Experimental results show that PHOTON is superior to competitive Transformer-based language models regarding the throughput-quality trade-off, providing advantages in long-context and multi-query tasks. In particular, this reduces decode-time KV-cache traffic, yielding up to $10^{3}\times$ higher throughput per unit memory.

[541] kooplearn: A Scikit-Learn Compatible Library of Algorithms for Evolution Operator Learning

Giacomo Turri, Grégoire Pacreau, Giacomo Meanti, Timothée Devergne, Daniel Ordonez, Erfan Mirzaei, Bruno Belucci, Karim Lounici, Vladimir Kostic, Massimiliano Pontil, Pietro Novelli

Main category: cs.LG

TL;DR: kooplearn is a Python library for learning dynamical operators (Koopman/Transfer operators and generators) using ML methods, with scikit-learn API compatibility and benchmark datasets.

Details

Motivation: To provide a unified, accessible tool for learning dynamical operators from data, enabling spectral analysis, reduced-order modeling, and forecasting of dynamical systems.

Method: Implements linear, kernel, and deep-learning estimators for both discrete-time evolution operators and continuous-time infinitesimal generators, with scikit-learn compliant API.

Result: A comprehensive open-source library that facilitates dynamical systems analysis, supports reproducible research with benchmark datasets, and integrates with existing ML workflows.

Conclusion: kooplearn fills a gap in the ML ecosystem by providing standardized tools for learning dynamical operators, making advanced dynamical systems analysis accessible to researchers and practitioners.

Abstract: kooplearn is a machine-learning library that implements linear, kernel, and deep-learning estimators of dynamical operators and their spectral decompositions. kooplearn can model both discrete-time evolution operators (Koopman/Transfer) and continuous-time infinitesimal generators. By learning these operators, users can analyze dynamical systems via spectral methods, derive data-driven reduced-order models, and forecast future states and observables. kooplearn’s interface is compliant with the scikit-learn API, facilitating its integration into existing machine learning and data science workflows. Additionally, kooplearn includes curated benchmark datasets to support experimentation, reproducibility, and the fair comparison of learning algorithms. The software is available at https://github.com/Machine-Learning-Dynamical-Systems/kooplearn.

[542] The Bayesian Geometry of Transformer Attention

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.LG

TL;DR: Transformers implement Bayesian inference through geometric mechanisms: residual streams store beliefs, feed-forward networks update posteriors, and attention provides routing. This is verified using controlled “Bayesian wind tunnels” where true posteriors are known.

Details

Motivation: To rigorously verify whether transformers perform Bayesian reasoning in context, overcoming limitations of natural data (lack of analytic posteriors) and large models (memorization confounding reasoning).

Method: Construct “Bayesian wind tunnels” - controlled environments with known true posteriors where memorization is impossible. Test small transformers on two tasks: bijection elimination and Hidden Markov Model state tracking, using geometric diagnostics to analyze mechanisms.

Result: Transformers reproduce Bayesian posteriors with 10^-3-10^-4 bit accuracy, while capacity-matched MLPs fail by orders of magnitude. Transformers implement Bayesian inference through specific geometric mechanisms: residual streams as belief substrate, feed-forward networks for posterior updates, attention for content-addressable routing.

Conclusion: Hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and failure of flat architectures. Bayesian wind tunnels provide foundation for connecting small verifiable systems to reasoning in large language models.

Abstract: Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing \emph{Bayesian wind tunnels} – controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks – bijection elimination and Hidden Markov Model (HMM) state tracking – we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a \emph{frame-precision dissociation} predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.

[543] Post-Training Quantization of OpenPangu Models for Efficient Deployment on Atlas A2

Yilun Luo, Huaqing Zheng, Haoqian Meng, Wenyuan Liu, Peng Zhang

Main category: cs.LG

TL;DR: Huawei’s openPangu-Embedded models optimized for Ascend NPUs using low-bit quantization (INT8 and W4A8) to reduce memory/latency overhead from CoT reasoning while preserving accuracy.

Details

Motivation: Chain-of-Thought (CoT) reasoning in openPangu-Embedded models generates extended reasoning traces that cause substantial memory and latency overheads, making practical deployment on Ascend NPUs challenging.

Method: Introduce a unified low-bit inference framework supporting INT8 (W8A8) and W4A8 quantization, transforming FP16 computations into more efficient integer arithmetic, specifically optimized for openPangu-Embedded models on Atlas A2 NPUs.

Result: INT8 quantization preserves over 90% of FP16 baseline accuracy with 1.5x prefill speedup on Atlas A2. W4A8 quantization significantly reduces memory consumption with moderate accuracy trade-off. Both approaches effectively facilitate efficient CoT reasoning on Ascend NPUs.

Conclusion: Low-bit quantization effectively enables efficient Chain-of-Thought reasoning on Ascend NPUs while maintaining high model fidelity, addressing computational constraints for practical deployment of openPangu-Embedded models.

Abstract: Huawei’s openPangu-Embedded-1B and openPangu-Embedded-7B are variants of the openPangu large language model, designed for efficient deployment on Ascend NPUs. The 7B variant supports three distinct Chain-of-Thought (CoT) reasoning paradigms, namely slow_think, auto_think, and no_think, while the 1B variant operates exclusively in the no_think mode, which employs condensed reasoning for higher efficiency. Although CoT reasoning enhances capability, the generation of extended reasoning traces introduces substantial memory and latency overheads, posing challenges for practical deployment on Ascend NPUs. This paper addresses these computational constraints by leveraging low-bit quantization, which transforms FP16 computations into more efficient integer arithmetic. We introduce a unified low-bit inference framework, supporting INT8 (W8A8) and W4A8 quantization, specifically optimized for openPangu-Embedded models on the Atlas A2. Our comprehensive evaluation on code generation benchmarks (HumanEval and MBPP) demonstrates the efficacy of this approach. INT8 quantization consistently preserves over 90% of the FP16 baseline accuracy and achieves a 1.5x prefill speedup on the Atlas A2. Furthermore, W4A8 quantization significantly reduces memory consumption, albeit with a moderate trade-off in accuracy. These findings collectively indicate that low-bit quantization effectively facilitates efficient CoT reasoning on Ascend NPUs, maintaining high model fidelity.

[544] Geometric Scaling of Bayesian Inference in LLMs

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

Main category: cs.LG

TL;DR: Modern language models preserve geometric structures that enable Bayesian inference, with value representations organizing along an entropy-aligned axis that correlates with predictive uncertainty.

Details

Motivation: To investigate whether the geometric signatures observed in small transformers trained in controlled settings (showing Bayesian inference capabilities) persist in production-grade language models, and to understand the role of this geometry in uncertainty representation.

Method: Analyzed multiple language model families (Pythia, Phi-2, Llama-3, Mistral) to examine geometric properties of value representations. Performed targeted interventions on the entropy-aligned axis in Pythia-410M during in-context learning, comparing effects of removing/perturbing this axis versus random-axis interventions.

Result: Found that last-layer value representations organize along a single dominant axis strongly correlated with predictive entropy. Domain-restricted prompts collapse this structure into low-dimensional manifolds similar to synthetic settings. Interventions on the entropy-aligned axis disrupt local uncertainty geometry, while random-axis interventions leave it intact, but single-layer manipulations don’t proportionally degrade Bayesian-like behavior.

Conclusion: Modern language models preserve the geometric substrate enabling Bayesian inference observed in synthetic settings, organizing approximate Bayesian updates along this substrate. The geometry serves as a privileged readout of uncertainty rather than a singular computational bottleneck.

Abstract: Recent work has shown that small transformers trained in controlled “wind-tunnel’’ settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate – low-dimensional value manifolds and progressively orthogonal keys – that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.

[545] Extreme-value forest fire prediction A study of the Loss Function in an Ordinality Scheme

Nicolas Caron, Christophe Guyeux, Hassan Noura, Benjamin Aynes

Main category: cs.LG

TL;DR: First ordinal classification framework for wildfire severity forecasting in France, showing ordinal-aware loss functions improve prediction of extreme events over standard approaches.

Details

Motivation: Wildfires are highly imbalanced natural hazards with rare extreme events that are challenging to predict but critical for operational decision-making. Current approaches don't adequately address the ordinal nature of severity levels or extreme event prediction.

Method: Introduced ordinal classification framework with neural models comparing standard cross-entropy against ordinal-aware objectives including Weighted Kappa Loss (WKLoss) and proposed probabilistic TDeGPD loss derived from truncated discrete exponentiated Generalized Pareto Distribution. Extensive benchmarking over multiple architectures with real operational data.

Result: Ordinal supervision substantially improves model performance over conventional approaches. WKLoss achieved best overall results with +0.1 IoU gain on most extreme severity classes while maintaining competitive calibration quality. However, performance remains limited for rarest events due to extremely low dataset representation.

Conclusion: Integrating severity ordering, data imbalance considerations, and seasonality risk is crucial for wildfire forecasting systems. Future work should incorporate seasonal dynamics and uncertainty information to improve extreme-event prediction reliability.

Abstract: Wildfires are highly imbalanced natural hazards in both space and severity, making the prediction of extreme events particularly challenging. In this work, we introduce the first ordinal classification framework for forecasting wildfire severity levels directly aligned with operational decision-making in France. Our study investigates the influence of loss-function design on the ability of neural models to predict rare yet critical high-severity fire occurrences. We compare standard cross-entropy with several ordinal-aware objectives, including the proposed probabilistic TDeGPD loss derived from a truncated discrete exponentiated Generalized Pareto Distribution. Through extensive benchmarking over multiple architectures and real operational data, we show that ordinal supervision substantially improves model performance over conventional approaches. In particular, the Weighted Kappa Loss (WKLoss) achieves the best overall results, with more than +0.1 IoU (Intersection Over Union) gain on the most extreme severity classes while maintaining competitive calibration quality. However, performance remains limited for the rarest events due to their extremely low representation in the dataset. These findings highlight the importance of integrating both severity ordering, data imbalance considerations, and seasonality risk into wildfire forecasting systems. Future work will focus on incorporating seasonal dynamics and uncertainty information into training to further improve the reliability of extreme-event prediction.

[546] Green’s-Function Spherical Neural Operators for Biological Heterogeneity

Hao Tang, Hao Chen, Hao Li, Chao Li

Main category: cs.LG

TL;DR: GSNO is a spherical neural operator framework that uses designable Green’s functions to balance geometric inductive biases with real-world heterogeneity modeling through three specialized operator solutions.

Details

Motivation: Existing spherical deep learning approaches struggle to balance strong spherical geometric inductive biases with the need to model real-world heterogeneity in biological and physical systems.

Method: Introduces Designable Green’s Function (DGF) framework, then proposes Green’s-Function Spherical Neural Operator (GSNO) with three operator solutions: Equivariant Solution for symmetry-consistent modeling, Invariant Solution to eliminate nuisance heterogeneity, and Anisotropic Solution for anisotropic systems like fibers.

Result: GSNO demonstrates superiority across multiple applications including spherical MNIST, Shallow Water Equation, diffusion MRI fiber prediction, cortical parcellation, and molecule structure modeling.

Conclusion: GSNO successfully adapts to real-world heterogeneous systems with nuisance variability and anisotropy while retaining spectral efficiency and spherical geometry.

Abstract: Spherical deep learning has been widely applied to a broad range of real-world problems. Existing approaches often face challenges in balancing strong spherical geometric inductive biases with the need to model real-world heterogeneity. To solve this while retaining spherical geometry, we first introduce a designable Green’s function framework (DGF) to provide new spherical operator solution strategy: Design systematic Green’s functions under rotational group. Based on DGF, to model biological heterogeneity, we propose Green’s-Function Spherical Neural Operator (GSNO) fusing 3 operator solutions: (1) Equivariant Solution derived from Equivariant Green’s Function for symmetry-consistent modeling; (2) Invariant Solution derived from Invariant Green’s Function to eliminate nuisance heterogeneity, e.g., consistent background field; (3) Anisotropic Solution derived from Anisotropic Green’s Function to model anisotropic systems, especially fibers with preferred direction. Therefore, the resulting model, GSNO can adapt to real-world heterogeneous systems with nuisance variability and anisotropy while retaining spectral efficiency. Evaluations on spherical MNIST, Shallow Water Equation, diffusion MRI fiber prediction, cortical parcellation and molecule structure modeling demonstrate the superiority of GSNO.

[547] ReLA: Representation Learning and Aggregation for Job Scheduling with Reinforcement Learning

Zhengyi Kwan, Wei Zhang, Aik Beng Ng, Zhengkui Wang, Simon See

Main category: cs.LG

TL;DR: ReLA is a reinforcement learning scheduler using structured representation learning and aggregation for job scheduling, achieving state-of-the-art performance across various problem scales.

Details

Motivation: Existing job scheduling solutions suffer from long running times or poor schedule quality, especially as problem scale increases, creating a need for more efficient and effective scheduling approaches.

Method: ReLA uses structured representation learning with self-attention and convolution for intra-entity learning, cross-attention for inter-entity learning, and aggregates these representations in a multi-scale architecture to support RL decision-making.

Result: ReLA achieves best makespan in most settings, reducing optimality gap by 13.0% on non-large instances and 78.6% on large-scale instances, with average gaps lowered to 7.3% and 2.1% respectively.

Conclusion: ReLA’s learned representations and aggregation provide strong decision support for RL scheduling, enabling fast job completion and practical real-world applications.

Abstract: Job scheduling is widely used in real-world manufacturing systems to assign ordered job operations to machines under various constraints. Existing solutions remain limited by long running time or insufficient schedule quality, especially when problem scale increases. In this paper, we propose ReLA, a reinforcement-learning (RL) scheduler built on structured representation learning and aggregation. ReLA first learns diverse representations from scheduling entities, including job operations and machines, using two intra-entity learning modules with self-attention and convolution and one inter-entity learning module with cross-attention. These modules are applied in a multi-scale architecture, and their outputs are aggregated to support RL decision-making. Across experiments on small, medium, and large job instances, ReLA achieves the best makespan in most tested settings over the latest solutions. On non-large instances, ReLA reduces the optimality gap of the SOTA baseline by 13.0%, while on large-scale instances it reduces the gap by 78.6%, with the average optimality gaps lowered to 7.3% and 2.1%, respectively. These results confirm that ReLA’s learned representations and aggregation provide strong decision support for RL scheduling, and enable fast job completion and decision-making for real-world applications.

[548] The Geometry of the Pivot: A Note on Lazy Pivoted Cholesky and Farthest Point Sampling

Gil Shabat

Main category: cs.LG

TL;DR: Pivoted Cholesky decomposition for kernel matrices is geometrically equivalent to Farthest Point Sampling in RKHS with implicit Gram-Schmidt orthogonalization.

Details

Motivation: While Pivoted Cholesky is widely used for scaling Gaussian Processes via low-rank kernel approximations, its geometric interpretation in kernel methods remains unclear despite well-known algebraic properties.

Method: The paper provides a geometric interpretation showing that the pivotal selection step corresponds to Farthest Point Sampling using the kernel metric, and the Cholesky factor construction represents implicit Gram-Schmidt orthogonalization in the Reproducing Kernel Hilbert Space.

Result: The authors demonstrate the mathematical equivalence between Pivoted Cholesky decomposition and FPS in RKHS, providing both theoretical derivation and a minimalist Python implementation to connect theory with practice.

Conclusion: The paper bridges the gap between algebraic and geometric understanding of Pivoted Cholesky decomposition for kernel methods, showing it’s essentially Farthest Point Sampling with orthogonalization in the kernel feature space.

Abstract: Low-rank approximations of large kernel matrices are ubiquitous in machine learning, particularly for scaling Gaussian Processes to massive datasets. The Pivoted Cholesky decomposition is a standard tool for this task, offering a computationally efficient, greedy low-rank approximation. While its algebraic properties are well-documented in numerical linear algebra, its geometric intuition within the context of kernel methods often remains obscure. In this note, we elucidate the geometric interpretation of the algorithm within the Reproducing Kernel Hilbert Space (RKHS). We demonstrate that the pivotal selection step is mathematically equivalent to Farthest Point Sampling (FPS) using the kernel metric, and that the Cholesky factor construction is an implicit Gram-Schmidt orthogonalization. We provide a concise derivation and a minimalist Python implementation to bridge the gap between theory and practice.

[549] A Gap Between Decision Trees and Neural Networks

Akash Kumar

Main category: cs.LG

TL;DR: Shallow ReLU networks struggle to approximate axis-aligned decision trees with geometrically simple boundaries due to conflicts between interpretability (geometric simplicity) and accuracy, measured by Radon total variation (RTV) seminorm.

Details

Motivation: To understand the trade-off between interpretability (geometric simplicity of decision boundaries) and accurate approximation of axis-aligned decision trees by shallow neural networks. Decision trees produce rule-based, axis-aligned decision regions, while shallow ReLU networks are typically trained as score models.

Method: Analyze infinite-width, bounded-norm, single-hidden-layer ReLU networks through the Radon total variation (RTV) seminorm, which controls geometric complexity of level sets. Study hard tree indicators and various smoothing surrogates (piecewise-linear ramp, sigmoidal, Gaussian convolution). Construct smooth barrier scores with finite RTV that exactly recover box decision sets.

Result: Hard tree indicators have infinite RTV. Common smoothing surrogates (piecewise-linear and sigmoidal) also have infinite RTV in dimensions d>1. Gaussian convolution yields finite RTV but with exponential dependence on d. Constructed smooth barrier scores with finite RTV can exactly recover box decision sets with polynomial calibration bounds under mild conditions.

Conclusion: There’s a fundamental accuracy-complexity tradeoff when approximating axis-aligned decision trees with shallow neural networks. The paper separates classification (recovering decision sets) from score learning (learning calibrated scores), showing that finite RTV scores can exactly recover box decisions while maintaining geometric simplicity.

Abstract: We study when geometric simplicity of decision boundaries, used here as a notion of interpretability, can conflict with accurate approximation of axis-aligned decision trees by shallow neural networks. Decision trees induce rule-based, axis-aligned decision regions (finite unions of boxes), whereas shallow ReLU networks are typically trained as score models whose predictions are obtained by thresholding. We analyze the infinite-width, bounded-norm, single-hidden-layer ReLU class through the Radon total variation ($\mathrm{R}\mathrm{TV}$) seminorm, which controls the geometric complexity of level sets. We first show that the hard tree indicator $1_A$ has infinite $\mathrm{R}\mathrm{TV}$. Moreover, two natural split-wise continuous surrogates–piecewise-linear ramp smoothing and sigmoidal (logistic) smoothing–also have infinite $\mathrm{R}\mathrm{TV}$ in dimensions $d>1$, while Gaussian convolution yields finite $\mathrm{R}\mathrm{TV}$ but with an explicit exponential dependence on $d$. We then separate two goals that are often conflated: classification after thresholding (recovering the decision set) versus score learning (learning a calibrated score close to $1_A$). For classification, we construct a smooth barrier score $S_A$ with finite $\mathrm{R}\mathrm{TV}$ whose fixed threshold $τ=1$ exactly recovers the box. Under a mild tube-mass condition near $\partial A$, we prove an $L_1(P)$ calibration bound that decays polynomially in a sharpness parameter, along with an explicit $\mathrm{R}\mathrm{TV}$ upper bound in terms of face measures. Experiments on synthetic unions of rectangles illustrate the resulting accuracy–complexity tradeoff and how threshold selection shifts where training lands along it.

cs.MA

[550] AI Agents as Policymakers in Simulated Epidemics

Goshi Aoki, Navid Ghaffarzadegan

Main category: cs.MA

TL;DR: AI agents can model policy decisions in epidemics when given basic domain theory, showing human-like reactive behavior and improved decision quality with systems-level knowledge.

Details

Motivation: AI agents are increasingly used for specialized tasks but their potential as computational models of decision-making in complex social systems remains underexplored, particularly for studying policy decisions during epidemics.

Method: Developed a generative AI agent acting as a city mayor in a simulated SEIR epidemic environment. The agent receives weekly epidemiological updates, evaluates situations, and sets business restriction levels. Equipped with dynamic memory weighting past events by recency, tested in single- and ensemble-agent settings across varying complexity environments.

Result: The agent exhibits human-like reactive behavior (tightening restrictions with rising cases, relaxing with declining risk). Providing brief systems-level knowledge about epidemic dynamics and feedbacks between disease spread and behavioral responses substantially improves decision quality and stability.

Conclusion: Generative AI agents, when situated in structured environments and guided by minimal domain theory, can serve as powerful computational models for studying decision-making and policy design in complex social systems. Theory-informed prompting can shape emergent policy behavior in AI agents.

Abstract: AI agents are increasingly deployed as quasi-autonomous systems for specialized tasks, yet their potential as computational models of decision-making remains underexplored. We develop a generative AI agent to study repetitive policy decisions during an epidemic, embedding the agent, prompted to act as a city mayor, within a simulated SEIR environment. Each week, the agent receives updated epidemiological information, evaluates the evolving situation, and sets business restriction levels. The agent is equipped with a dynamic memory that weights past events by recency and is evaluated in both single- and ensemble-agent settings across environments of varying complexity. Across scenarios, the agent exhibits human-like reactive behavior, tightening restrictions in response to rising cases and relaxing them as risk declines. Crucially, providing the agent with brief systems-level knowledge of epidemic dynamics, highlighting feedbacks between disease spread and behavioral responses, substantially improves decision quality and stability. The results illustrate how theory-informed prompting can shape emergent policy behavior in AI agents. These findings demonstrate that generative AI agents, when situated in structured environments and guided by minimal domain theory, can serve as powerful computational models for studying decision-making and policy design in complex social systems.

[551] From Idea to Co-Creation: A Planner-Actor-Critic Framework for Agent Augmented 3D Modeling

Jin Gao, Saichandu Juluri

Main category: cs.MA

TL;DR: Multi-agent self-reflection framework with Planner-Actor-Critic architecture improves 3D modeling quality over single-prompt approaches through iterative feedback and human supervision.

Details

Motivation: Existing single-prompt agents for 3D modeling directly execute commands via tools like Blender MCP, but lack iterative refinement and quality control mechanisms.

Method: Planner-Actor-Critic architecture where Planner coordinates steps, Actor executes modeling commands, Critic provides iterative feedback, with human users as supervisors and advisors throughout the process.

Result: Improvements in geometric accuracy, aesthetic quality, and task completion rates across diverse 3D modeling scenarios; reduces modeling errors and increases complexity/quality compared to direct single-prompt execution.

Conclusion: Structured agent self-reflection augmented by human oversight produces higher-quality 3D models while maintaining efficient workflow integration through real-time Blender synchronization.

Abstract: We present a framework that extends the Actor-Critic architecture to creative 3D modeling through multi-agent self-reflection and human-in-the-loop supervision. While existing approaches rely on single-prompt agents that directly execute modeling commands via tools like Blender MCP, our approach introduces a Planner-Actor-Critic architecture. In this design, the Planner coordinates modeling steps, the Actor executes them, and the Critic provides iterative feedback, while human users act as supervisors and advisors throughout the process. Through systematic comparison between single-prompt modeling and our reflective multi-agent approach, we demonstrate improvements in geometric accuracy, aesthetic quality, and task completion rates across diverse 3D modeling scenarios. Our evaluation reveals that critic-guided reflection, combined with human supervisory input, reduces modeling errors and increases complexity and quality of the result compared to direct single-prompt execution. This work establishes that structured agent self-reflection, when augmented by human oversight and advisory guidance, produces higher-quality 3D models while maintaining efficient workflow integration through real-time Blender synchronization.

[552] FinDeepForecast: A Live Multi-Agent System for Benchmarking Deep Research Agents in Financial Forecasting

Xiangyu Li, Xuan Yao, Guohao Qi, Fengbin Zhu, Kelvin J. L. Koa, Xiang Yao Ng, Ziyang Liu, Xingyu Ni, Chang Liu, Yonghui Yang, Yang Zhang, Wenjie Wang, Fuli Feng, Chao Wang, Huanbo Luan, Xiaofen Xing, Xiangmin Xu, Tat-Seng Chua, Ke-Wei Huang

Main category: cs.MA

TL;DR: FinDeepForecast is a live evaluation system for Deep Research agents on financial forecasting tasks, showing they outperform baselines but still lack true forward-looking reasoning.

Details

Motivation: There's a gap in comprehensive, live evaluation of Deep Research agents' forecasting performance on real-world, research-oriented tasks in high-stakes domains like finance.

Method: Introduces FinDeepForecast, a live end-to-end multi-agent system with dual-track taxonomy for generating recurrent and non-recurrent forecasting tasks at corporate and macro levels, creating FinDeepForecastBench over 10 weeks across 8 economies and 1,314 companies.

Result: DR agents consistently outperform strong baselines but still fall short of genuine forward-looking financial reasoning. The system generates a comprehensive benchmark evaluating 13 representative methods.

Conclusion: FinDeepForecast provides a consistent framework to facilitate future advancements of DR agents in research-oriented financial forecasting, with publicly available benchmark and leaderboard.

Abstract: Deep Research (DR) Agents powered by advanced Large Language Models (LLMs) have fundamentally shifted the paradigm for completing complex research tasks. Yet, a comprehensive and live evaluation of their forecasting performance on real-world, research-oriented tasks in high-stakes domains (e.g., finance) remains underexplored. We introduce FinDeepForecast, the first live, end-to-end multi-agent system for automatically evaluating DR agents by continuously generating research-oriented financial forecasting tasks. This system is equipped with a dual-track taxonomy, enabling the dynamic generation of recurrent and non-recurrent forecasting tasks at both corporate and macro levels. With this system, we generate FinDeepForecastBench, a weekly evaluation benchmark over a ten-week horizon, encompassing 8 global economies and 1,314 listed companies, and evaluate 13 representative methods. Extensive experiments show that, while DR agents consistently outperform strong baselines, their performance still falls short of genuine forward-looking financial reasoning. We expect the proposed FinDeepForecast system to consistently facilitate future advancements of DR agents in research-oriented financial forecasting tasks. The benchmark and leaderboard are publicly available on the OpenFinArena Platform.

[553] A Novel Convex Layers Strategy for Circular Formation in Multi-Agent Systems

Gautam Kumar, Ashwini Ratnoo

Main category: cs.MA

TL;DR: One-shot collision-free solution for distributing agents on a circular periphery using convex layers and search space regions defined by initial positions only.

Details

Motivation: To solve the conflict-free circular distribution problem without requiring continuous computation or communication, using only initial agent positions to determine final goal assignments.

Method: Constructs convex layers (nested convex polygons) from initial agent positions, defines search space regions for each agent as areas between lines normal to supporting edges, and assigns unique goal positions within these search spaces at initial time.

Result: Demonstrates effective collision-free distribution through illustrative examples and extensive Monte-Carlo studies with various practical attributes.

Conclusion: Presents a novel one-shot solution to circular distribution that requires only initial positions and guarantees collision-free paths, contrasting with existing methods that need ongoing computation.

Abstract: This article considers the problem of conflict-free distribution of point-sized agents on a circular periphery encompassing all agents. The two key elements of the proposed policy include the construction of a set of convex layers (nested convex polygons) using the initial positions of the agents, and a novel search space region for each of the agents. The search space for an agent on a convex layer is defined as the region enclosed between the lines passing through the agent’s position and normal to its supporting edges. Guaranteeing collision-free paths, a goal assignment policy designates a unique goal position within the search space of an agent at the initial time itself, requiring no further computation thereafter. In contrast to the existing literature, this work presents a one-shot, collision-free solution to the circular distribution problem by utilizing only the initial positions of the agents. Illustrative examples and extensive Monte-Carlo studies considering various practical attributes demonstrate the effectiveness of the proposed method.

[554] Agent+P: Guiding UI Agents via Symbolic Planning

Shang Ma, Xusheng Xiao, Yanfang Ye

Main category: cs.MA

TL;DR: AGENT+P is a framework that uses symbolic planning and UI transition graphs to guide LLM-based UI agents, reducing hallucinations and improving efficiency in UI automation tasks.

Details

Motivation: LLM-based UI agents often hallucinate in long-horizon tasks due to lack of understanding of global UI transition structure, leading to inefficient exploration and failure in complex automation scenarios.

Method: Models app’s UI transition structure as a UI Transition Graph (UTG), reformulates UI automation as pathfinding on UTG, uses symbolic planner to generate provably correct optimal high-level plans, and integrates as plug-and-play framework with existing UI agents.

Result: On AndroidWorld benchmark, AGENT+P improves success rates of state-of-the-art UI agents by up to 14.31% and reduces action steps by 37.70%.

Conclusion: AGENT+P effectively addresses hallucination issues in LLM-based UI agents by leveraging symbolic planning and UTG modeling, significantly improving both success rates and efficiency in UI automation tasks.

Abstract: Large Language Model (LLM)-based UI agents show great promise for UI automation but often hallucinate in long-horizon tasks due to their lack of understanding of the global UI transition structure. To address this, we introduce AGENT+P, a novel framework that leverages symbolic planning to guide LLM-based UI agents. Specifically, we model an app’s UI transition structure as a UI Transition Graph (UTG), which allows us to reformulate the UI automation task as a pathfinding problem on the UTG. This further enables an off-the-shelf symbolic planner to generate a provably correct and optimal high-level plan, preventing the agent from redundant exploration and guiding the agent to achieve the automation goals. AGENT+P is designed as a plug-and-play framework to enhance existing UI agents. Evaluation on the AndroidWorld benchmark demonstrates that AGENT+P improves the success rates of state-of-the-art UI agents by up to 14.31% and reduces the action steps by 37.70%.

cs.MM

eess.AS

[555] Latent-Level Enhancement with Flow Matching for Robust Automatic Speech Recognition

Da-Hee Yang, Joon-Hyuk Chang

Main category: eess.AS

TL;DR: FM-Refiner: A plug-and-play flow matching module that refines distorted latent representations from ASR encoder outputs to improve noise-robust speech recognition without fine-tuning ASR parameters.

Details

Motivation: Traditional speech enhancement at waveform level doesn't always improve ASR due to residual distortions and mismatches with ASR encoder's latent space. Need complementary approach at latent level.

Method: Proposes FM-Refiner module that operates on output latents of pretrained CTC-based ASR encoder. Uses flow matching to map imperfect latents (from noisy or enhanced-but-imperfect speech) toward clean counterparts. Applied only at inference without fine-tuning ASR parameters.

Result: FM-Refiner consistently reduces word error rate when applied to noisy inputs and when combined with conventional speech enhancement front-ends.

Conclusion: Latent-level refinement via flow matching provides lightweight, effective complement to existing speech enhancement approaches for robust ASR.

Abstract: Noise-robust automatic speech recognition (ASR) has been commonly addressed by applying speech enhancement (SE) at the waveform level before recognition. However, speech-level enhancement does not always translate into consistent recognition improvements due to residual distortions and mismatches with the latent space of the ASR encoder. In this letter, we introduce a complementary strategy termed latent-level enhancement, where distorted representations are refined during ASR inference. Specifically, we propose a plug-and-play Flow Matching Refinement module (FM-Refiner) that operates on the output latents of a pretrained CTC-based ASR encoder. Trained to map imperfect latents-either directly from noisy inputs or from enhanced-but-imperfect speech-toward their clean counterparts, the FM-Refiner is applied only at inference, without fine-tuning ASR parameters. Experiments show that FM-Refiner consistently reduces word error rate, both when directly applied to noisy inputs and when combined with conventional SE front-ends. These results demonstrate that latent-level refinement via flow matching provides a lightweight and effective complement to existing SE approaches for robust ASR.

[556] LLMs-Integrated Automatic Hate Speech Recognition Using Controllable Text Generation Models

Ryutaro Oshima, Yuya Hosoda, Youji Iiguni

Main category: eess.AS

TL;DR: ASR model for hate speech detection using LLMs with simultaneous transcription and censorship, enhanced by LLM-generated dataset with CoT prompting and curriculum learning.

Details

Motivation: Need for effective hate speech detection in ASR systems while preventing exposure to harmful content, with limited annotated hate speech datasets available.

Method: Integrates ASR encoder with LLM decoder for simultaneous tasks. Uses CoT prompting with cultural context to generate text samples, converts to speech via TTS, filters using text classification models, and applies curriculum learning with adjustable hate level thresholds.

Result: Achieves 58.6% masking accuracy for hate-related words, surpassing previous baselines. Curriculum training improves efficiency for both transcription and censorship tasks.

Conclusion: Proposed method effectively combines ASR and LLMs for hate speech detection, with curriculum learning and filtered dataset generation addressing data scarcity and improving performance.

Abstract: This paper proposes an automatic speech recognition (ASR) model for hate speech using large language models (LLMs). The proposed method integrates the encoder of the ASR model with the decoder of the LLMs, enabling simultaneous transcription and censorship tasks to prevent the exposure of harmful content. Instruction tuning of the LLM to mask hate-related words with specific tokens requires an annotated hate speech dataset, which is limited. We generate text samples using an LLM with the Chain-of-Thought (CoT) prompting technique guided by cultural context and examples and then convert them into speech samples using a text-to-speech (TTS) system. However, some of them contain non-hate speech samples with hate-related words, which degrades the censorship performance. This paper filters the samples which text classification models correctly label as hate content. By adjusting the threshold for the number of correct answer models, we can control the level of hate in the generated dataset, allowing us to train the LLMs through curriculum learning in a gradual manner. Experimental results show that the proposed method achieves a masking accuracy of 58.6% for hate-related words, surpassing previous baselines. We also confirm that the curriculum training contributes to the efficiency of both transcription and censorship tasks.

[557] Gradient-based Optimisation of Modulation Effects

Alistair Carson, Alec Wright, Stefan Bilbao

Main category: eess.AS

TL;DR: A differentiable DSP framework for modeling guitar modulation effects (phasers, flangers, chorus) with zero-latency inference, using time-frequency training and low-frequency weighted loss functions to avoid local minima in delay time optimization.

Details

Motivation: Existing machine learning approaches for analog modulation effect emulation are either limited to single effect types or suffer from high computational cost/latency compared to traditional digital implementations.

Method: A differentiable digital signal processing framework trained in time-frequency domain but operating in time-domain at inference (zero latency). Uses low-frequency weighting of loss functions to avoid convergence to local minima when learning delay times.

Result: The model produces sound output that is perceptually indistinguishable from analog reference units in some cases, though challenges remain for effects with long delay times and feedback.

Conclusion: The framework successfully models multiple modulation effects with zero latency, but further work is needed to handle effects with long delay times and feedback structures.

Abstract: Modulation effects such as phasers, flangers and chorus effects are heavily used in conjunction with the electric guitar. Machine learning based emulation of analog modulation units has been investigated in recent years, but most methods have either been limited to one class of effect or suffer from a high computational cost or latency compared to canonical digital implementations. Here, we build on previous work and present a framework for modelling flanger, chorus and phaser effects based on differentiable digital signal processing. The model is trained in the time-frequency domain, but at inference operates in the time-domain, requiring zero latency. We investigate the challenges associated with gradient-based optimisation of such effects, and show that low-frequency weighting of loss functions avoids convergence to local minima when learning delay times. We show that when trained against analog effects units, sound output from the model is in some cases perceptually indistinguishable from the reference, but challenges still remain for effects with long delay times and feedback.

[558] TellWhisper: Tell Whisper Who Speaks When

Yifan Hu, Peiji Yang, Zhisheng Wang, Yicheng Zhong, Rui Liu

Main category: eess.AS

TL;DR: TellWhisper: A unified framework for multi-speaker ASR that jointly models speaker identity and temporal information within the speech encoder using time-speaker rotary positional encoding (TS-RoPE) and hyperbolic speaker classification (Hyper-SD).

Details

Motivation: Existing multi-speaker ASR approaches decouple temporal modeling ("when") and speaker modeling ("who"), which is brittle under rapid turn-taking and overlapping speech, leading to degraded performance. Current methods either inject speaker cues before encoding (causing irreversible information loss) or fuse identity after encoding (entangling acoustic content with speaker identity).

Method: 1) TS-RoPE (Time-Speaker Rotary Positional Encoding): Derives time coordinates from frame indices and speaker coordinates from speaker activity and pause cues, using region-specific rotation angles to capture per-speaker continuity, speaker-turn transitions, and state dynamics. 2) Hyper-SD: Casts speaker classification in hyperbolic space to enhance inter-class separation and refine frame-level speaker activity estimates.

Result: Extensive experiments demonstrate the effectiveness of the proposed approach, showing improved performance in multi-speaker ASR tasks, particularly under challenging conditions like rapid turn-taking and overlapping speech.

Conclusion: TellWhisper provides a unified framework that jointly models speaker identity and temporal information within the speech encoder, addressing limitations of existing decoupled approaches and improving performance in multi-speaker ASR scenarios.

Abstract: Multi-speaker automatic speech recognition (MASR) aims to predict ‘‘who spoke when and what’’ from multi-speaker speech, a key technology for multi-party dialogue understanding. However, most existing approaches decouple temporal modeling and speaker modeling when addressing ‘‘when’’ and ‘‘who’’: some inject speaker cues before encoding (e.g., speaker masking), which can cause irreversible information loss; others fuse identity by mixing speaker posteriors after encoding, which may entangle acoustic content with speaker identity. This separation is brittle under rapid turn-taking and overlapping speech, often leading to degraded performance. To address these limitations, we propose TellWhisper, a unified framework that jointly models speaker identity and temporal within the speech encoder. Specifically, we design TS-RoPE, a time-speaker rotary positional encoding: time coordinates are derived from frame indices, while speaker coordinates are derived from speaker activity and pause cues. By applying region-specific rotation angles, the model explicitly captures per-speaker continuity, speaker-turn transitions, and state dynamics, enabling the attention mechanism to simultaneously attend to ‘‘when’’ and ‘‘who’’. Moreover, to estimate frame-level speaker activity, we develop Hyper-SD, which casts speaker classification in hyperbolic space to enhance inter-class separation and refine speaker-activity estimates. Extensive experiments demonstrate the effectiveness of the proposed approach.

eess.IV

[559] Towards a Unified Theoretical Framework for Self-Supervised MRI Reconstruction

Siying Xu, Kerstin Hammernik, Daniel Rueckert, Sergios Gatidis, Thomas Küstner

Main category: eess.IV

TL;DR: UNITS provides a unified theoretical framework for self-supervised MRI reconstruction that achieves supervised performance without needing fully-sampled reference data.

Details

Motivation: MRI acquisition times are too long, and while deep learning helps, supervised methods require fully-sampled reference data that's hard to obtain. Existing self-supervised approaches are fragmented and lack theoretical foundation.

Method: UNITS unifies prior self-supervised strategies into a common framework, introduces sampling stochasticity and flexible data utilization, and provides theoretical guarantees that SSL can match supervised performance.

Result: The framework improves network generalization under out-of-domain distributions, stabilizes training, and establishes a foundation for interpretable and clinically applicable self-supervised MRI reconstruction.

Conclusion: UNITS serves as both a theoretical foundation and practical paradigm for self-supervised MRI reconstruction, enabling clinically viable accelerated imaging without requiring fully-sampled reference data.

Abstract: The demand for high-resolution, non-invasive imaging continues to drive innovation in magnetic resonance imaging (MRI), yet prolonged acquisition times hinder accessibility and real-time applications. While deep learning-based reconstruction methods have accelerated MRI, their predominant supervised paradigm depends on fully-sampled reference data that are challenging to acquire. Recently, self-supervised learning (SSL) approaches have emerged as promising alternatives, but most are empirically designed and fragmented. Therefore, we introduce UNITS (Unified Theory for Self-supervision), a general framework for self-supervised MRI reconstruction. UNITS unifies prior SSL strategies within a common formalism, enabling consistent interpretation and systematic benchmarking. We prove that SSL can achieve the same expected performance as supervised learning. Under this theoretical guarantee, we introduce sampling stochasticity and flexible data utilization, which improve network generalization under out-of-domain distributions and stabilize training. Together, these contributions establish UNITS as a theoretical foundation and a practical paradigm for interpretable, generalizable, and clinically applicable self-supervised MRI reconstruction.

[560] Scalable neural pushbroom architectures for real-time denoising of hyperspectral images onboard satellites

Ziyao Yi, Davide Piccinini, Diego Valsesia, Tiziano Bianchi, Enrico Magli

Main category: eess.IV

TL;DR: Proposes a neural network design for onboard hyperspectral image denoising on satellites with three key objectives: high-quality inference with low complexity, dynamic power scalability, and fault tolerance.

Details

Motivation: Next-gen Earth observation satellites need intelligent models onboard to reduce latency for time-critical applications. Hyperspectral imagers pose unique challenges not addressed by traditional computer vision: high-quality inference with low complexity, dynamic power scalability, and radiation-induced fault tolerance.

Method: Proposes mixture of denoisers resilient to radiation faults with power scaling capability. Each denoiser uses innovative causal architecture processing images line-by-line with memory of past lines to match pushbroom sensor acquisition and limit memory requirements.

Result: Architecture runs in real-time on low-power hardware (processes one line during next line acquisition), provides competitive denoising quality vs. more complex state-of-the-art models, and enables design space with tradeoffs between power scalability, fault tolerance, and denoising quality.

Conclusion: Proposed neural network design successfully addresses three competing objectives for onboard hyperspectral image denoising: high-quality inference with low complexity, dynamic power scalability, and fault tolerance, enabling real-time processing on satellite hardware.

Abstract: The next generation of Earth observation satellites will seek to deploy intelligent models directly onboard the payload in order to minimize the latency incurred by the transmission and processing chain of the ground segment, for time-critical applications. Designing neural architectures for onboard execution, particularly for satellite-based hyperspectral imagers, poses novel challenges due to the unique constraints of this environment and imaging system that are largely unexplored by the traditional computer vision literature. In this paper, we show that this setting requires addressing three competing objectives, namely high-quality inference with low complexity, dynamic power scalability and fault tolerance. We focus on the problem of hyperspectral image denoising, which is a critical task to enable effective downstream inference, and highlights the constraints of the onboard processing scenario. We propose a neural network design that addresses the three aforementioned objectives with several novel contributions. In particular, we propose a mixture of denoisers that can be resilient to radiation-induced faults as well as allowing for time-varying power scaling. Moreover, each denoiser employs an innovative architecture where an image is processed line-by-line in a causal way, with a memory of past lines, in order to match the acquisition process of pushbroom hyperspectral sensors and greatly limit memory requirements. We show that the proposed architecture can run in real-time, i.e., process one line in the time it takes to acquire the next one, on low-power hardware and provide competitive denoising quality with respect to significantly more complex state-of-the-art models. We also show that the power scalability and fault tolerance objectives provide a design space with multiple tradeoffs between those properties and denoising quality.

[561] Spacecube: A fast inverse hyperspectral georectification system

Thomas P. Watson, Eddie L. Jacobs

Main category: eess.IV

TL;DR: Spacecube is a fast OpenGL-based hyperspectral georectification pipeline that eliminates artifacts and operates faster than real-time.

Details

Motivation: Traditional direct georectification for hyperspectral aerial data is slow and prone to artifacts, limiting efficient data processing and analysis.

Method: Developed Spacecube program implementing a complete hyperspectral georectification pipeline using OpenGL graphics programming with a fast inverse georectification technique.

Result: Spacecube operates substantially faster than real-time, eliminates pixel coverage artifacts, and enables high quality interactive viewing, data exploration, and export.

Conclusion: Spacecube provides an efficient, artifact-free solution for hyperspectral georectification, with source code released publicly for community use.

Abstract: Hyperspectral cameras provide numerous advantages in terms of the utility of the data captured. They capture hundreds of data points per sample (pixel) instead of only the few of RGB or multispectral camera systems. Aerial systems sense such data remotely, but the data must be georectified to produce consistent images before analysis. We find the traditional direct georectification method to be slow, and it is prone to artifacts. To address its downsides, we propose Spacecube, a program that implements a complete hyperspectral georectification pipeline, including our own fast inverse georectification technique, using OpenGL graphics programming technologies. Spacecube operates substantially faster than real-time and eliminates pixel coverage artifacts. It facilitates high quality interactive viewing, data exploration, and export of final products. We release Spacecube’s source code publicly for the community to use.

[562] Federated Learning: A new frontier in the exploration of multi-institutional medical imaging data

Dominika Ciupek, Maciej Malawski, Tomasz Pieciak

Main category: eess.IV

TL;DR: A comprehensive review of federated learning in medical imaging that enables decentralized training of deep learning models while preserving data privacy across multiple institutions.

Details

Motivation: Deep learning in medical imaging requires large datasets, but data sharing is hindered by privacy concerns, ethical agreements, and heterogeneity across institutions (different hardware, protocols, operators). Federated learning addresses these challenges by enabling collaborative model training without centralized data collection.

Method: The paper reviews federated learning principles and algorithms for medical imaging, including aggregation methods, specialized learning algorithms for handling data/model heterogeneity, privacy-preserving techniques, and frameworks for FL system implementation.

Result: The review provides a comprehensive analysis of FL approaches that enable training globally generalized models while maintaining data privacy at each institution, addressing challenges like heterogeneity, privacy attacks, and resource variability.

Conclusion: Federated learning offers a promising solution for collaborative medical imaging research while preserving data privacy, though challenges remain in handling heterogeneity, ensuring security, and optimizing system efficiency. Future directions include improved frameworks and real-world implementations.

Abstract: Artificial intelligence has transformed the perspective of medical imaging, leading to a genuine technological revolution in modern computer-assisted healthcare systems. However, ubiquitously featured deep learning (DL) systems require access to a considerable amount of data, facilitating proper knowledge extraction and generalization. Access to such extensive resources may be hindered due to the time and effort required to convey ethical agreements, set up and carry the acquisition procedures through, and manage the datasets adequately with a particular emphasis on proper anonymization. One of the pivotal challenges in the DL field is data integration from various sources acquired using different hardware vendors, diverse acquisition protocols, experimental setups, and even inter-operator variabilities. In this paper, we review the federated learning (FL) concept that fosters the integration of large-scale heterogeneous datasets from multiple institutions in training DL models. In contrast to a centralized approach, the decentralized FL procedure promotes training DL models while preserving data privacy at each institution involved. We formulate the FL principle and comprehensively review general and specialized medical imaging aggregation and learning algorithms, enabling the generation of a globally generalized model. We meticulously go through the challenges in constructing FL-based systems, such as data and model heterogeneities across the institutions, resilience to potential attacks on data privacy, and the variability in computational and communication resources among the entangled sites that might induce efficiency issues of the entire system. Finally, we explore the up-to-date open frameworks for rapid FL-based algorithm prototyping, comprehensively present real-world implementations of FL systems and shed light on future directions in this intensively growing field.

[563] DermaCon-IN: A Multi-concept Annotated Dermatological Image Dataset of Indian Skin Disorders for Clinical AI Research

Shanawaj S Madarkar, Mahajabeen Madarkar, Madhumitha Venkatesh, Deepanshu Bansal, Teli Prakash, Konda Reddy Mopuri, Vinaykumar MV, KVL Sathwika, Adarsh Kasturi, Gandla Dilip Raj, PVN Supranitha, Harsh Udai

Main category: eess.IV

TL;DR: DermaCon-IN is a new dermatology dataset from South India with 5,450 clinical images from 3,002 patients, annotated with 245 diagnoses using a hierarchical taxonomy, designed to address limitations in existing datasets and advance equitable AI for dermatology.

Details

Motivation: Current dermatology AI development is hindered by datasets that fail to capture real-world clinical and demographic complexity, including region-specific disease distributions, skin tone variation, and underrepresentation of non-Western outpatient populations.

Method: Prospectively curated dataset from outpatient clinics in South India with board-certified dermatologist annotations using a hierarchical, aetiology-based taxonomy adapted from Rook’s classification. Benchmarking of various architectures including convolutional models (ResNet, DenseNet, EfficientNet), transformer-based models (ViT, MaxViT, Swin), and Concept Bottleneck Models.

Result: Established baseline performance for various AI architectures on the DermaCon-IN dataset, exploring integration of anatomical and concept-level cues to guide development of interpretable and clinically realistic models.

Conclusion: DermaCon-IN provides a scalable and representative foundation for advancing dermatology AI, addressing current dataset limitations and supporting development of more equitable and robust diagnostic models.

Abstract: Artificial intelligence is poised to augment dermatological care by enabling scalable image-based diagnostics. Yet, the development of robust and equitable models remains hindered by datasets that fail to capture the clinical and demographic complexity of real-world practice. This complexity stems from region-specific disease distributions, wide variation in skin tones, and the underrepresentation of outpatient scenarios from non-Western populations. We introduce DermaCon-IN, a prospectively curated dermatology dataset comprising 5,450 clinical images from 3,002 patients across outpatient clinics in South India. Each image is annotated by board-certified dermatologists with 245 distinct diagnoses, structured under a hierarchical, aetiology-based taxonomy adapted from Rook’s classification. The dataset captures a wide spectrum of dermatologic conditions and tonal variation commonly seen in Indian outpatient care. We benchmark a range of architectures, including convolutional models (ResNet, DenseNet, EfficientNet), transformer-based models (ViT, MaxViT, Swin), and Concept Bottleneck Models to establish baseline performance and explore how anatomical and concept-level cues may be integrated. These results are intended to guide future efforts toward interpretable and clinically realistic models. DermaCon-IN provides a scalable and representative foundation for advancing dermatology AI.

[564] Leveraging Clinical Text and Class Conditioning for 3D Prostate MRI Generation

Emerson P. Grabke, Babak Taati, Masoom A. Haider

Main category: eess.IV

TL;DR: CCELLA: A novel dual-head conditioning approach for latent diffusion models that uses free-text clinical reports and radiology classification to generate high-quality synthetic medical images with limited data, improving downstream classifier performance.

Details

Motivation: Current medical LDMs face limitations: they rely on short-prompt text encoders, nonmedical LDMs, or large data volumes, which restrict performance and scientific accessibility. There's a need for better conditioning approaches that work with limited medical data.

Method: Proposes CCELLA (Class-Conditioned Efficient Large Language model Adapter) - a dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. Also introduces a data-efficient LDM pipeline with a joint loss function. Evaluated on 3D prostate MRI data.

Result: Achieved 3D FID score of 0.025 on size-limited 3D prostate MRI dataset, significantly outperforming foundation model (FID 0.070). Synthetic images improved prostate cancer classifier accuracy from 69% to 74%. Classifier trained solely on synthetic images performed comparably to real image training.

Conclusion: CCELLA improves both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Enables radiology report and class-conditioned LDM training for high-quality medical image synthesis, enhancing LDM performance and scientific accessibility.

Abstract: Objective: Latent diffusion models (LDM) could alleviate data scarcity challenges affecting machine learning development for medical imaging. However, medical LDM strategies typically rely on short-prompt text encoders, nonmedical LDMs, or large data volumes. These strategies can limit performance and scientific accessibility. We propose a novel LDM conditioning approach to address these limitations. Methods: We propose Class-Conditioned Efficient Large Language model Adapter (CCELLA), a novel dual-head conditioning approach that simultaneously conditions the LDM U-Net with free-text clinical reports and radiology classification. We also propose a data-efficient LDM pipeline centered around CCELLA and a proposed joint loss function. We first evaluate our method on 3D prostate MRI against state-of-the-art. We then augment a downstream classifier model training dataset with synthetic images from our method. Results: Our method achieves a 3D FID score of 0.025 on a size-limited 3D prostate MRI dataset, significantly outperforming a recent foundation model with FID 0.070. When training a classifier for prostate cancer prediction, adding synthetic images generated by our method during training improves classifier accuracy from 69% to 74% and outperforms classifiers trained on images generated by prior state-of-the-art. Classifier training solely on our method’s synthetic images achieved comparable performance to real image training. Conclusion: We show that our method improved both synthetic image quality and downstream classifier performance using limited data and minimal human annotation. Significance: The proposed CCELLA-centric pipeline enables radiology report and class-conditioned LDM training for high-quality medical image synthesis given limited data volume and human data annotation, improving LDM performance and scientific accessibility.

[565] Comparative Analysis of Binarization Methods For Medical Image Hashing On Odir Dataset

Nedim Muzoglu

Main category: eess.IV

TL;DR: SDH outperforms LSH, ITQ, and KSH on ODIR dataset with 0.9184 mAP@100 using only 32-bit codes, achieving competitive accuracy with fewer bits than prior methods.

Details

Motivation: To evaluate and compare different binarization methods for medical image retrieval, aiming to find the most effective approach that balances accuracy, storage efficiency, and computational efficiency for practical applications like device inventory management.

Method: Evaluated four binarization methods (LSH, ITQ, KSH, SDH) on the ODIR dataset using deep feature embeddings, comparing their performance with different bit lengths and benchmarking against prior studies.

Result: SDH achieved the best performance with 0.9184 mAP@100 using only 32-bit codes, outperforming LSH, ITQ, and KSH. Despite using significantly fewer bits than prior studies (32 vs 48-256 bits), SDH achieved competitive accuracy close to state-of-the-art results.

Conclusion: SDH is the most effective binarization method among those tested, offering an optimal balance of accuracy, storage efficiency, and computational efficiency for medical image retrieval applications.

Abstract: In this study, we evaluated four binarization methods. Locality-Sensitive Hashing (LSH), Iterative Quantization (ITQ), Kernel-based Supervised Hashing (KSH), and Supervised Discrete Hashing (SDH) on the ODIR dataset using deep feature embeddings. Experimental results show that SDH achieved the best performance, with an mAP@100 of 0.9184 using only 32-bit codes, outperforming LSH, ITQ, and KSH. Compared with prior studies, our method proved highly competitive: Fang et al. reported 0.7528 (Fundus-iSee, 48 bits) and 0.8856 (ASOCT-Cataract, 48 bits), while Wijesinghe et al. achieved 94.01 (KVASIR, 256 bits). Despite using significantly fewer bits, our SDH-based framework reached retrieval accuracy close to the state-of-the-art. These findings demonstrate that SDH is the most effective approach among those tested, offering a practical balance of accuracy, storage, and efficiency for medical image retrieval and device inventory management.

Today’s Research Highlights

Table of Contents

cs.CL

[1] MedPI: Evaluating AI Systems in Medical Patient-facing Interactions

[2] RAGVUE: A Diagnostic View for Explainable and Automated Evaluation of Retrieval-Augmented Generation

[3] V-FAT: Benchmarking Visual Fidelity Against Text-bias

[4] Automatic Construction of Chinese Verb Collostruction Database

[5] POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering

[6] Attribute-Aware Controlled Product Generation with LLMs for E-commerce

[7] Collective Narrative Grounding: Community-Coordinated Data Contributions to Improve Local AI Systems

[8] TeleTables: A Benchmark for Large Language Models in Telecom Table Interpretation

[9] FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

[10] STDD:Spatio-Temporal Dynamics-Driven Token Refinement in Diffusion Language Models

[11] Enhancing Admission Inquiry Responses with Fine-Tuned Models and Retrieval-Augmented Generation

[12] WESR: Scaling and Evaluating Word-level Event-Speech Recognition

[13] Ideology as a Problem: Lightweight Logit Steering for Annotator-Specific Alignment in Social Media Analysis

[14] LLMs for Explainable Business Decision-Making: A Reinforcement Learning Fine-Tuning Approach

[15] A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction

[16] Leveraging Language Models and RAG for Efficient Knowledge Discovery in Clinical Environments

[17] Complexity Agnostic Recursive Decomposition of Thoughts

[18] Qwerty AI: Explainable Automated Age Rating and Content Safety Assessment for Russian-Language Screenplays

[19] TrueBrief: Faithful Summarization through Small Language Models

[20] AnimatedLLM: Explaining LLMs with Interactive Visualizations

[21] From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning

[22] RIGOURATE: Quantifying Scientific Exaggeration with Evidence-Aligned Claim Evaluation

[23] Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties

[24] Disco-RAG: Discourse-Aware Retrieval-Augmented Generation

[25] MiJaBench: Revealing Minority Biases in Large Language Models via Hate Speech Jailbreaking

[26] ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models

[27] Interpreting Transformers Through Attention Head Intervention

[28] Gavel: Agent Meets Checklist for Evaluating LLMs on Long-Context Legal Summarization

[29] Accommodation and Epistemic Vigilance: A Pragmatic Account of Why LLMs Fail to Challenge Harmful Beliefs

[30] Learning to Simulate Human Dialogue

[31] Merging Triggers, Breaking Backdoors: Defensive Poisoning for Instruction-Tuned Language Models

[32] Users Mispredict Their Own Preferences for AI Writing Assistance

[33] Beyond Static Summarization: Proactive Memory Extraction for LLM Agents

[34] Concept Tokens: Learning Behavioral Embeddings Through Concept Definitions

[35] SampoNLP: A Self-Referential Toolkit for Morphological Analysis of Subword Tokenizers

[36] LinguaGame: A Linguistically Grounded Game-Theoretic Paradigm for Multi-Agent Dialogue Generation

[37] GRACE: Reinforcement Learning for Grounded Response and Abstention under Contextual Evidence

[38] BanglaLorica: Design and Evaluation of a Robust Watermarking Algorithm for Large Language Models in Bangla Text Generation

[39] Identifying Good and Bad Neurons for Task-Level Controllable LLMs

[40] FeedEval: Pedagogically Aligned Evaluation of LLM-Generated Essay Feedback

[41] Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization

[42] THaLLE-ThaiLLM: Domain-Specialized Small LLMs for Finance and Thai – Technical Report

[43] On the Limitations of Rank-One Model Editing in Answering Multi-hop Questions

[44] When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation

[45] Character-R1: Enhancing Role-Aware Reasoning in Role-Playing Agents via RLVR

[46] From National Curricula to Cultural Awareness: Constructing Open-Ended Culture-Specific Question Answering Dataset

[47] MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

[48] SpeechMedAssist: Efficiently and Effectively Adapting Speech Language Models for Medical Consultation

[49] CRANE: Causal Relevance Analysis of Language-Specific Neurons in Multilingual Large Language Models

[50] ToolGate: Contract-Grounded and Verified Tool Execution for LLMs

[51] See, Explain, and Intervene: A Few-Shot Multimodal Agent Framework for Hateful Meme Moderation

[52] Thunder-KoNUBench: A Corpus-Aligned Benchmark for Korean Negation Understanding

[53] PRISM: A Unified Framework for Post-Training LLMs Without Verifiable Rewards

[54] Prior-Informed Zeroth-Order Optimization with Adaptive Direction Alignment for Memory-Efficient LLM Fine-Tuning

[55] DSC2025 – ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs

[56] Fame Fades, Nature Remains: Disentangling the Character Identity of Role-Playing Agents

[57] Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

[58] Automatic Classifiers Underdetect Emotions Expressed by Men

[59] AM$^3$Safety: Towards Data Efficient Alignment of Multi-modal Multi-turn Safety for MLLMs

[60] RiskAtlas: Exposing Domain-Specific Risks in LLMs through Knowledge-Graph-Guided Harmful Prompt Generation

[61] Tool-MAD: A Multi-Agent Debate Framework for Fact Verification with Diverse Tool Augmentation and Adaptive Retrieval

[62] PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks

[63] Differential syntactic and semantic encoding in LLMs

[64] Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence

[65] LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal

[66] NC2C: Automated Convexification of Generic Non-Convex Optimization Problems

[67] Belief in Authority: Impact of Authority in Multi-Agent Evaluation Framework

[68] When AI Settles Down: Late-Stage Stability as a Signature of AI-Generated Text Detection

[69] RAAR: Retrieval Augmented Agentic Reasoning for Cross-Domain Misinformation Detection

[70] Token Maturation: Autoregressive Language Generation via Continuous Token Dynamics

[71] MisSpans: Fine-Grained False Span Identification in Cross-Domain Fake News

[72] A Navigational Approach for Comprehensive RAG via Traversal over Proposition Graphs

[73] EvolSQL: Structure-Aware Evolution for Scalable Text-to-SQL Data Synthesis

[74] Mind2Report: A Cognitive Deep Research Agent for Expert-Level Commercial Report Synthesis

[75] CuMA: Aligning LLMs with Sparse Cultural Values via Demographic-Aware Mixture of Adapters

[76] Faithful Summarisation under Disagreement via Belief-Level Aggregation

[77] Can AI-Generated Persuasion Be Detected? Persuaficial Benchmark and AI vs. Human Linguistic Differences