Daily arXiv Papers - 2026-01-12

AI-enhanced summaries of 23 research papers from arXiv

Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Closing the Modality Reasoning Gap for Speech Large Language Models

Chaoren Wang, Heng Lu, Xueyao Zhang, Shujie Liu, Yan Lu, Jinyu Li, Zhizheng Wu

Main category: cs.CL

TL;DR: TARS is a reinforcement learning framework that reduces the modality reasoning gap between speech and text in large language models by aligning speech- and text-conditioned trajectories using representation and behavior alignment.

Motivation: Speech LLMs show significantly weaker reasoning performance on speech inputs compared to text inputs, which is attributed to representational drift across Transformer layers and behavior deviations in long-chain reasoning.

Method: TARS uses reinforcement learning with asymmetric reward design to align text-conditioned and speech-conditioned trajectories. It employs two complementary signals: representation alignment (measuring layer-wise hidden-state similarity) and behavior alignment (evaluating semantic consistency between generated outputs and reference text completions).
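
To make the two reward signals concrete, here is a minimal PyTorch sketch of how they could be computed; the pooling, similarity choices, weighting, and function names are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def representation_alignment(speech_hidden, text_hidden):
    """Mean layer-wise cosine similarity between speech- and text-conditioned
    trajectories; each argument is a list of [seq_len, d_model] tensors, one per layer."""
    sims = [
        F.cosine_similarity(h_s.mean(dim=0), h_t.mean(dim=0), dim=0)  # pool over tokens
        for h_s, h_t in zip(speech_hidden, text_hidden)
    ]
    return torch.stack(sims).mean()

def behavior_alignment(generated, reference, embed):
    """Semantic consistency between the generated output and the reference
    text completion, scored with any sentence-embedding function `embed`."""
    return F.cosine_similarity(embed(generated), embed(reference), dim=0)

def tars_style_reward(speech_hidden, text_hidden, generated, reference, embed, alpha=0.5):
    # Asymmetric: only the speech-conditioned policy is updated toward the
    # frozen text-conditioned trajectory, which serves as the reference.
    return alpha * representation_alignment(speech_hidden, text_hidden) + \
        (1 - alpha) * behavior_alignment(generated, reference, embed)
```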

Result: Experiments on challenging reasoning benchmarks (MMSU and OBQA) show TARS significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.

Conclusion: The proposed TARS framework effectively addresses the modality reasoning gap in speech LLMs through reinforcement learning with representation and behavior alignment, leading to improved reasoning performance on speech inputs.

Abstract: Although speech large language models have achieved notable progress, a substantial modality reasoning gap remains: their reasoning performance on speech inputs is markedly weaker than on text. This gap could be associated with representational drift across Transformer layers and behavior deviations in long-chain reasoning. To address this issue, we introduce TARS, a reinforcement-learning framework that aligns text-conditioned and speech-conditioned trajectories through an asymmetric reward design. The framework employs two dense and complementary signals: representation alignment, which measures layer-wise hidden-state similarity between speech- and text-conditioned trajectories, and behavior alignment, which evaluates semantic consistency between generated outputs and reference text completions. Experiments on challenging reasoning benchmarks, including MMSU and OBQA, show that our approach significantly narrows the modality reasoning gap and achieves state-of-the-art performance among 7B-scale Speech LLMs.

[2] Enhancing Foundation Models in Transaction Understanding with LLM-based Sentence Embeddings

Xiran Fan, Zhimeng Jiang, Chin-Chia Michael Yeh, Yuzhong Chen, Yingtong Dou, Menghai Pan, Yan Zheng

Main category: cs.CL

TL;DR: Hybrid framework uses LLM-generated embeddings as semantic initializations for lightweight transaction models, balancing semantic understanding with computational efficiency for payment network analysis.

Motivation: Existing foundation models for transaction analysis lose semantic information by converting rich textual merchant data into discrete tokens, while LLMs offer better semantic understanding but have computational overhead that challenges real-time financial deployment.

Method: Hybrid framework with LLM-generated embeddings as semantic initializations for lightweight models, multi-source data fusion to enrich merchant categorical fields, one-word constraint principle for consistent embedding generation, and systematic data quality handling through noise filtering and context-aware enrichment.
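
A minimal sketch of the core idea, assuming a sentence-transformer as the embedding LLM (the model name and dimensions are illustrative): LLM sentence embeddings of merchant field values seed the lightweight model's embedding table, which then trains normally.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

merchant_categories = ["grocery store", "gas station", "airline ticket office"]
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

with torch.no_grad():
    seed = torch.tensor(encoder.encode(merchant_categories))  # [3, 384]

# The lightweight transaction model still looks categories up by index,
# but starts from semantically meaningful vectors instead of random ones.
category_embedding = nn.Embedding.from_pretrained(seed, freeze=False)
```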

Result: Significant performance improvements across multiple transaction understanding tasks on large-scale transaction datasets.

Conclusion: The proposed hybrid approach effectively balances semantic interpretability with operational efficiency for real-time financial transaction analysis.

Abstract: The ubiquity of payment networks generates vast transactional data encoding rich consumer and merchant behavioral patterns. Recent foundation models for transaction analysis process tabular data sequentially but rely on index-based representations for categorical merchant fields, causing substantial semantic information loss by converting rich textual data into discrete tokens. While Large Language Models (LLMs) can address this limitation through superior semantic understanding, their computational overhead challenges real-time financial deployment. We introduce a hybrid framework that uses LLM-generated embeddings as semantic initializations for lightweight transaction models, balancing interpretability with operational efficiency. Our approach employs multi-source data fusion to enrich merchant categorical fields and a one-word constraint principle for consistent embedding generation across LLM architectures. We systematically address data quality through noise filtering and context-aware enrichment. Experiments on large-scale transaction datasets demonstrate significant performance improvements across multiple transaction understanding tasks.

[3] The Table of Media Bias Elements: A sentence-level taxonomy of media bias types and propaganda techniques

Tim Menzner, Jochen L. Leidner

Main category: cs.CL

TL;DR: Researchers develop a fine-grained, sentence-level taxonomy of media bias with 38 types across 6 families, moving beyond simple left-right political labels to analyze concrete linguistic techniques used in news reporting.

Motivation: Current public debates about media bias focus too much on simplistic left-right political positioning, overlooking the actual linguistic techniques used to convey bias. The authors aim to shift focus from where outlets stand politically to how partiality is expressed at the sentence level.

Method: The researchers analyzed 26,464 sentences from newsroom corpora, user submissions, and browsing. They used iterative close-reading, interdisciplinary theory, and pilot annotation to develop a two-tier taxonomy with 38 elementary bias types organized into 6 functional families. They validated the taxonomy through quantitative analysis of a 155-sentence sample and cross-walk comparisons with existing NLP and communication science taxonomies.

Result: The study produced a comprehensive “table of media-bias elements” with definitions, real-world examples, cognitive/societal drivers, and recognition guidance for each bias type. Quantitative analysis showed prevalence differences across bias types, and comparison with existing taxonomies demonstrated substantial coverage gains and reduced ambiguity.

Conclusion: The research provides a more nuanced, sentence-level framework for analyzing media bias that moves beyond simplistic political spectrum labels, offering a practical tool for identifying specific linguistic techniques used to convey partiality in news reporting.

Abstract: Public debates about “left-” or “right-wing” news overlook the fact that bias is usually conveyed by concrete linguistic manoeuvres that transcend any single political spectrum. We therefore shift the focus from where an outlet allegedly stands to how partiality is expressed in individual sentences. Drawing on 26,464 sentences collected from newsroom corpora, user submissions and our own browsing, we iteratively combine close-reading, interdisciplinary theory and pilot annotation to derive a fine-grained, sentence-level taxonomy of media bias and propaganda. The result is a two-tier schema comprising 38 elementary bias types, arranged in six functional families and visualised as a “table of media-bias elements”. For each type we supply a definition, real-world examples, cognitive and societal drivers, and guidance for recognition. A quantitative survey of a random 155-sentence sample illustrates prevalence differences, while a cross-walk to the best-known NLP and communication-science taxonomies reveals substantial coverage gains and reduced ambiguity.

[4] Lost in Execution: On the Multilingual Robustness of Tool Calling in Large Language Models

Zheng Luo, T Pranav Kutralingam, Ogochukwu N Okoani, Wanpeng Xu, Hua Wei, Xiyang Hu

Main category: cs.CL

TL;DR: LLMs struggle with multilingual tool calling due to parameter value language mismatch, where models generate values in the user’s language instead of language-invariant conventions.

Motivation: While LLMs show strong tool-calling performance in English, their robustness in multilingual user interactions remains underexplored, creating a gap in understanding cross-lingual tool execution capabilities.

Method: Introduces MLCL diagnostic benchmark and conducts systematic evaluation across Chinese, Hindi, and Igbo. Performs fine-grained error analysis and evaluates several inference-time system strategies to address language-induced execution errors.
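
One such fine-grained check could run language identification over generated parameter values; this sketch uses langdetect and assumes English as the language-invariant convention, both illustrative choices rather than the paper's actual analysis code.

```python
from langdetect import detect  # pip install langdetect

def flag_parameter_language_mismatch(tool_call_args, invariant_lang="en"):
    """tool_call_args: dict mapping parameter names to generated string values.
    Returns parameters whose values are not in the expected invariant language."""
    mismatches = {}
    for name, value in tool_call_args.items():
        if isinstance(value, str) and value.strip():
            lang = detect(value)
            if lang != invariant_lang:
                mismatches[name] = lang
    return mismatches

# E.g. flag_parameter_language_mismatch({"city": "मुंबई", "units": "metric"})
# might return {"city": "hi"}: a value generated in the user's language
# rather than the language the API expects.
```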

Result: Many failures occur despite correct intent understanding and tool selection. Parameter value language mismatch is identified as dominant failure mode. Inference-time strategies reduce language-induced errors but cannot fully recover English-level performance.

Conclusion: Multilingual tool calling remains challenging due to language mismatch issues, and current mitigation strategies are insufficient to achieve parity with English performance, highlighting need for better multilingual tool-calling capabilities.

Abstract: Large Language Models (LLMs) are increasingly deployed as agents that invoke external tools through structured function calls. While recent work reports strong tool-calling performance under standard English-centric evaluations, the robustness of tool calling under multilingual user interactions remains underexplored. In this work, we introduce MLCL, a diagnostic benchmark, and conduct a systematic evaluation of multilingual tool calling across Chinese, Hindi, and the low-resource language Igbo. Through fine-grained error analysis, we show that many failures occur despite correct intent understanding and tool selection. We identify parameter value language mismatch as a dominant failure mode, where models generate semantically appropriate parameter values in the user’s language, violating language-invariant execution conventions. We further evaluate several inference-time system strategies and find that while these strategies substantially reduce language-induced execution errors, none of them can fully recover English-level performance.

[5] Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection

Zhiwei Liu, Yupen Cao, Yuechen Jiang, Mohsinul Kabir, Polydoros Giannouris, Chen Xu, Ziyang Xu, Tianlei Zhu, Tariquzzaman Faisal, Triantafillos Papadopoulos, Yan Wang, Lingfei Qian, Xueqing Peng, Zhuohan Xie, Ye Yuan, Saeed Almheiri, Abdulrazzaq Alnajjar, Mingbin Chen, Harry Stuart, Paul Thompson, Prayag Tiwari, Alejandro Lopez-Lira, Xue Liu, Jimin Huang, Sophia Ananiadou

Main category: cs.CL

TL;DR: Researchers created MFMD-Scen, a benchmark to evaluate behavioral biases in LLMs for multilingual financial misinformation detection across complex real-world scenarios, finding significant biases persist in both commercial and open-source models.

Motivation: LLMs trained on human-authored corpora may inherit human behavioral biases, which can cause instability in financial decision-making. Existing bias research focuses on simple settings, lacking consideration for complex financial environments and high-risk multilingual financial misinformation detection tasks.

Method: Developed MFMD-Scen benchmark with financial experts to create three complex scenario types: (1) role- and personality-based, (2) role- and region-based, (3) role-based with ethnicity/religious beliefs. Built multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. Evaluated 22 mainstream LLMs using these integrated scenarios.

Result: Pronounced behavioral biases persist across both commercial and open-source LLMs when evaluated in complex financial misinformation detection scenarios. The benchmark reveals systematic bias patterns in multilingual financial contexts.

Conclusion: LLMs exhibit significant behavioral biases in financial misinformation detection across diverse economic scenarios, highlighting the need for bias-aware development and evaluation in financial applications. The MFMD-Scen benchmark provides a comprehensive tool for systematic bias assessment.

Abstract: Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of the complex real-world financial environments and high-risk, context-sensitive, multilingual financial misinformation detection tasks (MFMD). In this work, we propose MFMD-Scen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMD-Scen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project will be available at https://github.com/lzw108/FMD.

[6] Glitter: Visualizing Lexical Surprisal for Readability in Administrative Texts

Jan Černý, Ivana Kvapilíková, Silvie Cinková

Main category: cs.CL

TL;DR: A framework using language model entropy to estimate text readability, visualized to help improve bureaucratic text clarity.

Motivation: To improve readability and clarity of administrative/bureaucratic texts by developing objective measures based on information theory.

Method: Proposes a visualization framework that approximates text information entropy using multiple language models, with the tool available as libre software.
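
The per-token quantity behind such a visualization is lexical surprisal, -log p(token | context), which any causal language model can supply. A minimal sketch with Hugging Face transformers (GPT-2 here is an illustrative stand-in for the models Glitter actually uses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_surprisal(text):
    """Return (token, surprisal-in-nats) pairs; high values mark hard-to-read spots."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)   # p(token_t | tokens_<t)
    s = -logp.gather(1, ids[0, 1:, None]).squeeze(1)   # pick each actual next token
    return list(zip(tok.convert_ids_to_tokens(ids[0, 1:].tolist()), s.tolist()))
```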

Result: Developed Glitter toolset (available on GitHub) that can estimate readability through entropy visualization.

Conclusion: Information entropy from language models provides a useful approach for estimating and improving text readability, especially for bureaucratic documents.

Abstract: This work investigates how measuring information entropy of text can be used to estimate its readability. We propose a visualization framework that can be used to approximate information entropy of text using multiple language models and visualize the result. The end goal is to use this method to estimate and improve readability and clarity of administrative or bureaucratic texts. Our toolset is available as a libre software on https://github.com/ufal/Glitter.

[7] Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions

Minda Zhao, Yilun Du, Mengyu Wang

Main category: cs.CL

TL;DR: Large-scale audit reveals frontier LLMs fail at probabilistic sampling, especially in independent requests, with sampling fidelity degrading with distribution complexity and sampling horizon, necessitating external tools for statistical guarantees.

Motivation: As LLMs become integral to stochastic pipelines in domains like educational assessment and synthetic data construction, faithful sampling from specified probability distributions has become a functional requirement rather than a theoretical curiosity.

Method: First large-scale statistically powered audit of native probabilistic sampling in 11 frontier LLMs across 15 distributions using dual-protocol design: Batch Generation (N=1000 samples in one response) and Independent Requests (N=1000 stateless calls).
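
For intuition, a goodness-of-fit check of the kind such an audit requires might look as follows; the die example and pass criterion are illustrative assumptions, not the paper's exact test battery.

```python
from collections import Counter
from scipy.stats import chisquare

def passes_uniform_die_audit(samples, alpha=0.05):
    """Chi-square goodness-of-fit on N samples the model claims to draw
    uniformly from {1,...,6}; True means consistent with the target distribution."""
    counts = Counter(samples)
    observed = [counts.get(face, 0) for face in range(1, 7)]
    expected = [len(samples) / 6] * 6
    _, p_value = chisquare(observed, expected)
    return p_value >= alpha

# Batch protocol: parse 1000 rolls from one response, then run the audit.
# Independent protocol: one stateless request per roll, then the same audit.
```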

Result: Sharp protocol asymmetry: batch generation achieves only a 13% median pass rate, while independent requests collapse almost entirely (10 of 11 models pass none of the distributions). Sampling fidelity degrades with distributional complexity and worsens as sampling horizon N increases. Failures propagate to downstream tasks like MCQ generation and attribute-constrained text-to-image prompt synthesis.

Conclusion: Current LLMs lack a functional internal sampler, necessitating use of external tools for applications requiring statistical guarantees in stochastic pipelines.

Abstract: As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines across domains like educational assessment and synthetic data construction, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces N=1000 samples within one response, and Independent Requests, comprising N=1000 stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 13% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the distributions. Beyond this asymmetry, we reveal that sampling fidelity degrades monotonically with distributional complexity and worsens as the requested sampling horizon N increases. Finally, we demonstrate the propagation of these failures into downstream tasks: models fail to enforce uniform answer-position constraints in MCQ generation and systematically violate demographic targets in attribute-constrained text-to-image prompt synthesis. These findings indicate that current LLMs lack a functional internal sampler, necessitating the use of external tools for applications requiring statistical guarantees.

[8] Tracing Moral Foundations in Large Language Models

Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani

Main category: cs.CL

TL;DR: LLMs show structured moral representations aligned with human judgments, with evidence of partially disentangled moral mechanisms emerging from language statistics alone.

Motivation: To determine whether LLMs' moral judgments reflect genuine internal conceptual structure or just superficial "moral mimicry" by examining how moral foundations are encoded and organized.

Method: Multi-level approach using Moral Foundations Theory: (1) layer-wise analysis of concept representations, (2) pretrained sparse autoencoders to identify moral features, (3) causal steering interventions with dense vectors and sparse features.
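
For the steering step, a common mechanism is to add a direction vector to the residual stream with a forward hook; this sketch assumes a Llama-style Hugging Face model, and the layer index and scale are illustrative choices, not the paper's exact setup.

```python
import torch

def add_steering_hook(model, layer_idx, direction, scale=4.0):
    """Shift one transformer block's output along a moral-foundation
    direction ([d_model] tensor). Returns the hook handle for later removal."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * direction.to(hidden)  # match dtype/device
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return model.model.layers[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(llm, layer_idx=15, direction=care_harm_vector)
# ... generate and measure the behavior shift, then handle.remove() to undo.
```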

Result: Both models represent moral foundations in structured, layer-dependent ways aligned with human judgments. SAE features show semantic links to specific foundations, and steering produces predictable behavioral shifts, demonstrating causal connections.

Conclusion: Moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting pluralistic moral structure can emerge as latent patterns from language statistics alone, not just mimicry.

Abstract: Large language models (LLMs) often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial “moral mimicry.” Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed within two instruction-tuned LLMs: Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that both models represent and distinguish moral foundations in a structured, layer-dependent way that aligns with human judgments. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.

[9] Do LLMs Need Inherent Reasoning Before Reinforcement Learning? A Study in Korean Self-Correction

Hongjin Kim, Jaewook Lee, Kiyoung Lee, Jong-hun Shin, Soojong Lim, Oh-Woog Kwon

Main category: cs.CL

TL;DR: RL alone doesn’t improve Korean reasoning in LLMs; aligning Korean-specific neurons in early layers through fine-tuning enables RL effectiveness, showing multilingual reasoning enhancement requires eliciting existing capabilities rather than injecting new knowledge.

Motivation: LLMs show strong reasoning in high-resource languages like English but perform poorly in low-resource languages like Korean. The study investigates whether reinforcement learning (RL) can enhance Korean reasoning to match English performance.

Method: Tested RL on Korean reasoning, found limited improvements. Explored fine-tuning strategies focusing on aligning internal reasoning processes with Korean inputs by tuning Korean-specific neurons in early layers. Created a self-correction code-switching dataset to facilitate alignment.

Result: RL alone yields limited improvements without inherent Korean reasoning capabilities. Aligning Korean-specific neurons in early layers unlocks RL’s effectiveness, leading to significant performance gains in mathematical reasoning and self-correction tasks.

Conclusion: Multilingual reasoning enhancement depends on effectively eliciting and aligning existing reasoning capabilities rather than injecting new linguistic knowledge. Internal translation and neuron-level tuning are key to multilingual reasoning alignment in LLMs.

Abstract: Large Language Models (LLMs) demonstrate strong reasoning and self-correction abilities in high-resource languages like English, but their performance remains limited in low-resource languages such as Korean. In this study, we investigate whether reinforcement learning (RL) can enhance Korean reasoning abilities to a degree comparable to English. Our findings reveal that RL alone yields limited improvements when applied to models lacking inherent Korean reasoning capabilities. To address this, we explore several fine-tuning strategies and show that aligning the model’s internal reasoning processes with Korean inputs, particularly by tuning Korean-specific neurons in early layers, is key to unlocking RL’s effectiveness. We introduce a self-correction code-switching dataset to facilitate this alignment and observe significant performance gains in both mathematical reasoning and self-correction tasks. Ultimately, we conclude that the crucial factor in multilingual reasoning enhancement is not injecting new linguistic knowledge, but effectively eliciting and aligning existing reasoning capabilities. Our study provides a new perspective on how internal translation and neuron-level tuning contribute to multilingual reasoning alignment in LLMs.

[10] Towards Valid Student Simulation with Large Language Models

Zhihao Yuan, Yunze Xiao, Ming Li, Weihao Xuan, Richard Tong, Mona Diab, Tom Mitchell

Main category: cs.CL

TL;DR: The paper presents a framework for LLM-based student simulation that addresses the “competence paradox” where capable LLMs unrealistically emulate partially knowledgeable learners, proposing Epistemic State Specification and Goal-by-Environment frameworks for constrained generation.

Motivation: Current LLM-based student simulations suffer from the "competence paradox" - broadly capable language models asked to emulate partially knowledgeable learners produce unrealistic error patterns and learning dynamics, limiting their usefulness as scientific and pedagogical instruments.

Method: The paper reframes student simulation as a constrained generation problem using Epistemic State Specification (ESS) to define what simulated learners can access, how errors are structured, and how learner state evolves. It also introduces a Goal-by-Environment framework to situate systems according to behavioral objectives and deployment contexts.

Result: The work synthesizes prior literature, formalizes key design dimensions for LLM-based student simulation, and articulates open challenges related to validity, evaluation, and ethical risks rather than proposing new systems or benchmarks.

Conclusion: The paper argues that epistemic fidelity (faithful representation of learner knowledge states) should be prioritized over surface realism as a prerequisite for using LLM-based simulated students as reliable scientific and pedagogical instruments.

Abstract: This paper presents a conceptual and methodological framework for large language model (LLM) based student simulation in educational settings. The authors identify a core failure mode, termed the “competence paradox” in which broadly capable LLMs are asked to emulate partially knowledgeable learners, leading to unrealistic error patterns and learning dynamics. To address this, the paper reframes student simulation as a constrained generation problem governed by an explicit Epistemic State Specification (ESS), which defines what a simulated learner can access, how errors are structured, and how learner state evolves over time. The work further introduces a Goal-by-Environment framework to situate simulated student systems according to behavioral objectives and deployment contexts. Rather than proposing a new system or benchmark, the paper synthesizes prior literature, formalizes key design dimensions, and articulates open challenges related to validity, evaluation, and ethical risks. Overall, the paper argues for epistemic fidelity over surface realism as a prerequisite for using LLM-based simulated students as reliable scientific and pedagogical instruments.

[11] The Facade of Truth: Uncovering and Mitigating LLM Susceptibility to Deceptive Evidence

Herun Wan, Jiaying Wu, Minnan Luo, Fanxiao Li, Zhi Zeng, Min-Yen Kan

Main category: cs.CL

TL;DR: LLMs are vulnerable to sophisticated misleading evidence despite resisting direct misinformation; MisBelief framework generates deceptive claims through multi-role interactions; models show 93% increased belief in falsehoods; Deceptive Intent Shielding mitigates this vulnerability.

Motivation: LLMs need to maintain factual beliefs against misleading information to reliably assist human decision-making. While current models resist explicit misinformation, there's a fundamental vulnerability to sophisticated, hard-to-falsify evidence that needs to be systematically studied.

Method: Introduces MisBelief framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs, mimicking subtle defeasible reasoning and progressive refinement. Evaluates 7 representative LLMs on 4,800 instances across three difficulty levels. Proposes Deceptive Intent Shielding (DIS) as a governance mechanism to infer deceptive intent.

Result: Models are robust to direct misinformation but highly sensitive to refined misleading evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.

Conclusion: LLMs have a critical vulnerability to sophisticated misleading evidence despite resisting direct misinformation. The MisBelief framework effectively exposes this weakness, and Deceptive Intent Shielding provides a promising governance mechanism to protect models from deceptive reasoning patterns.

Abstract: To reliably assist human decision-making, LLMs must maintain factual internal beliefs against misleading injections. While current models resist explicit misinformation, we uncover a fundamental vulnerability to sophisticated, hard-to-falsify evidence. To systematically probe this weakness, we introduce MisBelief, a framework that generates misleading evidence via collaborative, multi-round interactions among multi-role LLMs. This process mimics subtle, defeasible reasoning and progressive refinement to create logically persuasive yet factually deceptive claims. Using MisBelief, we generate 4,800 instances across three difficulty levels to evaluate 7 representative LLMs. Results indicate that while models are robust to direct misinformation, they are highly sensitive to this refined evidence: belief scores in falsehoods increase by an average of 93.0%, fundamentally compromising downstream recommendations. To address this, we propose Deceptive Intent Shielding (DIS), a governance mechanism that provides an early warning signal by inferring the deceptive intent behind evidence. Empirical results demonstrate that DIS consistently mitigates belief shifts and promotes more cautious evidence evaluation.

[12] Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency

Haoming Xu, Ningyuan Zhao, Yunzhi Yao, Weihong Xu, Hongru Wang, Xinle Deng, Shumin Deng, Jeff Z. Pan, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: The paper proposes Neighbor-Consistency Belief (NCB) to measure belief robustness in LLMs, showing that even facts with perfect self-consistency can collapse under mild contextual interference, and introduces Structure-Aware Training to reduce knowledge brittleness.

Motivation: Current LLM evaluations rely too much on point-wise confidence metrics like Self-Consistency, which can mask brittle beliefs. Real-world deployment requires models to maintain truthful beliefs under contextual perturbations, but existing methods don't adequately test belief robustness.

Method: Proposes Neighbor-Consistency Belief (NCB) as a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. Introduces a cognitive stress-testing protocol to probe output stability under contextual interference. Presents Structure-Aware Training (SAT) to optimize context-invariant belief structure.
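
In the spirit of the description above, a neighborhood-consistency score can be sketched as the agreement rate over conceptually related probes; the exact formulation in the paper may differ, and `ask` / `agrees` are stand-in functions.

```python
def neighbor_consistency_belief(ask, central_question, neighbor_questions, agrees):
    """Fraction of neighboring probes whose answers cohere with the model's
    answer to the central fact; 1.0 = a belief that holds across the neighborhood."""
    central_answer = ask(central_question)
    consistent = sum(agrees(central_answer, ask(q)) for q in neighbor_questions)
    return consistent / len(neighbor_questions)

# Example neighborhood for "What is the capital of Australia?":
# paraphrases, entailed facts ("Is Parliament House in Canberra?"), and
# contrastive variants ("Is Sydney the capital of Australia?").
```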

Result: Experiments show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. High-NCB data demonstrates greater resistance to interference. Structure-Aware Training reduces long-tail knowledge brittleness by approximately 30%.

Conclusion: NCB provides a better measure of belief robustness than point-wise confidence metrics. Structural approaches to belief evaluation and training can significantly improve LLM reliability in real-world deployments where contextual perturbations are common.

Abstract: As Large Language Models (LLMs) are increasingly deployed in real-world settings, correctness alone is insufficient. Reliable deployment requires maintaining truthful beliefs under contextual perturbations. Existing evaluations largely rely on point-wise confidence like Self-Consistency, which can mask brittle beliefs. We show that even facts answered with perfect self-consistency can rapidly collapse under mild contextual interference. To address this gap, we propose Neighbor-Consistency Belief (NCB), a structural measure of belief robustness that evaluates response coherence across a conceptual neighborhood. To validate the efficiency of NCB, we introduce a new cognitive stress-testing protocol that probes output stability under contextual interference. Experiments across multiple LLMs show that high-NCB data is relatively more resistant to interference. Finally, we present Structure-Aware Training (SAT), which optimizes context-invariant belief structure and reduces long-tail knowledge brittleness by approximately 30%. Code will be available at https://github.com/zjunlp/belief.

[13] MemBuilder: Reinforcing LLMs for Long-Term Memory Construction via Attributed Dense Rewards

Zhiyu Shen, Ziming Wu, Fuming Lai, Shaobing Lian, Yanghui Rao

Main category: cs.CL

TL;DR: MemBuilder is a reinforcement learning framework that trains LLMs to build multi-dimensional memory for long-term dialogues using dense rewards and contribution-aware gradient weighting, enabling a 4B model to outperform closed-source baselines.

Motivation: Standard retrieval mechanisms fail to capture temporal evolution in long-term dialogues, and current memory-augmented systems either rely on static prompting of closed-source models or suffer from ineffective training with sparse rewards.

Method: MemBuilder uses reinforcement learning with synthetic session-level question generation for dense intermediate rewards, and contribution-aware gradient weighting that scales policy updates based on each memory component’s downstream impact.
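
A hedged sketch of contribution-aware gradient weighting: each memory component's REINFORCE-style term is scaled by its estimated downstream impact. The softmax normalization is an assumption about how such weights could be formed, not MemBuilder's published recipe.

```python
import torch

def contribution_weighted_loss(logprobs, rewards, contributions):
    """All arguments are [num_components] tensors: per-component log-probabilities
    of the constructed memory, dense rewards, and estimated downstream impact."""
    weights = torch.softmax(contributions, dim=0)          # impact -> update share
    return -(weights * rewards.detach() * logprobs).sum()  # minimizing ascends reward
```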

Result: MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines and shows strong generalization across long-term dialogue benchmarks.

Conclusion: The framework successfully addresses sparse reward and multi-dimensional memory attribution challenges, demonstrating that properly trained smaller models can outperform larger closed-source systems in long-term dialogue consistency.

Abstract: Maintaining consistency in long-term dialogues remains a fundamental challenge for LLMs, as standard retrieval mechanisms often fail to capture the temporal evolution of historical states. While memory-augmented frameworks offer a structured alternative, current systems rely on static prompting of closed-source models or suffer from ineffective training paradigms with sparse rewards. We introduce MemBuilder, a reinforcement learning framework that trains models to orchestrate multi-dimensional memory construction with attributed dense rewards. MemBuilder addresses two key challenges: (1) Sparse Trajectory-Level Rewards: we employ synthetic session-level question generation to provide dense intermediate rewards across extended trajectories; and (2) Multi-Dimensional Memory Attribution: we introduce contribution-aware gradient weighting that scales policy updates based on each component’s downstream impact. Experimental results show that MemBuilder enables a 4B-parameter model to outperform state-of-the-art closed-source baselines, exhibiting strong generalization across long-term dialogue benchmarks.

[14] Can We Predict Before Executing Machine Learning Agents?

Jingsheng Zheng, Jintian Zhang, Yujie Luo, Yuren Mao, Yunjun Gao, Lun Du, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: LLM-based agents can predict solution quality before execution using verified data analysis reports, achieving 61.5% accuracy and enabling 6x faster convergence through a Predict-then-Verify approach.

Motivation: Current autonomous ML agents are limited by the Generate-Execute-Feedback paradigm, suffering from an Execution Bottleneck where hypothesis evaluation requires expensive physical execution. The paper aims to bypass these constraints by internalizing execution priors for predictive reasoning.

Method: Formalizes Data-centric Solution Preference task and constructs a corpus of 18,438 pairwise comparisons. Uses LLMs primed with Verified Data Analysis Reports for predictive capabilities. Implements FOREAGENT with a Predict-then-Verify loop that substitutes runtime checks with instantaneous predictive reasoning.
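
The Predict-then-Verify loop can be pictured as ranking candidates with the report-primed LLM and executing only the top few; `llm_score` and `execute` are stand-ins, and the top-k cutoff and scalar metric are assumptions, not FOREAGENT's exact control flow.

```python
def predict_then_verify(candidates, analysis_report, llm_score, execute, top_k=2):
    """Rank candidate ML solutions by predicted quality given a verified data
    analysis report, then physically run only the best-ranked ones."""
    ranked = sorted(candidates, key=lambda c: llm_score(c, analysis_report), reverse=True)
    verified = [(c, execute(c)) for c in ranked[:top_k]]   # the only costly step
    return max(verified, key=lambda pair: pair[1])         # best executed solution
```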

Result: LLMs achieve 61.5% accuracy in predicting solution preferences with robust confidence calibration. FOREAGENT achieves 6x acceleration in convergence while surpassing execution-based baselines by +6%.

Conclusion: The Predict-then-Verify framework successfully bypasses physical execution constraints by internalizing execution priors, enabling faster and more efficient autonomous scientific discovery while maintaining or improving performance over traditional execution-based approaches.

Abstract: Autonomous machine learning agents have revolutionized scientific discovery, yet they remain constrained by a Generate-Execute-Feedback paradigm. Previous approaches suffer from a severe Execution Bottleneck, as hypothesis evaluation relies strictly on expensive physical execution. To bypass these physical constraints, we internalize execution priors to substitute costly runtime checks with instantaneous predictive reasoning, drawing inspiration from World Models. In this work, we formalize the task of Data-centric Solution Preference and construct a comprehensive corpus of 18,438 pairwise comparisons. We demonstrate that LLMs exhibit significant predictive capabilities when primed with a Verified Data Analysis Report, achieving 61.5% accuracy and robust confidence calibration. Finally, we instantiate this framework in FOREAGENT, an agent that employs a Predict-then-Verify loop, achieving a 6x acceleration in convergence while surpassing execution-based baselines by +6%. Our code and dataset will be publicly available soon at https://github.com/zjunlp/predict-before-execute.

[15] FlashMem: Distilling Intrinsic Latent Memory via Computation Reuse

Yubo Hou, Zhisheng Chen, Tao Wan, Zengchang Qin

Main category: cs.CL

TL;DR: FlashMem is a framework that enables efficient long-term memory for LLMs by reusing computation from hidden states instead of using separate memory modules, achieving 5x faster inference while maintaining performance.

Motivation: LLMs lack dynamic context preservation, forcing agents to repeatedly reprocess history for long-term autonomy. Current memory approaches use separate encoders that decouple memory from reasoning, creating inefficiency.

Method: Uses computation reuse to distill intrinsic memory from transient reasoning states. Identifies last hidden state as sufficient statistic for interaction history. Uses Shared-KV Consolidator to synthesize memory by attending directly to frozen cache, eliminating redundant parameters. Includes parameter-free Cognitive Monitor using attention entropy to trigger consolidation only during high epistemic uncertainty.
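
The entropy trigger can be sketched directly from attention probabilities; the threshold and mean-pooling here are illustrative assumptions rather than FlashMem's exact monitor.

```python
import torch

def should_consolidate(attn_probs, threshold=3.0):
    """attn_probs: [num_heads, query_len, key_len] attention distributions for
    one layer. High mean entropy signals epistemic uncertainty -> consolidate."""
    entropy = -(attn_probs * (attn_probs + 1e-12).log()).sum(dim=-1)  # per head/query
    return entropy.mean().item() > threshold
```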

Result: Matches performance of heavy baselines while reducing inference latency by 5 times, effectively bridging efficiency and persistent cognition.

Conclusion: FlashMem successfully enables efficient long-term memory for LLMs by leveraging intrinsic computation reuse, eliminating the need for separate memory modules while maintaining performance and significantly improving inference speed.

Abstract: The stateless architecture of Large Language Models inherently lacks the mechanism to preserve dynamic context, compelling agents to redundantly reprocess history to maintain long-horizon autonomy. While latent memory offers a solution, current approaches are hindered by architectural segregation, relying on auxiliary encoders that decouple memory from the reasoning backbone. We propose FlashMem, a framework that distills intrinsic memory directly from transient reasoning states via computation reuse. Leveraging the property that internal representations uniquely encode input trajectories, FlashMem identifies the last hidden state as a sufficient statistic for the interaction history. This enables a Shared-KV Consolidator to synthesize memory by attending directly to the backbone’s frozen cache, eliminating redundant re-parameterization. Furthermore, a parameter-free Cognitive Monitor leverages attention entropy to adaptively trigger consolidation only when high epistemic uncertainty is detected. Experiments demonstrate that FlashMem matches the performance of heavy baselines while reducing inference latency by 5 times, effectively bridging the gap between efficiency and persistent cognition.

[16] CHisAgent: A Multi-Agent Framework for Event Taxonomy Construction in Ancient Chinese Cultural Systems

Xuemei Tang, Chengxi Yan, Jinghang Gu, Chu-Ren Huang

Main category: cs.CL

TL;DR: CHisAgent is a multi-agent LLM framework that automatically constructs historical taxonomies for ancient Chinese contexts by combining bottom-up induction, top-down expansion, and evidence-guided enrichment.

Motivation: LLMs have limited historical and cultural reasoning capabilities, especially in non-English contexts like Chinese history. Manual taxonomy construction is costly and doesn't scale, but taxonomies are effective for organizing historical knowledge.

Method: Three-stage multi-agent framework: 1) Inducer (bottom-up) derives initial hierarchy from raw historical corpora, 2) Expander (top-down) introduces missing intermediate concepts using LLM world knowledge, 3) Enricher (evidence-guided) integrates external structured historical resources for faithfulness.

Result: Constructed a large-scale, domain-aware event taxonomy covering politics, military, diplomacy, and social life in ancient China using the Twenty-Four Histories. Evaluations show improved structural coherence and coverage, and the taxonomy supports cross-cultural alignment.

Conclusion: CHisAgent successfully addresses the scalability challenge of historical taxonomy construction while improving knowledge organization for ancient Chinese contexts, demonstrating practical utility for cross-cultural historical analysis.

Abstract: Despite strong performance on many tasks, large language models (LLMs) show limited ability in historical and cultural reasoning, particularly in non-English contexts such as Chinese history. Taxonomic structures offer an effective mechanism to organize historical knowledge and improve understanding. However, manual taxonomy construction is costly and difficult to scale. Therefore, we propose CHisAgent, a multi-agent LLM framework for historical taxonomy construction in ancient Chinese contexts. CHisAgent decomposes taxonomy construction into three role-specialized stages: a bottom-up Inducer that derives an initial hierarchy from raw historical corpora, a top-down Expander that introduces missing intermediate concepts using LLM world knowledge, and an evidence-guided Enricher that integrates external structured historical resources to ensure faithfulness. Using the Twenty-Four Histories, we construct a large-scale, domain-aware event taxonomy covering politics, military, diplomacy, and social life in ancient China. Extensive reference-free and reference-based evaluations demonstrate improved structural coherence and coverage, while further analysis shows that the resulting taxonomy supports cross-cultural alignment.

[17] Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, Cong Wang

Main category: cs.CL

TL;DR: Double Retrieval Speculative Parallelism (Double) breaks speedup limits of parallel speculative decoding by enabling iterative draft retrieval and authoritative target guidance, achieving 5.3× speedup on LLaMA3.3-70B without training.

Motivation: Parallel Speculative Decoding has two fundamental limitations: (1) theoretical speedup ceiling limited by draft/target model speed ratio, and (2) computational waste from mid-sequence token rejections causing pipeline stalls.

Method: Double bridges SD and PSD with a synchronous retrieval mechanism. The draft model performs iterative retrieval speculations to break speedup limits, while the target model performs authoritative retrieval to generate multi-token guidance, eliminating rollbacks from rejections.

Result: Achieves state-of-the-art speedup of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming the training-based method EAGLE-3. The approach is training-free and lossless.

Conclusion: Double successfully addresses the Retrieval Precision-Efficiency Dilemma through novel synchronous retrieval mechanism, enabling dramatic speedup without model training while maintaining lossless generation.

Abstract: Parallel Speculative Decoding (PSD) accelerates traditional Speculative Decoding (SD) by overlapping draft generation with verification. However, it remains hampered by two fundamental challenges: (1) a theoretical speedup ceiling dictated by the speed ratio between the draft and target models, and (2) high computational waste and pipeline stall due to mid-sequence token rejections of early errors. To address these limitations, we introduce Double (Double Retrieval Speculative Parallelism). By bridging the gap between SD and PSD, our framework resolves the Retrieval Precision-Efficiency Dilemma through a novel synchronous mechanism. Specifically, we enable the draft model to execute iterative retrieval speculations to break the theoretical speedup limits; to alleviate rejections without rollback, the target model performs authoritative retrieval to generate multi-token guidance. Double is entirely training-free and lossless. Extensive experiments demonstrate state-of-the-art speedup of 5.3× on LLaMA3.3-70B and 2.8× on Qwen3-32B, significantly outperforming the advanced method EAGLE-3 that requires extensive model training.

[18] Can Large Language Models Differentiate Harmful from Argumentative Essays? Steps Toward Ethical Essay Scoring

Hongjin Kim, Jeonghyun Kang, Harksoo Kim

Main category: cs.CL

TL;DR: LLMs and AES systems fail to properly identify and score harmful essays containing racism, gender bias, and other problematic content, often giving them high scores despite ethical issues.

Motivation: Current AES systems and LLMs overlook ethically problematic elements in essays, erroneously assigning high scores to harmful content that propagates racism, gender bias, and other sensitive issues.

Method: Introduced the Harmful Essay Detection (HED) benchmark containing essays with sensitive topics like racism and gender bias to test LLMs’ ability to recognize and score harmful content.

Result: 1) LLMs need improvement to distinguish harmful from argumentative essays; 2) Both AES models and LLMs fail to consider ethical dimensions during scoring.

Conclusion: There’s a critical need to develop more robust AES systems that are sensitive to ethical implications of content, addressing the gap in current automated scoring technologies.

Abstract: This study addresses critical gaps in Automated Essay Scoring (AES) systems and Large Language Models (LLMs) with regard to their ability to effectively identify and score harmful essays. Despite advancements in AES technology, current models often overlook ethically and morally problematic elements within essays, erroneously assigning high scores to essays that may propagate harmful opinions. In this study, we introduce the Harmful Essay Detection (HED) benchmark, which includes essays integrating sensitive topics such as racism and gender bias, to test the efficacy of various LLMs in recognizing and scoring harmful content. Our findings reveal that: (1) LLMs require further enhancement to accurately distinguish between harmful and argumentative essays, and (2) both current AES models and LLMs fail to consider the ethical dimensions of content during scoring. The study underscores the need for developing more robust AES systems that are sensitive to the ethical implications of the content they are scoring.

[19] Generation-Based and Emotion-Reflected Memory Update: Creating the KEEM Dataset for Better Long-Term Conversation

Jeonghyun Kang, Hongjin Kim, Harksoo Kim

Main category: cs.CL

TL;DR: KEEM dataset enables dynamic generation of integrative memories for conversational AI, preserving both factual information and emotional context to improve long-term conversation quality.

Motivation: Existing memory update approaches in conversational systems use simple accumulation or operation-based methods that cause information conflicts and struggle to track users' current states accurately, limiting the system's ability to maintain meaningful long-term conversations.

Method: Introduces the Keep Emotional and Essential Memory (KEEM) dataset, a generation-based approach that dynamically creates integrative memories by preserving essential factual information while incorporating emotional context and causal relationships from user interactions.

Result: The KEEM dataset enables more nuanced understanding of user interactions by seamlessly updating system memory with both emotional and essential data, addressing limitations of previous memory update methods.

Conclusion: KEEM’s integrative memory generation approach promotes deeper empathy and enhances conversational systems’ ability to respond meaningfully in open-domain conversations by maintaining comprehensive emotional and factual context over time.

Abstract: In this work, we introduce the Keep Emotional and Essential Memory (KEEM) dataset, a novel generation-based dataset designed to enhance memory updates in long-term conversational systems. Unlike existing approaches that rely on simple accumulation or operation-based methods, which often result in information conflicts and difficulties in accurately tracking a user’s current state, KEEM dynamically generates integrative memories. This process not only preserves essential factual information but also incorporates emotional context and causal relationships, enabling a more nuanced understanding of user interactions. By seamlessly updating a system’s memory with both emotional and essential data, our approach promotes deeper empathy and enhances the system’s ability to respond meaningfully in open-domain conversations.

[20] ReasonAny: Incorporating Reasoning Capability to Any Model via Simple and Effective Model Merging

Junyao Yang, Chen Qian, Dongrui Liu, Wen Shen, Yong Liu, Jing Shao

Main category: cs.CL

TL;DR: ReasonAny is a novel model merging framework that enables “Reasoning + X” capabilities by addressing performance collapse through contrastive gradient identification, allowing domain-specialized models to acquire reasoning abilities without compromising their core functions.

Motivation: While Large Reasoning Models (LRMs) excel at chain-of-thought reasoning, equipping domain-specialized models with such reasoning capabilities ("Reasoning + X") remains challenging. Existing model merging methods suffer from destructive performance collapse that weakens both reasoning depth and domain-specific utility.

Method: ReasonAny uses Contrastive Gradient Identification based on the insight that reasoning ability resides in parameter regions with low gradient sensitivity (contrary to assumptions that domain capabilities correspond to high-magnitude parameters). This framework identifies and preserves reasoning capabilities during model merging.
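
One way to operationalize the gradient-sensitivity insight during merging, sketched under the assumption that low-sensitivity regions should keep the reasoning model's weights (the quantile cutoff is illustrative, not the paper's criterion):

```python
import torch

def merge_by_gradient_sensitivity(reasoning_weight, domain_weight, sensitivity, q=0.3):
    """All tensors share one parameter's shape; `sensitivity` is e.g. |grad|
    accumulated on a calibration set. Low-sensitivity entries keep the
    reasoning model's values, the rest take the domain specialist's."""
    cutoff = torch.quantile(sensitivity.float().flatten(), q)
    keep_reasoning = sensitivity <= cutoff   # reasoning-critical regions
    return torch.where(keep_reasoning, reasoning_weight, domain_weight)
```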

Result: Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes “Reasoning + X” capabilities, significantly outperforming state-of-the-art baselines while retaining robust reasoning performance.

Conclusion: ReasonAny provides a training-free solution to the “Reasoning + X” challenge by addressing the fundamental issue of reasoning-domain performance collapse through novel gradient-based analysis, enabling domain-specialized models to acquire reasoning capabilities without compromising their core functions.

Abstract: Large Reasoning Models (LRMs) with long chain-of-thought reasoning have recently achieved remarkable success. Yet, equipping domain-specialized models with such reasoning capabilities, referred to as “Reasoning + X”, remains a significant challenge. While model merging offers a promising training-free solution, existing methods often suffer from a destructive performance collapse: they tend to both weaken reasoning depth and compromise domain-specific utility. Interestingly, we identify a counter-intuitive phenomenon underlying this failure: reasoning ability predominantly resides in parameter regions with low gradient sensitivity, contrary to the common assumption that domain capabilities correspond to high-magnitude parameters. Motivated by this insight, we propose ReasonAny, a novel merging framework that resolves the reasoning-domain performance collapse through Contrastive Gradient Identification. Experiments across safety, biomedicine, and finance domains show that ReasonAny effectively synthesizes “Reasoning + X” capabilities, significantly outperforming state-of-the-art baselines while retaining robust reasoning performance.

[21] Can large language models interpret unstructured chat data on dynamic group decision-making processes? Evidence on joint destination choice

Sung-Yoo Lim, Koki Sato, Kiyoshi Takami, Giancarlos Parady, Eui-Jin Kim

Main category: cs.CL

TL;DR: LLMs can automate extraction of explicit decision-making factors from group chat data but struggle with nuanced implicit factors, requiring human oversight for complex social activity analysis.

Motivation: Traditional travel surveys struggle to capture complex joint activity-travel decisions in social groups. While unstructured chat data offers insights, manual annotation is labor-intensive and requires understanding of context-dependent meanings shaped by social and cultural norms.

Method: Developed a prompting framework inspired by knowledge acquisition process to guide LLMs in extracting decision-making factors from group chats. Sequentially extracts: group-level restaurant choice set and outcome, individual preferences for each alternative, and specific attributes driving preferences. Evaluated using human-annotated ground truth dataset with quantitative analysis and qualitative error analysis.
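
The sequential extraction can be sketched as a chain of prompts that each emit structured JSON, with earlier outputs fed forward as context; the step wording and schema are assumptions, not the authors' exact framework.

```python
import json

EXTRACTION_STEPS = [
    "List the restaurants the group considered and the one finally chosen.",
    "For each alternative, give each member's stance: like, dislike, or neutral.",
    "For each stated preference, name the driving attribute (price, distance, cuisine).",
]

def extract_decision_factors(chat_log, ask_llm):
    """Convert an unstructured group chat into structured decision-making factors."""
    record = {}
    for i, step in enumerate(EXTRACTION_STEPS, start=1):
        prompt = (f"{chat_log}\n\nExtracted so far: {json.dumps(record)}\n\n"
                  f"Task: {step}\nRespond with JSON only.")
        record[f"step_{i}"] = json.loads(ask_llm(prompt))
    return record
```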

Result: LLMs reliably capture explicit decision-making factors but struggle to identify nuanced implicit factors that human annotators readily identify. The study identifies specific contexts where LLM-based extraction can be trusted versus where human oversight remains essential.

Conclusion: LLMs show potential for automating analysis of non-traditional data sources on social activities but have limitations in capturing implicit factors. Human oversight remains essential for nuanced decision-making analysis, highlighting both the promise and constraints of LLM-based approaches.

Abstract: Social activities result from complex joint activity-travel decisions between group members. While observing the decision-making process of these activities is difficult via traditional travel surveys, the advent of new types of data, such as unstructured chat data, can help shed some light on these complex processes. However, interpreting these decision-making processes requires inferring both explicit and implicit factors. This typically involves the labor-intensive task of manually annotating dialogues to capture context-dependent meanings shaped by social and cultural norms. This study evaluates the potential of Large Language Models (LLMs) to automate and complement human annotation in interpreting decision-making processes from group chats, using data on joint eating-out activities in Japan as a case study. We designed a prompting framework inspired by the knowledge acquisition process, which sequentially extracts key decision-making factors, including the group-level restaurant choice set and outcome, individual preferences of each alternative, and the specific attributes driving those preferences. This structured process guides the LLM to interpret group chat data, converting unstructured dialogues into structured tabular data describing decision-making factors. To evaluate LLM-driven outputs, we conduct a quantitative analysis using a human-annotated ground truth dataset and a qualitative error analysis to examine model limitations. Results show that while the LLM reliably captures explicit decision-making factors, it struggles to identify nuanced implicit factors that human annotators readily identified. We pinpoint specific contexts where LLM-based extraction can be trusted versus where human oversight remains essential. These findings highlight both the potential and limitations of LLM-based analysis for incorporating non-traditional data sources on social activities.

[22] ACR: Adaptive Context Refactoring via Context Refactoring Operators for Multi-Turn Dialogue

Jiawei Shen, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui, Qingyu Niu, Guoqing Ma, Yidan Liang, Jingjiang Liu, Yiling Wang, Shimin Di, Jiajie Xu

Main category: cs.CL

TL;DR: ACR Framework dynamically refactors dialogue context to address contextual inertia and state drift in LLMs, improving multi-turn dialogue performance while reducing token usage.

DetailsMotivation: LLMs struggle with maintaining alignment across long multi-turn dialogues, facing issues like contextual inertia (stuck in outdated context) and state drift (drifting from established facts). Existing approaches (extended context windows, external memory, compression) fail to adequately address these core problems.

Method: Proposes Adaptive Context Refactoring (ACR) Framework with: 1) Library of context refactoring operators to dynamically reshape interaction history, 2) Teacher-guided self-evolving training paradigm that learns when to intervene and how to refactor, decoupling context management from reasoning.
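
To make the two ingredients concrete, here is an illustrative stand-in: a tiny operator library plus a heuristic trigger. In ACR both the trigger ("when to intervene") and the operator choice ("how to refactor") are learned, so everything below is an assumption for illustration only.

```python
# Two toy context-refactoring operators over a list of (role, text) turns,
# plus a heuristic trigger. ACR learns both; this is illustrative only.
OPERATORS = {
    "summarize_old_turns": lambda turns: (
        [("system", "Summary of earlier turns: ...")] + turns[-6:]),
    "pin_established_facts": lambda turns: (
        [("system", "Established facts: ...")] + turns),
}

def maybe_refactor(turns, max_turns=20):
    if len(turns) > max_turns:  # stand-in for the learned "when to intervene"
        turns = OPERATORS["summarize_old_turns"](turns)
    return turns
```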

Result: Extensive experiments show ACR significantly outperforms existing baselines in multi-turn dialogue while reducing token consumption.

Conclusion: ACR effectively addresses contextual inertia and state drift through dynamic context monitoring and refactoring, offering a promising approach for improving LLM performance in long multi-turn dialogues.

Abstract: Large Language Models (LLMs) have shown remarkable performance in multi-turn dialogue. However, in multi-turn dialogue, models still struggle to stay aligned with what has been established earlier, follow dependencies across many turns, and avoid drifting into incorrect facts as the interaction grows longer. Existing approaches primarily focus on extending the context window, introducing external memory, or applying context compression, yet these methods still face limitations such as contextual inertia and state drift. To address these challenges, we propose the Adaptive Context Refactoring (ACR) Framework, which dynamically monitors and reshapes the interaction history to actively mitigate contextual inertia and state drift. ACR is built on a library of context refactoring operators and a teacher-guided self-evolving training paradigm that learns when to intervene and how to refactor, thereby decoupling context management from the reasoning process. Extensive experiments on multi-turn dialogue demonstrate that our method significantly outperforms existing baselines while reducing token consumption.

[23] Leveraging Large Language Models for Data Augmentation in Legal Information Extraction

Nguyen Minh Phuong, Ha-Thanh Nguyen, May Myo Zin, Ken Satoh

Main category: cs.CL

TL;DR: LLM-based pipeline for legal domain data augmentation in Information Extraction tasks, reducing manual annotation effort and improving system robustness.

DetailsMotivation: To address the high manual effort required for data annotation in legal domain Information Extraction tasks and improve system robustness through data augmentation.

Method: Proposes a pipeline leveraging Large Language Models (LLMs) for data augmentation in Information Extraction tasks within the legal domain.

Result: The method is both simple and effective, significantly reducing manual annotation effort while enhancing the robustness of Information Extraction systems.

Conclusion: The proposed LLM-based data augmentation method is generalizable and applicable to various NLP tasks beyond the legal domain.

Abstract: In this paper, we propose a pipeline leveraging Large Language Models (LLMs) for data augmentation in Information Extraction tasks within the legal domain. The proposed method is both simple and effective, significantly reducing the manual effort required for data annotation while enhancing the robustness of Information Extraction systems. Furthermore, the method is generalizable, making it applicable to various Natural Language Processing (NLP) tasks beyond the legal domain.

[24] Text Detoxification in isiXhosa and Yorùbá: A Cross-Lingual Machine Learning Approach for Low-Resource African Languages

Abayomi O. Agbeyangi

Main category: cs.CL

TL;DR: This paper presents a hybrid approach for automatic text detoxification in low-resource African languages (isiXhosa and Yorùbá), combining interpretable toxicity detection with controlled rewriting, achieving effective detoxification while preserving non-toxic content.

DetailsMotivation: Toxic language is a major barrier to safe online participation, but robust mitigation tools are scarce for African languages. The study addresses this critical gap by focusing on automatic text detoxification for low-resource African languages.

Method: A novel hybrid methodology: 1) lightweight TF-IDF and Logistic Regression model for transparent toxicity detection, and 2) controlled lexicon- and token-guided rewriting component. A parallel corpus of toxic to neutral rewrites was developed to train and evaluate the model.
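
The detection stage maps directly onto standard library components; a minimal sketch with assumed toy data follows (the real system is trained on the parallel isiXhosa/Yorùbá corpus, and the lexicon is curated per language).

```python
# (1) interpretable TF-IDF + Logistic Regression detector,
# (2) lexicon-guided rewriting applied only to sentences flagged as toxic.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["example toxic sentence", "example neutral sentence"]  # toy data
labels = [1, 0]
detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000)).fit(texts, labels)

LEXICON = {"toxicword": "neutralword"}  # hypothetical toxic -> neutral entries

def detoxify(sentence: str) -> str:
    if detector.predict([sentence])[0] == 1:  # rewrite only flagged sentences
        return " ".join(LEXICON.get(t.lower(), t) for t in sentence.split())
    return sentence  # non-toxic sentences are preserved unchanged
```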

Result: Detection component achieved stratified K-fold accuracies of 61-72% (isiXhosa) and 72-86% (Yorùbá), with per-language ROC-AUCs up to 0.88. The rewriting component successfully detoxified all detected toxic sentences while preserving 100% of non-toxic sentences.

Conclusion: Scalable, interpretable machine learning detectors combined with rule-based edits offer a competitive and resource-efficient solution for culturally adaptive safety tooling, setting a new benchmark for low-resource Text Style Transfer in African languages.

Abstract: Toxic language is one of the major barriers to safe online participation, yet robust mitigation tools are scarce for African languages. This study addresses this critical gap by investigating automatic text detoxification (toxic to neutral rewriting) for two low-resource African languages, isiXhosa and Yorùbá. The work contributes a novel, pragmatic hybrid methodology: a lightweight, interpretable TF-IDF and Logistic Regression model for transparent toxicity detection, and a controlled lexicon- and token-guided rewriting component. A parallel corpus of toxic to neutral rewrites, which captures idiomatic usage, diacritics, and code switching, was developed to train and evaluate the model. The detection component achieved stratified K-fold accuracies of 61-72% (isiXhosa) and 72-86% (Yorùbá), with per-language ROC-AUCs up to 0.88. The rewriting component successfully detoxified all detected toxic sentences while preserving 100% of non-toxic sentences. These results demonstrate that scalable, interpretable machine learning detectors combined with rule-based edits offer a competitive and resource-efficient solution for culturally adaptive safety tooling, setting a new benchmark for low-resource Text Style Transfer (TST) in African languages.

[25] GIFT: Games as Informal Training for Generalizable LLMs

Nuoyan Lyu, Bingbing Xu, Weihao Meng, Yige Yuan, Yang Zhang, Zhiyong Huang, Tat-Seng Chua, Huawei Shen

Main category: cs.CL

TL;DR: LLMs lack practical wisdom and generalizable intelligence despite formal learning success. The paper proposes using games as informal learning environments with a Nested Training Framework to prevent task interference and enhance generalization.

DetailsMotivation: LLMs excel at formal tasks like math and coding but struggle with practical wisdom, strategic creativity, and social reasoning - human-like cognitive abilities. This gap exists because LLMs lack informal learning, which relies on interactive feedback rather than goal-oriented instruction.

Method: Proposes using Games as primary environments for LLM informal learning, leveraging their intrinsic rewards and abstracted complexity. Introduces Nested Training Framework with sequential task composition (explicit “AND” objective) instead of naive task mixing (implicit “OR” objective). Uses GRPO-based reinforcement learning across Matrix Games, TicTacToe, and Who’s the Spy games.
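
The "OR" versus "AND" objective contrast can be made concrete with two toy reward aggregators; the functions below are a hedged illustration of the idea, not the paper's GRPO reward code.

```python
def mixed_reward(game_rewards):    # naive mixing: implicit "OR" objective
    return sum(game_rewards) / len(game_rewards)

def nested_reward(game_rewards):   # sequential composition: explicit "AND"
    total = 1.0
    for r in game_rewards:
        total *= r                 # failing any game zeroes the episode reward
    return total

print(mixed_reward([1.0, 0.0, 1.0]), nested_reward([1.0, 0.0, 1.0]))  # ~0.67 vs 0.0
```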

Result: Game-based informal learning prevents task interference and significantly enhances model generalization across broad ability-oriented benchmarks. The framework successfully cultivates diverse competencies through interactive feedback.

Conclusion: Games provide effective environments for LLM informal learning, and the Nested Training Framework enables simultaneous mastery of multiple abilities, bridging the gap between formal learning success and practical wisdom needed for human-like generalizable intelligence.

Abstract: While Large Language Models (LLMs) have achieved remarkable success in formal learning tasks such as mathematics and code generation, they still struggle with the “practical wisdom” and generalizable intelligence, such as strategic creativity and social reasoning, that characterize human cognition. This gap arises from a lack of informal learning, which thrives on interactive feedback rather than goal-oriented instruction. In this paper, we propose treating Games as a primary environment for LLM informal learning, leveraging their intrinsic reward signals and abstracted complexity to cultivate diverse competencies. To address the performance degradation observed in multi-task learning, we introduce a Nested Training Framework. Unlike naive task mixing optimizing an implicit “OR” objective, our framework employs sequential task composition to enforce an explicit “AND” objective, compelling the model to master multiple abilities simultaneously to achieve maximal rewards. Using GRPO-based reinforcement learning across Matrix Games, TicTacToe, and Who’s the Spy games, we demonstrate that integrating game-based informal learning not only prevents task interference but also significantly bolsters the model’s generalization across broad ability-oriented benchmarks. The framework and implementation are publicly available.

[26] Multilingual Amnesia: On the Transferability of Unlearning in Multilingual LLMs

Alireza Dehghanpour Farashah, Aditi Khandelwal, Marylou Fauchard, Zhuan Shi, Negar Rostamzadeh, Golnoosh Farnadi

Main category: cs.CL

TL;DR: The paper studies multilingual unlearning in LLMs, extending benchmarks to 10 languages and finding that unlearning is more stable in high-resource languages with syntactic similarity being the strongest predictor of cross-lingual transfer.

DetailsMotivation: Multilingual LLMs present unique safety and fairness challenges across diverse linguistic contexts, but existing unlearning research focuses mainly on monolingual (English) settings, ignoring complexities of cross-lingual knowledge transfer and biases in multilingual environments.

Method: The study uses the Aya-Expanse 8B model to examine multilingual unlearning in two settings: data unlearning and concept unlearning. They extend benchmarks for factual knowledge and stereotypes to 10 languages (English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, Indonesian) through translation, covering five language families and varying resource levels.
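
The summary does not name the unlearning algorithm used, so as an illustration here is one standard data-unlearning baseline, gradient ascent on the forget set, sketched for a HuggingFace-style causal LM; `model` and `forget_batches` are assumed to exist.

```python
import torch

def gradient_ascent_unlearn(model, forget_batches, lr=1e-5, max_steps=10):
    """Maximize the LM loss on forget-set batches (a common baseline)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    model.train()
    for _, batch in zip(range(max_steps), forget_batches):
        # batch: dict with input_ids / attention_mask / labels (HF-style)
        loss = model(**batch).loss  # causal-LM loss on data to be forgotten
        (-loss).backward()          # ascend instead of descend
        opt.step()
        opt.zero_grad()
```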

Result: Unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Linguistic distance analysis shows that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.

Conclusion: Multilingual unlearning exhibits language-dependent patterns where resource availability and syntactic relationships significantly influence unlearning stability and cross-lingual transfer, highlighting the need for language-aware approaches to safety and fairness in multilingual LLMs.

Abstract: As multilingual large language models become more widely used, ensuring their safety and fairness across diverse linguistic contexts presents unique challenges. While existing research on machine unlearning has primarily focused on monolingual settings, typically English, multilingual environments introduce additional complexities due to cross-lingual knowledge transfer and biases embedded in both pretraining and fine-tuning data. In this work, we study multilingual unlearning using the Aya-Expanse 8B model under two settings: (1) data unlearning and (2) concept unlearning. We extend benchmarks for factual knowledge and stereotypes to ten languages through translation: English, French, Arabic, Japanese, Russian, Farsi, Korean, Hindi, Hebrew, and Indonesian. These languages span five language families and a wide range of resource levels. Our experiments show that unlearning in high-resource languages is generally more stable, with asymmetric transfer effects observed between typologically related languages. Furthermore, our analysis of linguistic distances indicates that syntactic similarity is the strongest predictor of cross-lingual unlearning behavior.

[27] A Framework for Personalized Persuasiveness Prediction via Context-Aware User Profiling

Sejun Park, Yoonah Park, Jongwon Lim, Yohan Jo

Main category: cs.CL

TL;DR: Proposes a context-aware user profiling framework with query generator and profiler components to improve persuasiveness prediction by leveraging user history, achieving up to +13.77%p F1 score improvement.

DetailsMotivation: Current persuasiveness prediction methods lack systematic frameworks to effectively leverage users' past activities (conversations, values, reasoning styles) for personalized prediction, despite the importance of considering target persuadee characteristics.

Method: Two-component framework: 1) Query generator that creates optimal queries to retrieve persuasion-relevant records from user history, 2) Profiler that summarizes retrieved records into context-dependent profiles to inform persuasiveness prediction models.
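
The two trainable components compose into a simple prediction pipeline; the sketch below uses assumed function handles for the trained query generator, retriever, profiler, and predictor.

```python
def predict_persuasiveness(message, user_history,
                           gen_query, retrieve, profile, predictor):
    query = gen_query(message)                    # trained query generator
    records = retrieve(query, user_history, k=5)  # persuasion-relevant history
    user_profile = profile(records)               # context-dependent summary
    return predictor(message, user_profile)       # persuasiveness estimate
```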

Result: Evaluation on ChangeMyView Reddit dataset shows consistent improvements over existing methods across multiple predictor models, with gains up to +13.77%p in F1 score. Analysis reveals effective profiles are context-dependent and predictor-specific rather than static.

Conclusion: Task-oriented, context-dependent user profiling is crucial for personalized persuasiveness prediction, moving beyond static attributes or surface-level similarity to create dynamic, predictor-specific user representations.

Abstract: Estimating the persuasiveness of messages is critical in various applications, from recommender systems to safety assessment of LLMs. While it is imperative to consider the target persuadee’s characteristics, such as their values, experiences, and reasoning styles, there is currently no established systematic framework to optimize leveraging a persuadee’s past activities (e.g., conversations) to the benefit of a persuasiveness prediction model. To address this problem, we propose a context-aware user profiling framework with two trainable components: a query generator that generates optimal queries to retrieve persuasion-relevant records from a user’s history, and a profiler that summarizes these records into a profile to effectively inform the persuasiveness prediction model. Our evaluation on the ChangeMyView Reddit dataset shows consistent improvements over existing methods across multiple predictor models, with gains of up to +13.77%p in F1 score. Further analysis shows that effective user profiles are context-dependent and predictor-specific, rather than relying on static attributes or surface-level similarity. Together, these results highlight the importance of task-oriented, context-dependent user profiling for personalized persuasiveness prediction.

[28] Stephanie2: Thinking, Waiting, and Making Decisions Like Humans in Step-by-Step AI Social Chat

Hao Yang, Hongyuan Lu, Dingkang Yang, Wenliang Yang, Peng Sun, Xiaochuan Zhang, Jun Xiao, Kefan He, Wai Lam, Yang Liu, Xinhua Zeng

Main category: cs.CL

TL;DR: Stephanie2 is a step-wise dialogue agent with active waiting and message-pace adaptation that outperforms Stephanie1 in naturalness and engagement.

DetailsMotivation: Existing step-by-step AI chatting systems lack active waiting mechanisms and exhibit unnatural message pacing, making them less natural in human-like instant messaging conversations.

Method: Stephanie2 uses active waiting and message-pace adaptation, explicitly deciding at each step whether to send or wait, and models latency as thinking time plus typing time. A time-window-based dual-agent dialogue system generates pseudo dialogue histories for evaluation.
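
The latency model and the per-step send-or-wait decision can be sketched in a few lines; the typing speed and the decision rule below are illustrative assumptions (in Stephanie2 the agent itself makes the decision).

```python
CHARS_PER_SECOND = 6.0  # assumed typing speed

def latency(message: str, thinking_time: float) -> float:
    return thinking_time + len(message) / CHARS_PER_SECOND  # think + type

def step(pending_message, partner_is_typing: bool) -> str:
    if pending_message is None or partner_is_typing:
        return "WAIT"
    return "SEND"

msg = "sounds good, see you at 7!"
if step(msg, partner_is_typing=False) == "SEND":
    print(f"send after {latency(msg, thinking_time=1.2):.1f}s")  # ~5.5s
```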

Result: Stephanie2 clearly outperforms Stephanie1 on metrics such as naturalness and engagement, and achieves a higher pass rate in the human evaluation based on the role identification Turing test.

Conclusion: Stephanie2 represents a novel next-generation step-wise decision-making dialogue agent that addresses pacing issues in AI chat systems, achieving more natural human-like conversation flow.

Abstract: Instant-messaging human social chat typically progresses through a sequence of short messages. Existing step-by-step AI chatting systems typically split a one-shot generation into multiple messages and send them sequentially, but they lack an active waiting mechanism and exhibit unnatural message pacing. In order to address these issues, we propose Stephanie2, a novel next-generation step-wise decision-making dialogue agent. With active waiting and message-pace adaptation, Stephanie2 explicitly decides at each step whether to send or wait, and models latency as the sum of thinking time and typing time to achieve more natural pacing. We further introduce a time-window-based dual-agent dialogue system to generate pseudo dialogue histories for human and automatic evaluations. Experiments show that Stephanie2 clearly outperforms Stephanie1 on metrics such as naturalness and engagement, and achieves a higher pass rate on human evaluation with the role identification Turing test.

[29] Afri-MCQA: Multimodal Cultural Question Answering for African Languages

Atnafu Lambebo Tonja, Srija Anand, Emilio Villa-Cueva, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Muhidin A. Mohamed, Debela Desalegn Yadeta, Negasi Haile Abadi, Abigail Oppong, Nnaemeka Casmir Obiefuna, Idris Abdulmumin, Naome A Etori, Eric Peter Wairagala, Kanda Patrick Tshinu, Imanigirimbabazi Emmanuel, Gabofetswe Malema, Alham Fikri Aji, David Ifeoluwa Adelani, Thamar Solorio

Main category: cs.CL

TL;DR: Afri-MCQA is the first multilingual cultural QA benchmark covering 15 African languages across text and speech, created by native speakers, revealing poor LLM performance on African languages and cultural knowledge.

DetailsMotivation: Africa has over one-third of the world's languages but is underrepresented in AI research, creating a need for culturally grounded benchmarks to evaluate and improve AI systems for African languages.

Method: Created Afri-MCQA benchmark with 7.5k Q&A pairs across 15 African languages from 12 countries, offering parallel English-African language pairs across text and speech modalities, entirely created by native speakers. Conducted benchmarking experiments with LLMs including control experiments to separate linguistic competence from cultural knowledge.

Result: Open-weight LLMs perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. Significant performance gaps exist between native languages and English for both text and speech modalities.

Conclusion: Findings highlight the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. The Afri-MCQA benchmark is released to support more inclusive multimodal AI development in African languages.

Abstract: Africa is home to over one-third of the world’s languages, yet remains underrepresented in AI research. We introduce Afri-MCQA, the first Multilingual Cultural Question-Answering benchmark covering 7.5k Q&A pairs across 15 African languages from 12 countries. The benchmark offers parallel English-African language Q&A pairs across text and speech modalities and was entirely created by native speakers. Benchmarking large language models (LLMs) on Afri-MCQA shows that open-weight models perform poorly across evaluated cultures, with near-zero accuracy on open-ended VQA when queried in native language or speech. To evaluate linguistic competence, we include control experiments meant to assess this specific aspect separately from cultural knowledge, and we observe significant performance gaps between native languages and English for both text and speech. These findings underscore the need for speech-first approaches, culturally grounded pretraining, and cross-lingual cultural transfer. To support more inclusive multimodal AI development in African languages, we release Afri-MCQA under academic license or CC BY-NC 4.0 on HuggingFace (https://huggingface.co/datasets/Atnafu/Afri-MCQA).

[30] Multimodal In-context Learning for ASR of Low-resource Languages

Zhaolin Li, Jan Niehues

Main category: cs.CL

TL;DR: Speech LLMs can learn unseen endangered languages via multimodal in-context learning (MICL), improving ASR performance without target-language training data.

DetailsMotivation: ASR covers few languages due to data scarcity; existing ICL methods focus on high-resource languages and text-only settings, leaving unseen languages underserved.

Method: Use speech LLMs (Phi-4, Qwen3-Omni) with MICL on 3 endangered languages; analyze attention patterns; combine stronger acoustic model with speech LLM via MICL-based hypothesis selection.
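
The final combination step is essentially n-best rescoring; a minimal sketch follows, where `speech_llm_score` is an assumed API for scoring a hypothesis given multimodal in-context examples.

```python
def select_hypothesis(nbest, icl_examples, speech_llm_score):
    """Pick the acoustic-model hypothesis the speech LLM scores highest."""
    # speech_llm_score(examples, hypothesis) -> log-probability (assumed API)
    return max(nbest, key=lambda hyp: speech_llm_score(icl_examples, hyp))

# usage: best = select_hypothesis(acoustic_nbest, micl_examples, scorer)
```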

Result: MICL effectively learns unseen languages; cross-lingual transfer improves efficiency; attention shows layer-dependent modality preferences; MICL-based ASR system outperforms prompt-based approaches.

Conclusion: Speech LLMs can learn unseen languages via MICL, enabling ASR for endangered languages without supervised data; cross-lingual transfer matches/exceeds corpus-trained models.

Abstract: Automatic speech recognition (ASR) still covers only a small fraction of the world’s languages, mainly due to supervised data scarcity. In-context learning (ICL) with large language models (LLMs) addresses this problem, but prior work largely focuses on high-resource languages covered during training and text-only settings. This paper investigates whether speech LLMs can learn unseen languages with multimodal ICL (MICL), and how this learning can be used to improve ASR. We conduct experiments with two speech LLMs, Phi-4 and Qwen3-Omni, on three diverse endangered languages. Firstly, we find that MICL is effective for unseen languages, leveraging both speech and text modalities. We further show that cross-lingual transfer learning improves MICL efficiency on target languages without training on them. Moreover, we analyze attention patterns to interpret MICL mechanisms, and we observe layer-dependent preferences between audio and text context, with an overall bias towards text. Finally, we show that prompt-based ASR with speech LLMs performs poorly on unseen languages, motivating a simple ASR system that combines a stronger acoustic model with a speech LLM via MICL-based selection of acoustic hypotheses. Results show that MICL consistently improves ASR performance, and that cross-lingual transfer learning matches or outperforms corpus-trained language models without using target-language data. Our code is publicly available.

[31] Visualising Information Flow in Word Embeddings with Diffusion Tensor Imaging

Thomas Fabian

Main category: cs.CL

TL;DR: The paper presents a novel tool using diffusion tensor imaging (DTI) on word embeddings to analyze information flow in natural language expressions within LLMs, going beyond single-word analysis to examine contextual language processing.

DetailsMotivation: Existing methods for analyzing LLM representations focus on single word embeddings visualized as points, which ignores the context in which words are used and doesn't capture how information flows through natural language expressions.

Method: The authors apply diffusion tensor imaging (DTI), a technique from medical imaging, to word embeddings to track information flow between words in natural language expressions across different layers of LLMs.
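
The summary does not give the exact construction, so the sketch below is only a plausible analogue: treat layer-to-layer displacement vectors of a token's hidden state as local "diffusion" directions, build a tensor from their outer products, and summarize directedness with fractional anisotropy (FA), DTI's standard statistic (here the 3-D formula is applied generically to higher dimensions).

```python
import numpy as np

def flow_tensor(hidden_states):              # hidden_states: (layers, dim)
    deltas = np.diff(hidden_states, axis=0)  # movement between layers
    return sum(np.outer(d, d) for d in deltas) / len(deltas)

def fractional_anisotropy(tensor):
    lam = np.linalg.eigvalsh(tensor)
    return float(np.sqrt(1.5 * ((lam - lam.mean()) ** 2).sum() / (lam ** 2).sum()))

h = np.cumsum(np.random.randn(13, 8), axis=0)  # toy 12-step trajectory in 8-d
print(fractional_anisotropy(flow_tensor(h)))   # 0 = isotropic, near 1 = directed
```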

Result: DTI reveals information flow patterns between word embeddings, allows comparison of different model structures, identifies opportunities for pruning under-utilized layers, and shows differences in information flows for specific tasks like pronoun resolution and metaphor detection.

Conclusion: The DTI-based model provides novel insights into how LLMs represent actual natural language expressions, extending beyond isolated word embedding analysis and improving the interpretability of NLP models.

Abstract: Understanding how large language models (LLMs) represent natural language is a central challenge in natural language processing (NLP) research. Many existing methods extract word embeddings from an LLM, visualise the embedding space via point-plots, and compare the relative positions of certain words. However, this approach only considers single words and not whole natural language expressions, thus disregards the context in which a word is used. Here we present a novel tool for analysing and visualising information flow in natural language expressions by applying diffusion tensor imaging (DTI) to word embeddings. We find that DTI reveals how information flows between word embeddings. Tracking information flows within the layers of an LLM allows for comparing different model structures and revealing opportunities for pruning an LLM’s under-utilised layers. Furthermore, our model reveals differences in information flows for tasks like pronoun resolution and metaphor detection. Our results show that our model permits novel insights into how LLMs represent actual natural language expressions, extending the comparison of isolated word embeddings and improving the interpretability of NLP models.

[32] Analysing Differences in Persuasive Language in LLM-Generated Text: Uncovering Stereotypical Gender Patterns

Amalie Brogaard Pauli, Maria Barrett, Max Müller-Eberstein, Isabelle Augenstein, Ira Assent

Main category: cs.CL

TL;DR: LLMs generate gender-biased persuasive language that reflects social stereotypes when drafting interpersonal messages.

DetailsMotivation: As LLMs are increasingly used for everyday communication tasks including drafting persuasive messages, it's essential to understand how user instructions affect persuasive language generation and whether generated language differs when targeting different groups.

Method: Proposed a framework to evaluate how persuasive language generation is affected by recipient gender, sender intent, or output language. Evaluated 13 LLMs and 16 languages using pairwise prompt instructions, and assessed model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science.

Result: Revealed significant gender differences in persuasive language generated across all models, with patterns reflecting biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.

Conclusion: LLMs exhibit systematic gender biases in persuasive language generation that mirror real-world social stereotypes, highlighting the need for careful consideration of how these models are deployed for interpersonal communication tasks.

Abstract: Large language models (LLMs) are increasingly used for everyday communication tasks, including drafting interpersonal messages intended to influence and persuade. Prior work has shown that LLMs can successfully persuade humans and amplify persuasive language. It is therefore essential to understand how user instructions affect the generation of persuasive language, and to understand whether the generated persuasive language differs, for example, when targeting different groups. In this work, we propose a framework for evaluating how persuasive language generation is affected by recipient gender, sender intent, or output language. We evaluate 13 LLMs and 16 languages using pairwise prompt instructions. We evaluate model responses on 19 categories of persuasive language using an LLM-as-judge setup grounded in social psychology and communication science. Our results reveal significant gender differences in the persuasive language generated across all models. These patterns reflect biases consistent with gender-stereotypical linguistic tendencies documented in social psychology and sociolinguistics.

[33] AutoMonitor-Bench: Evaluating the Reliability of LLM-Based Misbehavior Monitor

Shu Yang, Jingyu Hu, Tong Li, Hanqi Yan, Wenxuan Wang, Di Wang

Main category: cs.CL

TL;DR: AutoMonitor-Bench is the first benchmark for evaluating LLM-based misbehavior monitors across diverse tasks and failure modes, revealing reliability challenges and safety-utility trade-offs.

DetailsMotivation: There's a need to systematically evaluate the reliability of LLM-based misbehavior monitors across different tasks and failure modes, as current monitoring approaches lack comprehensive benchmarking.

Method: Created AutoMonitor-Bench with 3,010 annotated test samples across question answering, code generation, and reasoning tasks. Evaluated 22 LLMs using Miss Rate and False Alarm Rate metrics. Also built a 153,581-sample training corpus to fine-tune Qwen3-4B-Instruction for investigating training effects.
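
The two metrics are straightforward to compute from paired monitor flags and gold labels; the data below is illustrative.

```python
def miss_rate(preds, labels):        # fraction of true misbehaviors not flagged
    mis = [p for p, y in zip(preds, labels) if y == 1]
    return sum(1 for p in mis if p == 0) / len(mis)

def false_alarm_rate(preds, labels): # fraction of benign cases wrongly flagged
    ben = [p for p, y in zip(preds, labels) if y == 0]
    return sum(1 for p in ben if p == 1) / len(ben)

preds, labels = [1, 0, 0, 1], [1, 1, 0, 0]
print(miss_rate(preds, labels), false_alarm_rate(preds, labels))  # 0.5 0.5
```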

Result: Found substantial variability in monitoring performance across models, consistent trade-off between MR and FAR (safety-utility tension), and that training on known misbehavior datasets doesn’t necessarily improve performance on unseen, implicit misbehaviors.

Conclusion: Reliable, scalable misbehavior monitoring remains challenging, highlighting the need for future work on task-aware designing and training strategies for LLM-based monitors.

Abstract: We introduce AutoMonitor-Bench, the first benchmark designed to systematically evaluate the reliability of LLM-based misbehavior monitors across diverse tasks and failure modes. AutoMonitor-Bench consists of 3,010 carefully annotated test samples spanning question answering, code generation, and reasoning, with paired misbehavior and benign instances. We evaluate monitors using two complementary metrics: Miss Rate (MR) and False Alarm Rate (FAR), capturing failures to detect misbehavior and oversensitivity to benign behavior, respectively. Evaluating 12 proprietary and 10 open-source LLMs, we observe substantial variability in monitoring performance and a consistent trade-off between MR and FAR, revealing an inherent safety-utility tension. To further explore the limits of monitor reliability, we construct a large-scale training corpus of 153,581 samples and fine-tune Qwen3-4B-Instruction to investigate whether training on known, relatively easy-to-construct misbehavior datasets improves monitoring performance on unseen and more implicit misbehaviors. Our results highlight the challenges of reliable, scalable misbehavior monitoring and motivate future work on task-aware designing and training strategies for LLM-based monitors.

[34] One Script Instead of Hundreds? On Pretraining Romanized Encoder Language Models

Benedikt Ebing, Lennart Keller, Goran Glavaš

Main category: cs.CL

TL;DR: Romanization improves cross-lingual transfer by exposing lexical overlap, but its effectiveness varies by script type - works well for segmental scripts but harms morphosyllabic scripts like Chinese/Japanese, with no evidence of negative interference from increased vocabulary overlap.

DetailsMotivation: Previous work on romanization focused on favorable setups (high-resource Latin to low-resource non-Latin, or closely related languages), leaving unclear whether romanization harms high-resource languages or is suitable for general-purpose multilingual LMs due to potential information loss.

Method: Pretrained encoder language models from scratch on both romanized and original texts for six typologically diverse high-resource languages. Used two romanizers with different fidelity profiles. Investigated two degradation sources: (1) loss of script-specific information and (2) negative cross-lingual interference from increased vocabulary overlap.
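
The summary does not name the two romanizers, so the sketch below uses the `unidecode` package as a low-fidelity stand-in, just to show where romanization sits in the pipeline: all text is transliterated before the tokenizer ever sees it.

```python
from unidecode import unidecode  # pip install Unidecode

def romanize_corpus(lines):
    for line in lines:
        yield unidecode(line)  # e.g. "Привет" -> "Privet"

# Both the tokenizer and the encoder LM are then trained on
# romanize_corpus(raw_lines) instead of the original-script text.
```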

Result: Negligible performance loss for languages with segmental scripts, but significant degradation for morphosyllabic scripts (Chinese and Japanese) that higher-fidelity romanization mitigates but cannot fully recover. No evidence that increased subword overlap induces negative interference. Romanization improves encoding efficiency (fertility) for segmental scripts at negligible performance cost.

Conclusion: Romanization is beneficial for segmental-script languages but problematic for morphosyllabic scripts due to information loss. Increased vocabulary overlap doesn’t cause negative interference, making romanization a viable strategy for improving cross-lingual transfer in multilingual LMs for appropriate script types.

Abstract: Exposing latent lexical overlap, script romanization has emerged as an effective strategy for improving cross-lingual transfer (XLT) in multilingual language models (mLMs). Most prior work, however, focused on setups that favor romanization the most: (1) transfer from high-resource Latin-script to low-resource non-Latin-script languages and/or (2) between genealogically closely related languages with different scripts. It thus remains unclear whether romanization is a good representation choice for pretraining general-purpose mLMs, or, more precisely, if information loss associated with romanization harms performance for high-resource languages. We address this gap by pretraining encoder LMs from scratch on both romanized and original texts for six typologically diverse high-resource languages, investigating two potential sources of degradation: (i) loss of script-specific information and (ii) negative cross-lingual interference from increased vocabulary overlap. Using two romanizers with different fidelity profiles, we observe negligible performance loss for languages with segmental scripts, whereas languages with morphosyllabic scripts (Chinese and Japanese) suffer degradation that higher-fidelity romanization mitigates but cannot fully recover. Importantly, comparing monolingual LMs with their mLM counterpart, we find no evidence that increased subword overlap induces negative interference. We further show that romanization improves encoding efficiency (i.e., fertility) for segmental scripts at a negligible performance cost.

[35] Simplify-This: A Comparative Analysis of Prompt-Based and Fine-Tuned LLMs

Eilam Cohen, Itamar Bul, Danielle Inbar, Omri Loewenbach

Main category: cs.CL

TL;DR: Comparative study of fine-tuning vs. prompt engineering for text simplification using encoder-decoder LLMs across multiple benchmarks and evaluation metrics.

DetailsMotivation: There's a practical tradeoff between fine-tuning and prompt engineering for LLM-based text simplification, and this study aims to systematically evaluate both paradigms to understand their relative strengths and weaknesses.

Method: Introduces Simplify-This, a comparative study evaluating both fine-tuning and prompt engineering paradigms for text simplification with encoder-decoder LLMs across multiple benchmarks using a range of evaluation metrics.

Result: Fine-tuned models consistently deliver stronger structural simplification, while prompting often attains higher semantic similarity scores but tends to copy inputs. Human evaluation favors fine-tuned outputs overall.

Conclusion: Fine-tuning is generally more effective for text simplification tasks, though prompting has advantages in semantic preservation. The study releases code, datasets, model checkpoints, and prompt templates to support reproducibility and future research.

Abstract: Large language models (LLMs) enable strong text generation, and in general there is a practical tradeoff between fine-tuning and prompt engineering. We introduce Simplify-This, a comparative study evaluating both paradigms for text simplification with encoder-decoder LLMs across multiple benchmarks, using a range of evaluation metrics. Fine-tuned models consistently deliver stronger structural simplification, whereas prompting often attains higher semantic similarity scores yet tends to copy inputs. A human evaluation favors fine-tuned outputs overall. We release code, a cleaned derivative dataset used in our study, checkpoints of fine-tuned models, and prompt templates to facilitate reproducibility and future work.

[36] EnvScaler: Scaling Tool-Interactive Environments for LLM Agent via Programmatic Synthesis

Xiaoshuai Song, Haofei Chang, Guanting Dong, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen

Main category: cs.CL

TL;DR: EnvScaler is an automated framework for scalable tool-interaction environments via programmatic synthesis to train LLMs as agents, addressing limitations of real systems, simulated environments, and manual sandboxes.

DetailsMotivation: LLMs need training to act as agents in real-world environments, but current approaches face limitations: restricted access to real systems, hallucinations in LLM-simulated environments, and poor scalability of manually built sandboxes.

Method: EnvScaler has two components: SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation; ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment.
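
What a synthesized environment skeleton plus a rule-based trajectory validator might look like is sketched below; the class layout and the scenario rule are assumptions, since the summary only describes the two components' roles.

```python
class HotelBookingEnv:  # hypothetical skeleton produced by SkelBuilder
    def __init__(self):
        self.bookings = {}

    def search(self, city):  # tool 1: look up hotels
        return [f"{city}-hotel-1", f"{city}-hotel-2"]

    def book(self, hotel, guest):  # tool 2: place a booking
        self.bookings[guest] = hotel
        return "ok"

def validate_trajectory(env):  # rule-based check emitted by ScenGenerator
    """Scenario rule: the episode must end with a booking for 'alice'."""
    return "alice" in env.bookings
```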

Result: The framework synthesized 191 environments and about 7K scenarios, which were used for SFT and RL training of Qwen3 series models. Results on three benchmarks show significant improvement in LLMs’ ability to solve tasks in complex multi-turn, multi-tool interaction environments.

Conclusion: EnvScaler provides an automated, scalable solution for creating tool-interaction environments to train LLMs as agents, overcoming limitations of existing approaches and demonstrating effectiveness through benchmark improvements.

Abstract: Large language models (LLMs) are expected to be trained to act as agents in various real-world environments, but this process relies on rich and varied tool-interaction sandboxes. However, access to real systems is often restricted; LLM-simulated environments are prone to hallucinations and inconsistencies; and manually built sandboxes are hard to scale. In this paper, we propose EnvScaler, an automated framework for scalable tool-interaction environments via programmatic synthesis. EnvScaler comprises two components. First, SkelBuilder constructs diverse environment skeletons through topic mining, logic modeling, and quality evaluation. Then, ScenGenerator generates multiple task scenarios and rule-based trajectory validation functions for each environment. With EnvScaler, we synthesize 191 environments and about 7K scenarios, and apply them to Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) for Qwen3 series models. Results on three benchmarks show that EnvScaler significantly improves LLMs’ ability to solve tasks in complex environments involving multi-turn, multi-tool interactions. We release our code and data at https://github.com/RUC-NLPIR/EnvScaler.

[37] LLMs as Science Journalists: Supporting Early-stage Researchers in Communicating Their Science to the Public

Milad Alshomary, Grace Li, Anubhav Jangra, Yufang Hou, Kathleen McKeown, Smaranda Muresan

Main category: cs.CL

TL;DR: A framework for training LLMs to act as science journalists helps early-stage researchers learn to communicate their work to the public, outperforming general-purpose LLMs in asking relevant questions about societal impact.

DetailsMotivation: Early-stage researchers need help effectively communicating scientific findings to the public. Existing general-purpose LLMs are not optimally aligned for this specific science communication task.

Method: Proposed a framework for training LLMs to emulate the role of a science journalist. Evaluated trained LLM journalists through conversations with both simulated and human researchers, comparing them to general-purpose LLMs.

Result: LLMs trained with the framework ask more relevant questions addressing societal impact of research, prompting researchers to clarify and elaborate on findings. User study participants preferred interacting with the trained LLM journalist over general-purpose LLMs.

Conclusion: The proposed training framework successfully creates LLMs that function as effective science journalists, helping researchers improve public communication of their work and outperforming general-purpose language models for this specific application.

Abstract: The scientific community needs tools that help early-stage researchers effectively communicate their findings and innovations to the public. Although existing general-purpose Large Language Models (LLMs) can assist in this endeavor, they are not optimally aligned for it. To address this, we propose a framework for training LLMs to emulate the role of a science journalist that can be used by early-stage researchers to learn how to properly communicate their papers to the general public. We evaluate the usefulness of our trained LLM Journalists in leading conversations with both simulated and human researchers. Our experiments indicate that LLMs trained using our framework ask more relevant questions that address the societal impact of research, prompting researchers to clarify and elaborate on their findings. In the user study, the majority of participants preferred interacting with our trained LLM Journalist over interacting with general-purpose LLMs.

[38] Peek2: A Regex-free implementation of pretokenizers for Byte-level BPE

Liu Zai

Main category: cs.CL

TL;DR: Peek2 is a new Regex-free pretokenizer implementation that replaces cl100k-like pretokenizers with 1.11× throughput improvement while maintaining identical presegmentation results.

DetailsMotivation: Current pretokenization in Byte-level BPE tokenizers uses Regex-based approaches which may have performance limitations. There's a need for a faster, safer drop-in replacement that maintains compatibility with existing tokenizers like those used in GPT-3, LLaMa-3, and Qwen-2.5.

Method: Peek2 is a Regex-free algorithm that runs entirely on CPU with stable linear complexity O(n). It serves as a drop-in replacement for cl100k-like pretokenizers, designed with performance and safety in mind while preserving identical presegmentation results.
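
Peek2's algorithm is not described in enough detail here to reproduce, so the sketch below is only a generic single-pass, regex-free character-class segmenter; it shows how O(n) presegmentation without a regex engine can look, but it is far simpler than the cl100k rules (no contractions, no digit grouping, no Unicode categories).

```python
def pretokenize(text: str):
    def kind(c):
        if c.isalpha():
            return "alpha"
        if c.isdigit():
            return "digit"
        if c.isspace():
            return "space"
        return "other"

    pieces, start = [], 0
    for i in range(1, len(text) + 1):  # one pass, no backtracking: O(n)
        if i == len(text) or kind(text[i]) != kind(text[i - 1]):
            pieces.append(text[start:i])
            start = i
    return pieces

print(pretokenize("Hello, world 42"))  # ['Hello', ',', ' ', 'world', ' ', '42']
```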

Result: Peek2 delivers 1.11× improvement in overall throughput across the entire Byte-level BPE encoding process compared to the original Regex-based pretokenizer, while maintaining identical presegmentation results.

Conclusion: Peek2 provides a faster, safer alternative to Regex-based pretokenizers with significant performance improvements while maintaining full compatibility with existing tokenization pipelines used in major language models.

Abstract: Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. Designed with performance and safety in mind, Peek2 is Regex-free and delivers a 1.11× improvement in overall throughput across the entire Byte-level BPE encoding process. This algorithm runs entirely on the CPU, has stable linear complexity O(n), and provides presegmentation results identical to those of the original Regex-based pretokenizer.

[39] Left, Right, or Center? Evaluating LLM Framing in News Classification and Generation

Molly Kennedy, Ali Parker, Yihong Liu, Hinrich Schütze

Main category: cs.CL

TL;DR: LLMs show systematic centrist framing bias in political summarization, with Grok 4 being most ideologically expressive, while Claude Sonnet 4.5 and Llama 3.1 perform best in bias rating.

DetailsMotivation: As LLMs are increasingly used for journalism and text generation, there's concern about political framing biases where subtle wording choices can shape interpretation, potentially influencing public discourse.

Method: Tested nine state-of-the-art LLMs by comparing few-shot ideology predictions against LEFT/CENTER/RIGHT labels, then generated “steered” summaries under FAITHFUL, CENTRIST, LEFT, and RIGHT prompts, scoring all outputs with a single fixed ideology evaluator.

Result: Found pervasive ideological center-collapse in both article-level ratings and generated text, indicating systematic tendency toward centrist framing. Grok 4 was most ideologically expressive generator, while Claude Sonnet 4.5 and Llama 3.1 achieved strongest bias-rating performance among commercial and open-weight models respectively.

Conclusion: LLMs exhibit systematic centrist bias in political framing, raising concerns about their use in journalism where ideological neutrality or expression may be important, with significant variation in ideological expressiveness across different models.

Abstract: Large Language Model (LLM) based summarization and text generation are increasingly used for producing and rewriting text, raising concerns about political framing in journalism where subtle wording choices can shape interpretation. Across nine state-of-the-art LLMs, we study political framing by testing whether LLMs’ classification-based bias signals align with framing behavior in their generated summaries. We first compare few-shot ideology predictions against LEFT/CENTER/RIGHT labels. We then generate “steered” summaries under FAITHFUL, CENTRIST, LEFT, and RIGHT prompts, and score all outputs using a single fixed ideology evaluator. We find pervasive ideological center-collapse in both article-level ratings and generated text, indicating a systematic tendency toward centrist framing. Among evaluated models, Grok 4 is by far the most ideologically expressive generator, while Claude Sonnet 4.5 and Llama 3.1 achieve the strongest bias-rating performance among commercial and open-weight models, respectively.

[40] Semantic NLP Pipelines for Interoperable Patient Digital Twins from Unstructured EHRs

Rafael Brens, Yuqiao Meng, Luoxi Tang, Zhaohan Xi

Main category: cs.CL

TL;DR: Semantic NLP pipeline transforms unstructured EHR notes into FHIR-compliant digital twin representations using NER, concept normalization, and relation extraction.

DetailsMotivation: Digital twins are valuable for personalized healthcare but challenging to generate from unstructured EHRs due to variability in clinical documentation and lack of standardized mappings.

Method: Pipeline uses named entity recognition to extract clinical concepts, concept normalization to map entities to SNOMED-CT/ICD-10 standards, and relation extraction to capture structured associations between conditions, medications, and observations.
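
A minimal sketch of the pipeline's output shape is shown below, with a dictionary lookup standing in for the trained NER and normalization models; the single SNOMED-CT entry is illustrative, and relation extraction is omitted.

```python
SNOMED = {"hypertension": "38341003"}  # illustrative concept -> code entry

def note_to_fhir(note: str):
    resources = []
    for concept, code in SNOMED.items():  # lookup stands in for NER + mapping
        if concept in note.lower():
            resources.append({
                "resourceType": "Condition",  # FHIR Condition resource shape
                "code": {"coding": [{"system": "http://snomed.info/sct",
                                     "code": code,
                                     "display": concept}]},
            })
    return resources

print(note_to_fhir("Pt has a history of hypertension."))
```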

Result: Evaluation on MIMIC-IV Clinical Database Demo shows high F1-scores for entity and relation extraction, with improved schema completeness and interoperability compared to baseline methods.

Conclusion: The semantic NLP pipeline successfully transforms free-text EHR notes into interoperable FHIR-compliant digital twin representations, addressing key challenges in clinical documentation variability.

Abstract: Digital twins – virtual replicas of physical entities – are gaining traction in healthcare for personalized monitoring, predictive modeling, and clinical decision support. However, generating interoperable patient digital twins from unstructured electronic health records (EHRs) remains challenging due to variability in clinical documentation and lack of standardized mappings. This paper presents a semantic NLP-driven pipeline that transforms free-text EHR notes into FHIR-compliant digital twin representations. The pipeline leverages named entity recognition (NER) to extract clinical concepts, concept normalization to map entities to SNOMED-CT or ICD-10, and relation extraction to capture structured associations between conditions, medications, and observations. Evaluation on MIMIC-IV Clinical Database Demo with validation against MIMIC-IV-on-FHIR reference mappings demonstrates high F1-scores for entity and relation extraction, with improved schema completeness and interoperability compared to baseline methods.

[41] Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs

Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta

Main category: cs.CL

TL;DR: MAC is a multimodal auto-completion task that predicts upcoming characters using both text and visual context, outperforming text-only approaches in user satisfaction while requiring efficient routing between models.

DetailsMotivation: Traditional text-only auto-completion fails to leverage visual context in multimodal applications like digital assistants, chatbots, design tools, and healthcare consultations where users share visual information. This limits prediction accuracy and user satisfaction.

Method: 1) Introduced Multimodal Auto-Completion (MAC) task combining text and visual cues; 2) Adapted MMDialog and ImageChat to create benchmark datasets; 3) Evaluated VLMs vs. textual baselines; 4) Proposed Router-Suggest framework that dynamically selects between textual models and VLMs based on dialog context; 5) Created lightweight variant for resource-constrained environments.
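
The routing decision reduces to choosing a model per request; the heuristic trigger below is an illustrative assumption (Router-Suggest learns this from dialog context).

```python
def route(partial_text, has_image, references_image):
    if has_image and references_image(partial_text):
        return "vlm"         # slower multimodal completion
    return "text_model"      # fast text-only completion

refs = lambda t: any(w in t.lower() for w in ("this", "that", "photo", "picture"))
print(route("I love this pho", has_image=True, references_image=refs))  # vlm
```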

Result: Router-Suggest achieves 2.3x to 10x speedup over best-performing VLM. User study shows VLMs significantly outperform textual models in user satisfaction, saving typing effort and improving completion quality in multi-turn conversations.

Conclusion: Multimodal context is essential for effective auto-completion in modern applications. The proposed MAC task and Router-Suggest framework enable smarter, user-aware assistants by balancing accuracy and efficiency through intelligent model selection.

Abstract: Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.

[42] CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

Alexandra Dragomir, Florin Brad, Radu Tudor Ionescu

Main category: cs.CL

TL;DR: CLewR integrates curriculum learning with preference optimization for multilingual machine translation, using easy-to-hard curriculum with restarts to prevent catastrophic forgetting of easy examples.

DetailsMotivation: Current LLM-based multilingual MT approaches using preference optimization overlook the importance of training data order. Curriculum learning could improve performance by strategically ordering training samples.

Method: CLewR (Curriculum Learning with Restarts) - a novel strategy that reiterates easy-to-hard curriculum multiple times during training to mitigate catastrophic forgetting of easy examples. Integrated with various state-of-the-art preference optimization algorithms.
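
The schedule itself is simple to sketch: sort once by difficulty, then replay the whole easy-to-hard pass several times so easy examples are revisited. The difficulty function and the training step are placeholders for whatever the preference-optimization setup provides.

```python
def clewr_schedule(examples, difficulty, num_restarts=3):
    curriculum = sorted(examples, key=difficulty)  # easy -> hard, computed once
    for restart in range(num_restarts):            # restarts revisit easy items
        for ex in curriculum:
            yield restart, ex

# Hypothetical usage: sentence length stands in for a real difficulty score.
data = ["a long and hard example sentence", "short one", "a medium example"]
for restart, ex in clewr_schedule(data, difficulty=len, num_restarts=2):
    pass  # preference_optimization_step(model, ex) would go here
```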

Result: Demonstrated consistent performance gains across multiple model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques.

Conclusion: Curriculum learning with restarts effectively improves multilingual MT performance by addressing catastrophic forgetting, showing consistent benefits across different models and optimization methods.

Abstract: Large language models (LLMs) have demonstrated competitive performance in zero-shot multilingual machine translation (MT). Some follow-up works further improved MT performance via preference optimization, but they leave a key aspect largely underexplored: the order in which data samples are given during training. We address this topic by integrating curriculum learning into various state-of-the-art preference optimization algorithms to boost MT performance. We introduce a novel curriculum learning strategy with restarts (CLewR), which reiterates easy-to-hard curriculum multiple times during training to effectively mitigate the catastrophic forgetting of easy examples. We demonstrate consistent gains across several model families (Gemma2, Qwen2.5, Llama3.1) and preference optimization techniques. We publicly release our code at https://github.com/alexandra-dragomir/CLewR.

[43] What do the metrics mean? A critical analysis of the use of Automated Evaluation Metrics in Interpreting

Jonathan Downie, Joss Moorkens

Main category: cs.CL

TL;DR: Current automated interpreting quality metrics fail to account for communicative context and are insufficient for measuring authentic interpreting quality on their own.

DetailsMotivation: With rapid growth of interpreting technologies (remote interpreting, computer-aided interpreting, automated speech translation, interpreting avatars), there's high demand for quick and efficient quality measurement methods.

Method: Examines recently-proposed automated quality measurement methods and evaluates their suitability for measuring authentic interpreting practice quality.

Result: Automatic metrics as currently proposed cannot account for communicative context and are not viable measures of interpreting quality when used independently.

Conclusion: Context is fundamental to interpreting quality analysis; automated metrics alone cannot adequately measure interpreting quality without considering communicative context.

Abstract: With the growth of interpreting technologies, from remote interpreting and Computer-Aided Interpreting to automated speech translation and interpreting avatars, there is now a high demand for ways to quickly and efficiently measure the quality of any interpreting delivered. A range of approaches to fulfil the need for quick and efficient quality measurement have been proposed, each involving some measure of automation. This article examines these recently proposed quality measurement methods and discusses their suitability for measuring the quality of authentic interpreting practice, whether delivered by humans or machines. It concludes that automatic metrics as currently proposed cannot take into account the communicative context and are thus not viable measures of the quality of any interpreting provision when used on their own. Across all attempts to measure or even categorise quality in Interpreting Studies, the contexts in which interpreting takes place have become fundamental to the final analysis.

[44] FACTUM: Mechanistic Detection of Citation Hallucination in Long-Form RAG

Maxime Dassen, Rebecca Kotula, Kenton Murray, Andrew Yates, Dawn Lawrie, Efsun Kayi, James Mayfield, Kevin Duh

Main category: cs.CL

TL;DR: FACTUM framework identifies citation hallucination signatures in RAG models using mechanistic analysis of attention/FFN pathways, revealing scale-dependent patterns and outperforming baselines by up to 37.5% AUC.

DetailsMotivation: Existing work oversimplifies citation hallucinations in RAG models as mere over-reliance on parametric knowledge. The authors challenge this view and seek to understand the complex internal mechanisms behind citation trustworthiness.

Method: Introduces FACTUM framework with four mechanistic scores measuring distinct contributions of attention and FFN pathways, and alignment between them. Analyzes signatures of correct citations across different model scales.
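
The four scores are not specified in this summary; as an illustration of pathway-level measurement, one can compare the magnitudes and alignment of the attention and FFN contributions to a layer's residual-stream update, which is the kind of quantity FACTUM aggregates.

```python
import numpy as np

def pathway_scores(attn_out, ffn_out):
    """attn_out, ffn_out: (dim,) pathway contributions at one layer (toy)."""
    a, f = np.linalg.norm(attn_out), np.linalg.norm(ffn_out)
    cos = float(attn_out @ ffn_out / (a * f + 1e-9))  # pathway alignment
    return {"attn_norm": float(a), "ffn_norm": float(f), "alignment": cos}

print(pathway_scores(np.random.randn(16), np.random.randn(16)))
```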

Result: Identifies two consistent signatures of correct citation: stronger parametric knowledge contribution and greater use of attention sink for synthesis. Reveals citation signatures evolve with model scale (e.g., Llama-3.2-3B shows higher pathway alignment, Llama-3.1-8B shows lower alignment). FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC.

Conclusion: Citation hallucination is a complex, scale-dependent interplay between internal mechanisms, not just parametric knowledge over-reliance. FACTUM enables more nuanced and reliable RAG systems by capturing these evolving signatures.

Abstract: Retrieval-Augmented Generation (RAG) models are critically undermined by citation hallucinations, a deceptive failure where a model confidently cites a source that fails to support its claim. Existing work often attributes hallucination to a simple over-reliance on the model’s parametric knowledge. We challenge this view and introduce FACTUM (Framework for Attesting Citation Trustworthiness via Underlying Mechanisms), a framework of four mechanistic scores measuring the distinct contributions of a model’s attention and FFN pathways, and the alignment between them. Our analysis reveals two consistent signatures of correct citation: a significantly stronger contribution from the model’s parametric knowledge and greater use of the attention sink for information synthesis. Crucially, we find the signature of a correct citation is not static but evolves with model scale. For example, the signature of a correct citation for the Llama-3.2-3B model is marked by higher pathway alignment, whereas for the Llama-3.1-8B model, it is characterized by lower alignment, where pathways contribute more distinct, orthogonal information. By capturing this complex, evolving signature, FACTUM outperforms state-of-the-art baselines by up to 37.5% in AUC. Our findings reframe citation hallucination as a complex, scale-dependent interplay between internal mechanisms, paving the way for more nuanced and reliable RAG systems.

[45] Continual-learning for Modelling Low-Resource Languages from Large Language Models

Santosh Srinath K, Mudit Somani, Varun Reddy Padala, Prajna Devi Upadhyay, Abhijit Das

Main category: cs.CL

TL;DR: Proposes continual learning with POS-based code-switching and a replay adapter to prevent catastrophic forgetting when adapting LLMs into small language models for low-resource languages.

Motivation: Catastrophic forgetting is a major challenge when adapting large language models to create small language models for low-resource languages in multilingual scenarios.

Method: Uses a continual-learning strategy that combines parts-of-speech (POS)-based code-switching with a replay adapter to mitigate catastrophic forgetting during training.
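
A toy sketch of what POS-based code-switching augmentation can look like, assuming a pre-tagged sentence and a bilingual lexicon; both are hypothetical stand-ins for the paper's actual tagger and resources.

```python
import random

def pos_code_switch(tagged_tokens, lexicon, switch_pos=("NOUN", "VERB"), p=0.3):
    """Toy POS-based code-switching: with probability p, swap tokens of
    selected POS classes for a translation from a bilingual lexicon.
    tagged_tokens is [(token, pos), ...]; inputs are illustrative."""
    out = []
    for token, pos in tagged_tokens:
        if pos in switch_pos and token in lexicon and random.random() < p:
            out.append(lexicon[token])  # code-switched replacement
        else:
            out.append(token)
    return " ".join(out)
```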

Result: Experiments on vision-language tasks (visual question answering) and language modeling tasks demonstrate the effectiveness of the proposed architecture.

Conclusion: The proposed approach effectively addresses catastrophic forgetting when training small language models from large language models for multilingual applications.

Abstract: Building a language model for a multi-lingual scenario involves several potential challenges, among which catastrophic forgetting is the major one. For example, small language models (SLMs) built for low-resource languages by adapting large language models (LLMs) face the challenge of catastrophic forgetting. This work proposes to employ a continual learning strategy using parts-of-speech (POS)-based code-switching along with a replay adapter strategy to mitigate catastrophic forgetting while training an SLM from an LLM. Experiments conducted on vision-language tasks such as visual question answering and on a language modelling task exhibit the success of the proposed architecture.

[46] iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha

Main category: cs.CL

TL;DR: iReasoner is a self-evolving framework that improves large multimodal models’ reasoning by explicitly eliciting chain-of-thought and rewarding internal agreement, achieving gains of up to +2.1 points on multimodal reasoning benchmarks through unsupervised post-training.

Motivation: Existing self-evolving frameworks for large multimodal models mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. There's a need for better reasoning-aware self-improvement in unsupervised settings.

Method: Proposes iReasoner framework with Proposer-Solver loop over unlabeled images. Augments outcome-level intrinsic rewards with trajectory-aware signal defined over intermediate reasoning steps. Uses chain-of-thought elicitation and rewards internal agreement between reasoning paths without ground-truth labels or external judges.

Result: Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. Demonstrates effectiveness of reasoning-aware self-improvement.

Conclusion: iReasoner serves as a starting point for reasoning-aware self-improvement in large multimodal models in purely unsupervised settings, showing that explicit reasoning elicitation and internal agreement rewards can significantly enhance multimodal reasoning capabilities.

Abstract: Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM’s implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer–Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to +2.1 points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings.

[47] Gender Bias in LLMs: Preliminary Evidence from Shared Parenting Scenario in Czech Family Law

Jakub Harasta, Matej Vasina, Martin Kornel, Tomas Foltynek

Main category: cs.CL

TL;DR: LLMs show gender bias in legal advice for divorce scenarios, with different models exhibiting varying patterns of bias in proposed parenting time allocations.

Motivation: As laypeople increasingly use LLMs for legal self-help, there's concern about potential gender bias in their outputs, especially in sensitive family law contexts where biased advice could significantly impact people's lives.

Method: Researchers created an expert-designed divorce scenario based on Czech family law, testing four state-of-the-art LLMs (GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, Llama 3.3) in zero-shot interactions. They used two scenario versions (gendered names vs. neutral labels) and introduced nine legally relevant factors to vary factual circumstances and test influence on proposed shared-parenting ratios.
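
A hedged reconstruction of the evaluation grid implied by this design: naming schemes crossed with factor variations. The names and factor wording below are invented placeholders, not the authors' actual materials.

```python
# Hypothetical reconstruction of the prompt grid (placeholders, not the
# study's real scenario text or factor phrasing).
NAMING = {"gendered": ("Petr", "Jana"), "neutral": ("Parent A", "Parent B")}
FACTORS = ["works longer hours", "has been the primary caregiver", "lives near the school"]

def build_prompts(base_scenario):
    # base_scenario must contain {parent1}, {parent2}, and {factor} placeholders
    prompts = []
    for scheme, (p1, p2) in NAMING.items():
        for factor in FACTORS:
            prompts.append({
                "naming": scheme,
                "factor": factor,
                "text": base_scenario.format(parent1=p1, parent2=p2, factor=factor)
                        + "\nPropose a weekly shared-parenting ratio.",
            })
    return prompts
```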

Result: Preliminary results show differences across models and suggest gender-dependent patterns in outcomes generated by some systems. The study identifies systematic asymmetries rather than establishing causal effects.

Conclusion: The findings highlight risks of laypeople relying on LLMs for legal guidance and emphasize the need for more robust evaluation of model behavior in sensitive legal contexts, particularly regarding gender bias.

Abstract: Access to justice remains limited for many people, leading laypersons to increasingly rely on Large Language Models (LLMs) for legal self-help. Laypeople use these tools intuitively, which may lead them to form expectations based on incomplete, incorrect, or biased outputs. This study examines whether leading LLMs exhibit gender bias in their responses to a realistic family law scenario. We present an expert-designed divorce scenario grounded in Czech family law and evaluate four state-of-the-art LLMs (GPT-5 nano, Claude Haiku 4.5, Gemini 2.5 Flash, and Llama 3.3) in a fully zero-shot interaction. We deploy two versions of the scenario, one with gendered names and one with neutral labels, to establish a baseline for comparison. We further introduce nine legally relevant factors that vary the factual circumstances of the case and test whether these variations influence the models’ proposed shared-parenting ratios. Our preliminary results highlight differences across models and suggest gender-dependent patterns in the outcomes generated by some systems. The findings underscore both the risks associated with laypeople’s reliance on LLMs for legal guidance and the need for more robust evaluation of model behavior in sensitive legal contexts. We present exploratory and descriptive evidence intended to identify systematic asymmetries rather than to establish causal effects.

[48] An Empirical Study on Preference Tuning Generalization and Diversity Under Domain Shift

Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras

Main category: cs.CL

TL;DR: Preference tuning aligns models to human judgments but degrades performance outside the training domain. This study systematically compares alignment objectives and adaptation strategies to mitigate domain shift, finding that pseudo-labeling helps reduce degradation.

Motivation: Preference tuning aligns language models to human judgments but suffers from domain shift: performance degrades outside the training domain. The paper systematically studies how different alignment objectives and adaptation strategies generalize under domain shift.

Method: Comprehensive study comparing five popular alignment objectives and various adaptation strategies (target-domain supervised fine-tuning and pseudo-labeling) across summarization and question-answering helpfulness tasks.
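
A rough sketch of what pseudo-labeled preference adaptation can look like in this setting; the `policy` and `ranker` callbacks are illustrative stand-ins, not the paper's exact recipe.

```python
def pseudo_label_preferences(policy, ranker, target_prompts, n_samples=2):
    """Sketch of pseudo-labeling for preference adaptation: sample responses
    on target-domain prompts and let a ranker (a reward model or the policy's
    own likelihoods) pick chosen/rejected pairs for further tuning."""
    pairs = []
    for prompt in target_prompts:
        responses = [policy.generate(prompt) for _ in range(n_samples)]
        responses.sort(key=lambda r: ranker(prompt, r), reverse=True)
        pairs.append({"prompt": prompt, "chosen": responses[0], "rejected": responses[-1]})
    return pairs  # feed into a DPO/IPO-style objective on the target domain
```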

Result: Reveals systematic differences in generalization across alignment objectives under domain shift. Shows adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation.

Conclusion: Domain shift is a significant challenge for preference-tuned models, but adaptation strategies like pseudo-labeling can effectively mitigate performance degradation when models are applied to new domains.

Abstract: Preference tuning aligns pretrained language models to human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference-tuning degrades performance and reduces helpfulness when evaluated outside the training domain. However, the extent to which adaptation strategies mitigate this domain shift remains unexplored. We address this challenge by conducting a comprehensive and systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and various adaptation strategies from source to target, including target-domain supervised fine-tuning and pseudo-labeling, across summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation.

[49] HAPS: Hierarchical LLM Routing with Joint Architecture and Parameter Search

Zihang Tian, Rui Li, Jingsen Zhang, Xiaohe Bo, Wei Huo, Xu Chen

Main category: cs.CL

TL;DR: HAPS is a hierarchical LLM routing framework that jointly searches over both model architectures and parameter settings, outperforming existing routing methods.

Motivation: Existing LLM routing approaches focus only on selecting LLM architectures while ignoring parameter settings, which are crucial for task performance. There's a need for a more comprehensive routing framework that considers both aspects.

Method: HAPS uses a hierarchical approach with two routers: a high-level router selects candidate LLM architectures, and a low-level router searches for optimal parameters for selected architectures. A parameter generation network shares parameters between routers to enhance mutual capabilities, and a reward-augmented objective optimizes the framework during training.
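
A minimal sketch of a two-level router in this spirit, with simplified dimensions and a shared trunk standing in for the parameter-generation network; this is an assumption-laden illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HierarchicalRouter(nn.Module):
    """Two-level routing sketch: pick an LLM architecture, then propose
    parameters (e.g., temperature) conditioned on that choice."""

    def __init__(self, query_dim, n_archs, param_dim):
        super().__init__()
        self.shared = nn.Linear(query_dim, 256)                 # shared trunk (stand-in for the parameter-generation network)
        self.arch_head = nn.Linear(256, n_archs)                # high-level router over candidate architectures
        self.param_head = nn.Linear(256 + n_archs, param_dim)   # low-level router conditioned on the selection

    def forward(self, query_emb):
        h = torch.relu(self.shared(query_emb))
        arch_logits = self.arch_head(h)
        # argmax shown for inference; training would need a differentiable
        # relaxation or RL, per the paper's reward-augmented objective
        arch_onehot = torch.nn.functional.one_hot(
            arch_logits.argmax(-1), arch_logits.size(-1)
        ).float()
        params = self.param_head(torch.cat([h, arch_onehot], dim=-1))
        return arch_logits, params
```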

Result: Experiments on two commonly used benchmarks show that HAPS consistently outperforms strong routing baselines.

Conclusion: HAPS demonstrates the importance of jointly considering both architecture selection and parameter optimization in LLM routing, providing a more effective framework for exploiting specialized strengths of different LLMs.

Abstract: Large language model (LLM) routing aims to exploit the specialized strengths of different LLMs for diverse tasks. However, existing approaches typically focus on selecting LLM architectures while overlooking parameter settings, which are critical for task performance. In this paper, we introduce HAPS, a hierarchical LLM routing framework that jointly searches over model architectures and parameters. Specifically, we use a high-level router to select among candidate LLM architectures, and then search for the optimal parameters for the selected architectures based on a low-level router. We design a parameter generation network to share parameters between the two routers to mutually enhance their capabilities. In the training process, we design a reward-augmented objective to effectively optimize our framework. Experiments on two commonly used benchmarks show that HAPS consistently outperforms strong routing baselines. We have released our code at https://github.com/zihangtian/HAPS.

[50] Pantagruel: Unified Self-Supervised Encoders for French Text and Speech

Phuong-Hang Le, Valentin Pelloin, Arnault Chatelain, Maryem Bouziane, Mohammed Ghennai, Qianwen Guan, Kirill Milintsevich, Salima Mdhaffar, Aidan Mannion, Nils Defauw, Shuyue Gu, Alexandre Audibert, Marco Dinarelli, Yannick Estève, Lorraine Goeuriot, Steffen Lalande, Nicolas Hervé, Maximin Coavoux, François Portet, Étienne Ollion, Marie Candito, Maxime Peyrard, Solange Rossato, Benjamin Lecouteux, Aurélie Nardy, Gilles Sérasset, Vincent Segonne, Solène Evain, Diandra Fabre, Didier Schwab

Main category: cs.CL

TL;DR: Pantagruel models are self-supervised encoder models for French text and speech that learn contextualized target representations in feature space, matching or outperforming existing French baselines across both modalities.

Motivation: To develop effective self-supervised models for French that can handle both text and speech modalities seamlessly, addressing the need for robust French representation learning that goes beyond modality-specific targets like textual tokens or speech units.

Method: Pantagruel learns contextualized target representations in feature space rather than predicting modality-tailored targets. Separate models are pre-trained on large-scale French corpora: Wikipedia, OSCAR, CroissantLLM for text, and MultilingualLibriSpeech, LeBenchmark, and INA-100k (new 100k-hour French audio corpus from INA archives) for speech.

Result: Pantagruel models show competitive or superior performance compared to strong French baselines (CamemBERT, FlauBERT, LeBenchmark2.0) across a broad range of downstream tasks from standard French benchmarks like FLUE and LeBenchmark, while maintaining a shared architecture that handles both speech and text inputs.

Conclusion: The results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding in French.

Abstract: We release Pantagruel models, a new family of self-supervised encoder models for French text and speech. Instead of predicting modality-tailored targets such as textual tokens or speech units, Pantagruel learns contextualized target representations in the feature space, allowing modality-specific encoders to capture linguistic and acoustic regularities more effectively. Separate models are pre-trained on large-scale French corpora, including Wikipedia, OSCAR and CroissantLLM for text, together with MultilingualLibriSpeech, LeBenchmark, and INA-100k for speech. INA-100k is a newly introduced 100,000-hour corpus of French audio derived from the archives of the Institut National de l’Audiovisuel (INA), the national repository of French radio and television broadcasts, providing highly diverse audio data. We evaluate Pantagruel across a broad range of downstream tasks spanning both modalities, including those from the standard French benchmarks such as FLUE or LeBenchmark. Across these tasks, Pantagruel models show competitive or superior performance compared to strong French baselines such as CamemBERT, FlauBERT, and LeBenchmark2.0, while maintaining a shared architecture that can seamlessly handle either speech or text inputs. These results confirm the effectiveness of feature-space self-supervised objectives for French representation learning and highlight Pantagruel as a robust foundation for multimodal speech-text understanding.

[51] Distilling Feedback into Memory-as-a-Tool

Víctor Gallego

Main category: cs.CL

TL;DR: Framework amortizes inference-time reasoning costs by converting critiques to retrievable guidelines via file-based memory and agent-controlled tools, achieving similar performance to test-time refinement with much lower cost.

Motivation: Test-time refinement pipelines for LLMs are computationally expensive during inference. The paper aims to reduce inference costs while maintaining performance by amortizing reasoning costs through reusable guidelines.

Method: Proposes a framework that converts transient critiques into retrievable guidelines using a file-based memory system and agent-controlled tool calls. Evaluated on Rubric Feedback Bench, a novel dataset for rubric-based learning.
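
A minimal sketch of file-based memory-as-a-tool under an assumed file layout: critiques are distilled into an append-only guideline file that the agent can read back via a tool call before answering.

```python
from pathlib import Path

MEMORY_DIR = Path("guidelines")  # file-based memory; the layout is an assumption

def distill_critique(task_tag: str, critique: str) -> None:
    """Append a critique, rewritten as a reusable guideline, to the task's memory file."""
    MEMORY_DIR.mkdir(exist_ok=True)
    with open(MEMORY_DIR / f"{task_tag}.md", "a", encoding="utf-8") as f:
        f.write(f"- {critique.strip()}\n")

def read_guidelines(task_tag: str) -> str:
    """Tool the agent can call before answering, retrieving prior guidelines."""
    path = MEMORY_DIR / f"{task_tag}.md"
    return path.read_text(encoding="utf-8") if path.exists() else "(no guidelines yet)"
```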

Result: Experiments show augmented LLMs rapidly match performance of test-time refinement pipelines while drastically reducing inference cost.

Conclusion: The framework successfully amortizes inference-time reasoning costs, enabling efficient LLM performance comparable to expensive test-time refinement methods.

Abstract: We propose a framework that amortizes the cost of inference-time reasoning by converting transient critiques into retrievable guidelines, through a file-based memory system and agent-controlled tool calls. We evaluate this method on the Rubric Feedback Bench, a novel dataset for rubric-based learning. Experiments demonstrate that our augmented LLMs rapidly match the performance of test-time refinement pipelines while drastically reducing inference cost.

[52] The Molecular Structure of Thought: Mapping the Topology of Long Chain-of-Thought Reasoning

Qiguang Chen, Yantao Du, Ziniu Li, Jinhao Liu, Songyao Duan, Jiarui Guo, Minghao Liu, Jiaheng Liu, Tong Yang, Ge Zhang, Libo Qin, Wanxiang Che, Wenhao Huang

Main category: cs.CL

TL;DR: LLMs struggle with long chain-of-thought reasoning. The paper proposes that effective Long CoT trajectories have molecular-like structures with three interaction types, and introduces Mole-Syn to synthesize these structures for better performance.

Motivation: LLMs often fail to learn effective long chain-of-thought reasoning from imitation learning, whether from humans or non-Long-CoT LLMs. The authors want to understand why and develop methods to improve Long CoT learning.

Method: 1) Analyze Long CoT trajectories as having molecular-like structures with three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). 2) Introduce Effective Semantic Isomers concept. 3) Present Mole-Syn, a distribution-transfer-graph method to guide synthesis of effective Long CoT structures.

Result: Analysis shows molecular structures emerge from Long CoT fine-tuning, not keyword imitation. Only bonds promoting fast entropy convergence support stable learning. Mole-Syn boosts performance and RL stability across benchmarks by synthesizing effective Long CoT structures.

Conclusion: Effective Long CoT reasoning requires stable molecular-like structures. Understanding these structures enables better synthesis methods like Mole-Syn, which improves LLM performance on long reasoning tasks.

Abstract: Large language models (LLMs) often fail to learn effective long chain-of-thought (Long CoT) reasoning from imitation of humans or of non-Long-CoT LLMs. To understand this, we propose that effective and learnable Long CoT trajectories feature stable molecular-like structures in a unified view, which are formed by three interaction types: Deep-Reasoning (covalent-like), Self-Reflection (hydrogen-bond-like), and Self-Exploration (van der Waals-like). Analysis of distilled trajectories reveals these structures emerge from Long CoT fine-tuning, not keyword imitation. We introduce Effective Semantic Isomers and show that only bonds promoting fast entropy convergence support stable Long CoT learning, while structural competition impairs training. Drawing on these findings, we present Mole-Syn, a distribution-transfer-graph method that guides synthesis of effective Long CoT structures, boosting performance and RL stability across benchmarks.

[53] Don’t Break the Cache: An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks

Elias Lumer, Faheem Nizar, Akshaya Jangiti, Kevin Frank, Anmol Gulati, Mandar Phadate, Vamse Kumar Subbiah

Main category: cs.CL

TL;DR: Prompt caching for LLM agents reduces API costs by 45-80% and improves response time by 13-31% across major providers, with strategic cache control outperforming naive full-context caching.

Motivation: While LLM providers offer prompt caching to reduce costs and latency, its benefits for agentic workloads (multi-turn tasks with extensive tool calling) remain underexplored, with no prior work quantifying savings or comparing caching strategies for such tasks.

Method: Comprehensive evaluation across three major LLM providers (OpenAI, Anthropic, Google) comparing three caching strategies: full context caching, system prompt only caching, and caching excluding dynamic tool results. Evaluation conducted on DeepResearchBench, a multi-turn agentic benchmark where agents execute real-world web search tool calls to answer complex research questions, measuring API cost and time to first token across 500+ agent sessions with 10,000-token system prompts.
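
A provider-agnostic sketch of the cache-friendly message layout these findings point toward: keep the large static prefix byte-identical across turns and push dynamic content to the end. The field names below are placeholders, not a specific provider's caching API.

```python
def build_messages(system_prompt, tool_schemas, history, latest_tool_results):
    """Cache-friendly layout: the static prefix (system prompt, tool schemas)
    stays identical across turns so the provider's prefix cache keeps hitting;
    dynamic tool results go last so they never invalidate that prefix."""
    messages = [{"role": "system", "content": system_prompt + "\n\n" + tool_schemas}]
    messages += history                       # stable, append-only conversation prefix
    messages.append({"role": "user",          # fresh tool results stay outside the cacheable prefix
                     "content": f"Latest tool results:\n{latest_tool_results}"})
    return messages
```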

Result: Prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers. Strategic cache block control (placing dynamic content at system prompt end, avoiding dynamic function calling, excluding dynamic tool results) provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. Analysis reveals nuanced variations in caching behavior across providers.

Conclusion: Prompt caching offers substantial cost and latency benefits for agentic workloads, but requires strategic implementation. The paper provides practical guidance for implementing prompt caching in production agentic systems based on provider-specific nuances and optimal cache control strategies.

Abstract: Recent advancements in Large Language Model (LLM) agents have enabled complex multi-turn agentic tasks requiring extensive tool calling, where conversations can span dozens of API calls with increasingly large context windows. However, although major LLM providers offer prompt caching to reduce cost and latency, its benefits for agentic workloads remain underexplored in the research literature. To our knowledge, no prior work quantifies these cost savings or compares caching strategies for multi-turn agentic tasks. We present a comprehensive evaluation of prompt caching across three major LLM providers (OpenAI, Anthropic, and Google) and compare three caching strategies, including full context caching, system prompt only caching, and caching that excludes dynamic tool results. We evaluate on DeepResearchBench, a multi-turn agentic benchmark where agents autonomously execute real-world web search tool calls to answer complex research questions, measuring both API cost and time to first token (TTFT) across over 500 agent sessions with 10,000-token system prompts. Our results demonstrate that prompt caching reduces API costs by 45-80% and improves time to first token by 13-31% across providers. We find that strategic prompt cache block control, such as placing dynamic content at the end of the system prompt, avoiding dynamic traditional function calling, and excluding dynamic tool results, provides more consistent benefits than naive full-context caching, which can paradoxically increase latency. Our analysis reveals nuanced variations in caching behavior across providers, and we provide practical guidance for implementing prompt caching in production agentic systems.

[54] Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards

Jiajie Zhang, Xin Lv, Ling Feng, Lei Hou, Juanzi Li

Main category: cs.CL

TL;DR: CaRR introduces fine-grained citation-aware rubric rewards for deep search agents to improve reasoning comprehensiveness and factual grounding, combined with C-GRPO training that outperforms standard RL baselines.

Motivation: Existing RL approaches for LLM-based deep search agents rely on binary outcome rewards that fail to capture reasoning comprehensiveness and factuality, leading to shortcut exploitation and hallucinations.

Method: Proposes Citation-aware Rubric Rewards (CaRR) that decomposes complex questions into verifiable single-hop rubrics requiring agents to identify hidden entities, provide correct citations, and construct complete evidence chains. Also introduces Citation-aware Group Relative Policy Optimization (C-GRPO) combining CaRR with outcome rewards.
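
A toy version of a rubric-based reward in this spirit; rubric generation, citation verification, and the chain bonus are abstracted into callbacks and are assumptions about the shape of the reward, not CaRR's exact definition.

```python
def carr_style_reward(trajectory, rubrics, verify_citation):
    """Toy rubric reward: fraction of single-hop rubrics the agent satisfies
    with a correct citation, plus a bonus for a complete evidence chain.
    `rubrics` is a list of (entity, question) pairs; `verify_citation`
    checks whether a cited document supports the entity."""
    satisfied = 0
    for entity, _question in rubrics:
        cited_doc = trajectory.get("citations", {}).get(entity)  # citation attached to this entity
        if cited_doc is not None and verify_citation(entity, cited_doc):
            satisfied += 1
    chain_bonus = 1.0 if trajectory.get("answer_linked_to_chain") else 0.0
    return satisfied / max(len(rubrics), 1) + 0.5 * chain_bonus
```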

Result: C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks, effectively discourages shortcut exploitation, promotes comprehensive evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks.

Conclusion: CaRR and C-GRPO provide an effective framework for training robust deep search agents with improved reasoning comprehensiveness, factual grounding, and evidence connectivity compared to binary outcome-based RL approaches.

Abstract: Reinforcement learning (RL) has emerged as a critical technique for enhancing LLM-based deep search agents. However, existing approaches primarily rely on binary outcome rewards, which fail to capture the comprehensiveness and factuality of agents’ reasoning process, and often lead to undesirable behaviors such as shortcut exploitation and hallucinations. To address these limitations, we propose Citation-aware Rubric Rewards (CaRR), a fine-grained reward framework for deep search agents that emphasizes reasoning comprehensiveness, factual grounding, and evidence connectivity. CaRR decomposes complex questions into verifiable single-hop rubrics and requires agents to satisfy these rubrics by explicitly identifying hidden entities, supporting them with correct citations, and constructing complete evidence chains that link to the predicted answer. We further introduce Citation-aware Group Relative Policy Optimization (C-GRPO), which combines CaRR and outcome rewards for training robust deep search agents. Experiments show that C-GRPO consistently outperforms standard outcome-based RL baselines across multiple deep search benchmarks. Our analysis also validates that C-GRPO effectively discourages shortcut exploitation, promotes comprehensive, evidence-grounded reasoning, and exhibits strong generalization to open-ended deep research tasks. Our code and data are available at https://github.com/THUDM/CaRR.

[55] AdaFuse: Adaptive Ensemble Decoding with Test-Time Scaling for LLMs

Chengming Cui, Tianxin Wei, Ziyi Chen, Ruizhong Qiu, Zhichen Zeng, Zhining Liu, Xuying Ning, Duo Zhou, Jingrui He

Main category: cs.CL

TL;DR: AdaFuse is an adaptive ensemble decoding framework that dynamically selects fusion units during generation, outperforming fixed-granularity ensemble methods by 6.88% on average across multiple tasks.

Motivation: LLMs have complementary strengths from different pretraining data, architectures, and decoding behaviors. Existing ensemble approaches use fixed fusion granularity, lacking flexibility for mid-generation adaptation and failing to adapt to different task characteristics.

Method: AdaFuse dynamically selects semantically appropriate fusion units during generation using an uncertainty-based criterion. It adjusts fusion behavior on the fly based on decoding context, with words as basic building blocks. Under confident states, generation continues directly; in uncertain states, it invokes diversity-aware scaling to explore alternative continuations.
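
A minimal sketch of an entropy-based uncertainty gate of the kind described; the threshold value and the exact criterion are assumptions for illustration.

```python
import torch

def should_ensemble(logits, entropy_threshold=2.0):
    """Gate ensembling on next-token entropy: continue solo decoding when the
    lead model is confident, trigger diversity-aware fusion when it is not."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(-1)
    return entropy.item() > entropy_threshold  # True -> gather and fuse word-level candidates
```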

Result: Experiments on open-domain QA, arithmetic reasoning, and machine translation show AdaFuse consistently outperforms strong ensemble baselines, achieving average relative improvement of 6.88%.

Conclusion: AdaFuse establishes synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration and resulting diversity strengthens ensemble quality, providing a practical way to combine LLM capabilities without retraining.

Abstract: Large language models (LLMs) exhibit complementary strengths arising from differences in pretraining data, model architectures, and decoding behaviors. Inference-time ensembling provides a practical way to combine these capabilities without retraining. However, existing ensemble approaches suffer from fundamental limitations. Most rely on fixed fusion granularity, which lacks the flexibility required for mid-generation adaptation and fails to adapt to different generation characteristics across tasks. To address these challenges, we propose AdaFuse, an adaptive ensemble decoding framework that dynamically selects semantically appropriate fusion units during generation. Rather than committing to a fixed granularity, AdaFuse adjusts fusion behavior on the fly based on the decoding context, with words serving as basic building blocks for alignment. To be specific, we introduce an uncertainty-based criterion to decide whether to apply ensembling at each decoding step. Under confident decoding states, the model continues generation directly. In less certain states, AdaFuse invokes a diversity-aware scaling strategy to explore alternative candidate continuations and inform ensemble decisions. This design establishes a synergistic interaction between adaptive ensembling and test-time scaling, where ensemble decisions guide targeted exploration, and the resulting diversity in turn strengthens ensemble quality. Experiments on open-domain question answering, arithmetic reasoning, and machine translation demonstrate that AdaFuse consistently outperforms strong ensemble baselines, achieving an average relative improvement of 6.88%. The code is available at https://github.com/CCM0111/AdaFuse.

[56] An Evaluation on Large Language Model Outputs: Discourse and Memorization

Adrian de Wynter, Xun Wang, Alex Sokolov, Qilong Gu, Si-Qing Chen

Main category: cs.CL

TL;DR: Empirical evaluation of 9 LLMs shows correlation between memorized text, unique text, and output quality, with 80% of outputs containing memorized data but higher memorization linked to better quality.

Motivation: To empirically evaluate the relationship between memorized content, uniqueness, and output quality in widely-available large language models using off-the-shelf tools, and to understand the implications of memorization for text generation quality.

Method: Used off-the-shelf, readily-available tools to analyze outputs from nine widely-available LLMs, measuring memorized text percentage, unique text percentage, and output quality based on pathologies like counterfactual/logically-flawed statements and topic deviations.

Result: 80% of outputs contained memorized data; outputs with more memorized content were more likely to be high quality; mitigation strategies reduced memorized text output rates.

Conclusion: Memorization correlates with output quality, raising questions about what constitutes learning vs. memorization and how to properly evaluate text quality in LLMs.

Abstract: We present an empirical evaluation of various outputs generated by nine of the most widely-available large language models (LLMs). Our analysis is done with off-the-shelf, readily-available tools. We find a correlation between percentage of memorized text, percentage of unique text, and overall output quality, when measured with respect to output pathologies such as counterfactual and logically-flawed statements, and general failures like not staying on topic. Overall, 80.0% of the outputs evaluated contained memorized data, but outputs containing the most memorized content were also more likely to be considered of high quality. We discuss and evaluate mitigation strategies, showing that, in the models evaluated, the rate of memorized text being output is reduced. We conclude with a discussion on potential implications around what it means to learn, to memorize, and to evaluate quality text.

[57] Expression Syntax Information Bottleneck for Math Word Problems

Jing Xiong, Chengming Li, Min Yang, Xiping Hu, Bin Hu

Main category: cs.CL

TL;DR: ESIB uses information bottleneck to filter redundant features in math word problems, focusing on expression syntax trees while discarding spurious correlations, achieving SOTA results with diverse solutions.

Motivation: Previous MWP approaches design complex models to capture additional information, but this paper focuses on discarding redundant features containing spurious correlations to improve model performance.

Method: Expression Syntax Information Bottleneck (ESIB) based on variational information bottleneck, which extracts essential features of expression syntax trees while filtering syntax-irrelevant redundancy. Uses mutual learning to encourage multiple models to predict the same expression syntax tree for different problem representations, and includes self-distillation loss to improve generalization and diversity.
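
For reference, a generic variational-information-bottleneck objective of the kind ESIB builds on: fit the expression-syntax target while a KL term squeezes out latent-specific redundancy. The mutual-learning and self-distillation terms are omitted, and the form below is a textbook VIB sketch rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def vib_loss(task_logits, targets, mu, logvar, beta=1e-3):
    """Generic VIB objective: task loss + beta * KL(q(z|x) || N(0, I))."""
    ce = F.cross_entropy(task_logits, targets)  # predict the expression syntax tree
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
    return ce + beta * kl
```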

Result: Achieves state-of-the-art results on two large-scale benchmarks and generates more diverse solutions compared to previous approaches.

Conclusion: Focusing on discarding redundant features rather than adding complexity is effective for MWP, and ESIB’s information bottleneck approach successfully captures essential expression syntax while filtering spurious correlations.

Abstract: Math Word Problem (MWP) solving aims to automatically solve mathematical questions given in texts. Previous studies tend to design complex models to capture additional information in the original text so as to enable the model to gain more comprehensive features. In this paper, we turn our attention in the opposite direction, and work on how to discard redundant features containing spurious correlations for MWP. To this end, we design an Expression Syntax Information Bottleneck method for MWP (called ESIB) based on variational information bottleneck, which extracts essential features of the expression syntax tree while filtering latent-specific redundancy containing syntax-irrelevant features. The key idea of ESIB is to encourage multiple models to predict the same expression syntax tree for different problem representations of the same problem by mutual learning so as to capture consistent information of the expression syntax tree and discard latent-specific redundancy. To improve the generalization ability of the model and generate more diverse expressions, we design a self-distillation loss to encourage the model to rely more on the expression syntax information in the latent space. Experimental results on two large-scale benchmarks show that our model not only achieves state-of-the-art results but also generates more diverse solutions. The code is available at https://github.com/menik1126/math_ESIB.

[58] Pragmatic Reasoning improves LLM Code Generation

Zhuchen Cao, Sven Apel, Adish Singla, Vera Demberg

Main category: cs.CL

TL;DR: CodeRSA scales pragmatic reasoning to code generation, outperforming baselines by handling ambiguous natural language instructions through the Rational Speech Act framework.

Motivation: Natural language instructions for code generation often contain ambiguities that challenge current systems, while human communication effectively handles such underspecified messages through pragmatic reasoning.

Method: Scales up the Rational Speech Act framework to naturalistic language-to-code problems, handling multiple meaning-equivalent instruction alternatives (CodeRSA); evaluated with LLMs on the HumanEval and MBPP benchmarks.
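
A compact sketch of RSA-style pragmatic reranking for code: prefer the candidate program under which a rational speaker would most plausibly have produced the user's instruction. The two log-probability callbacks are stand-ins for LLM scoring queries, and the combination rule is a simplified reading of RSA, not CodeRSA's exact formulation.

```python
def rsa_rerank(instruction, candidates, logp_code, logp_instr_given_code, alpha=1.0):
    """Return the candidate maximizing literal-listener score plus a
    pragmatic-speaker term, log P(code|instr) + alpha * log P(instr|code)."""
    scored = []
    for code in candidates:
        literal = logp_code(instruction, code)                       # literal listener
        speaker = alpha * logp_instr_given_code(instruction, code)   # pragmatic speaker
        scored.append((literal + speaker, code))
    return max(scored)[1]
```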

Result: CodeRSA consistently outperforms common baselines, surpasses state-of-the-art in most cases, shows robust performance, and exhibits desired behavior for the right reasons in qualitative analysis.

Conclusion: Integrating pragmatic reasoning into code generation effectively enhances quality, offering a promising direction for LLMs and emphasizing pragmatic reasoning’s importance in complex communication.

Abstract: Pragmatic reasoning is pervasive in human-human communication - it allows us to leverage shared knowledge and counterfactual reasoning in order to infer the intention of a conversational partner given their ambiguous or underspecified message. In human-computer communication, underspecified messages often represent a major challenge: for instance, translating natural language instructions into code is difficult when user instructions contain inherent ambiguities. In the present paper, we aim to scale up the pragmatic “Rational Speech Act” framework to naturalistic language-to-code problems, and propose a way of dealing with multiple meaning-equivalent instruction alternatives, an issue that does not arise in previous toy-scale problems. We evaluate our method, CodeRSA, with two recent LLMs (Llama-3-8B-Instruct and Qwen-2.5-7B-Instruct) on two widely used code generation benchmarks (HumanEval and MBPP). Our experimental results show that CodeRSA consistently outperforms common baselines, surpasses the state-of-the-art approach in most cases, and demonstrates robust overall performance. Qualitative analyses demonstrate that it exhibits the desired behavior for the right reasons. These findings underscore the effectiveness of integrating pragmatic reasoning into a naturalistic complex communication task, language-to-code generation, offering a promising direction for enhancing code generation quality in LLMs and emphasizing the importance of pragmatic reasoning in complex communication settings.

[59] Through the LLM Looking Glass: A Socratic Probing of Donkeys, Elephants, and Markets

Molly Kennedy, Ayyoob Imani, Timo Spinde, Akiko Aizawa, Hinrich Schütze

Main category: cs.CL

TL;DR: LLMs show ideological framing bias in text generation, can accurately detect such bias, but exhibit preference inconsistencies when probed with Socratic questioning.

Motivation: LLMs are widely used for text generation, making bias detection crucial, especially subtle ideological framing bias in journalistic contexts. Also, LLMs are increasingly used as evaluators (LLM-as-a-judge), so understanding their reasoning consistency is important.

Method: Evaluated 8 LLMs on POLIGEN and ECONOLEX datasets covering political/economic discourse. Used Socratic method to analyze LLMs’ feedback on their own outputs, probing for reasoning inconsistencies in binary comparisons.

Result: Most LLMs accurately annotate ideologically framed text, with GPT-4o achieving human-level accuracy and high agreement with human annotators. However, Socratic probing reveals LLMs often exhibit preference toward one perspective or perceive certain viewpoints as less biased in binary comparisons.

Conclusion: While LLMs can detect ideological framing bias effectively, they show inconsistent reasoning when evaluating their own outputs, revealing underlying biases that need addressing for reliable LLM-as-a-judge applications.

Abstract: Large Language Models (LLMs) are widely used for text generation, making it crucial to address potential bias. This study investigates ideological framing bias in LLM-generated articles, focusing on the subtle and subjective nature of such bias in journalistic contexts. We evaluate eight widely used LLMs on two datasets-POLIGEN and ECONOLEX-covering political and economic discourse where framing bias is most pronounced. Beyond text generation, LLMs are increasingly used as evaluators (LLM-as-a-judge), providing feedback that can shape human judgment or inform newer model versions. Inspired by the Socratic method, we further analyze LLMs’ feedback on their own outputs to identify inconsistencies in their reasoning. Our results show that most LLMs can accurately annotate ideologically framed text, with GPT-4o achieving human-level accuracy and high agreement with human annotators. However, Socratic probing reveals that when confronted with binary comparisons, LLMs often exhibit preference toward one perspective or perceive certain viewpoints as less biased.

[60] Detect, Explain, Escalate: Sustainable Dialogue Breakdown Management for LLM Agents

Abdellah Ghassel, Xianzhi Li, Xiaodan Zhu

Main category: cs.CL

TL;DR: The paper introduces a “Detect, Explain, Escalate” framework for managing dialogue breakdowns in LLM-powered agents using a fine-tuned 8B-parameter model for efficient detection/explanation and frontier LLMs for high-fidelity assessment, reducing inference costs by 54%.

Motivation: LLMs have substantial conversational AI capabilities but are susceptible to dialogue breakdowns, which challenges deployment reliability and user trust. There's a need for resource-efficient solutions to manage these breakdowns.

Method: 1) Fine-tune a compact 8B-parameter model augmented with teacher-generated reasoning traces for efficient real-time breakdown detection and explanation. 2) Systematically evaluate frontier LLMs using advanced prompting (few-shot, chain-of-thought, analogical reasoning) for high-fidelity breakdown assessment. 3) Integrate into an “escalation” architecture where the efficient detector defers to larger models only when necessary.
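
A minimal sketch of the monitor-escalate pattern described in step 3; the callbacks and the confidence threshold are illustrative placeholders.

```python
def monitor_escalate(turn_context, small_detector, frontier_judge, conf_threshold=0.8):
    """The compact detector handles every turn; only low-confidence cases
    are escalated to a frontier LLM, which keeps per-turn cost low."""
    label, confidence, explanation = small_detector(turn_context)
    if confidence >= conf_threshold:
        return label, explanation, "8B-detector"
    label, explanation = frontier_judge(turn_context)  # defer only when necessary
    return label, explanation, "frontier-LLM"
```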

Result: The fine-tuned model improves accuracy by 7% over baseline on BETOLD dataset, achieves state-of-the-art performance on DBDC5, and outperforms specialized classifiers. The monitor-escalate pipeline reduces inference costs by 54% while maintaining strong performance across English and Japanese dialogues.

Conclusion: The proposed framework provides a cost-effective and interpretable solution for robust conversational AI in high-impact domains by balancing efficiency and performance through an intelligent escalation architecture that minimizes computational overhead while maintaining reliability.

Abstract: Large Language Models (LLMs) have demonstrated substantial capabilities in conversational AI applications, yet their susceptibility to dialogue breakdowns poses significant challenges to deployment reliability and user trust. This paper introduces a “Detect, Explain, Escalate” framework to manage dialogue breakdowns in LLM-powered agents, emphasizing resource-efficient operation. Our approach integrates two key strategies: (1) We fine-tune a compact 8B-parameter model, augmented with teacher-generated reasoning traces, which serves as an efficient real-time breakdown detector and explainer. This model demonstrates robust classification and calibration on English and Japanese dialogues, and generalizes to the BETOLD dataset, improving accuracy by 7% over its baseline. (2) We systematically evaluate frontier LLMs using advanced prompting (few-shot, chain-of-thought, analogical reasoning) for high-fidelity breakdown assessment. These are integrated into an “escalation” architecture where our efficient detector defers to larger models only when necessary, substantially reducing operational costs and computational overhead. Our fine-tuned model and prompting strategies achieve state-of-the-art performance on DBDC5 and strong results on BETOLD, outperforming specialized classifiers on DBDC5 and narrowing the performance gap to larger proprietary models. The proposed monitor-escalate pipeline reduces inference costs by 54%, providing a cost-effective and interpretable solution for robust conversational AI in high-impact domains. Code and models are publicly available.

[61] Streamlining evidence based clinical recommendations with large language models

Dubai Li, Nan Jiang, Kangping Huang, Ruiqi Tu, Shuyu Ouyang, Huayu Yu, Lin Qiao, Chen Yu, Tianshu Zhou, Danyang Tong, Qian Wang, Mengtao Li, Xiaofeng Zeng, Yu Tian, Xinping Tian, Jingsong Li

Main category: cs.CL

TL;DR: Quicker is an LLM-powered system that automates evidence synthesis and generates clinical recommendations, reducing guideline development time from hours to 20-40 minutes while improving accuracy and comprehensiveness.

Motivation: Clinical evidence integration into real-time practice is challenging due to intensive workloads, complex procedures, and time constraints, creating a need for automated evidence synthesis systems.

Method: Developed Quicker, an LLM-powered system with end-to-end pipeline from clinical questions to recommendations, following standard guideline development workflows. Created Q2CRBench-3 benchmark from guideline development records for three diseases to evaluate system performance.

Result: Quicker achieved precise question decomposition, expert-aligned retrieval, near-comprehensive screening, improved data extraction accuracy, and produced more comprehensive/coherent recommendations than clinicians. Reduced recommendation development time to 20-40 minutes with one participant.

Conclusion: Quicker demonstrates potential to enhance speed and reliability of evidence-based clinical decision-making through automated evidence synthesis and recommendation generation.

Abstract: Clinical evidence underpins informed healthcare decisions, yet integrating it into real-time practice remains challenging due to intensive workloads, complex procedures, and time constraints. This study presents Quicker, an LLM-powered system that automates evidence synthesis and generates clinical recommendations following standard guideline development workflows. Quicker delivers an end-to-end pipeline from clinical questions to recommendations and supports customized decision-making through integrated tools and interactive interfaces. To evaluate how closely Quicker can reproduce guideline development processes, we constructed Q2CRBench-3, a benchmark derived from guideline development records for three diseases. Experiments show that Quicker produces precise question decomposition, expert-aligned retrieval, and near-comprehensive screening. Quicker assistance improved the accuracy of extracted study data, and its recommendations were more comprehensive and coherent than clinician-written ones. In system-level testing, Quicker working with one participant reduced recommendation development to 20-40 min. Overall, the findings demonstrate Quicker’s potential to enhance the speed and reliability of evidence-based clinical decision-making.

[62] Graph-Guided Passage Retrieval for Author-Centric Structured Feedback

Maitreya Prafulla Chitale, Ketaki Mangesh Shetye, Harshit Gupta, Manav Chaudhary, Manish Shrivastava, Vasudeva Varma

Main category: cs.CL

TL;DR: AutoRev is an automated system that provides pre-submission feedback to researchers by using graph-based retrieval to reduce LLM context length and generate high-quality, actionable guidance before formal peer review.

Motivation: The academic publication process suffers from a bottleneck in obtaining high-quality pre-submission feedback, which is critical for researchers but difficult to access before formal peer review.

Method: AutoRev uses a graph-based retrieval-augmented generation framework that models papers as hierarchical document graphs, integrating textual and structural representations for efficient content retrieval. This graph-based passage retrieval reduces LLM input context length while maintaining quality.
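
A simplified sketch of a hierarchical document graph with top-k passage retrieval, using networkx; this is an illustrative reading of the idea, not AutoRev's actual implementation, and the `score` callback stands in for whatever relevance model is used.

```python
import networkx as nx

def build_paper_graph(sections):
    """Hierarchy sketch: paper -> sections -> passages.
    `sections` maps section titles to lists of passage strings."""
    g = nx.DiGraph()
    g.add_node("paper")
    for title, passages in sections.items():
        g.add_node(title, kind="section")
        g.add_edge("paper", title)
        for i, text in enumerate(passages):
            pid = f"{title}/{i}"
            g.add_node(pid, kind="passage", text=text)
            g.add_edge(title, pid)
    return g

def retrieve(g, score, k=5):
    """Keep only the top-k passages by relevance, shrinking LLM input context."""
    passages = [(score(d["text"]), n) for n, d in g.nodes(data=True) if d.get("kind") == "passage"]
    return [n for _, n in sorted(passages, reverse=True)[:k]]
```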

Result: Experimental results show AutoRev significantly outperforms baselines across multiple automatic evaluation metrics and achieves strong performance in human evaluations.

Conclusion: AutoRev effectively addresses the pre-submission feedback bottleneck by providing automated, structured, and actionable guidance to researchers, with code to be released upon acceptance.

Abstract: Obtaining high-quality, pre-submission feedback is a critical bottleneck in the academic publication lifecycle for researchers. We introduce AutoRev, an automated author-centric feedback system that generates structured, actionable guidance prior to formal peer review. AutoRev employs a graph-based retrieval-augmented generation framework that models each paper as a hierarchical document graph, integrating textual and structural representations to retrieve salient content efficiently. By leveraging graph-based passage retrieval, AutoRev substantially reduces LLM input context length, leading to higher-quality feedback generation. Experimental results demonstrate that AutoRev significantly outperforms baselines across multiple automatic evaluation metrics, while achieving strong performance in human evaluations. Code will be released upon acceptance.

[63] VietMix: A Naturally-Occurring Parallel Corpus and Augmentation Framework for Vietnamese-English Code-Mixed Machine Translation

Hieu Tran, Phuong-Anh Nguyen-Le, Huy Nghiem, Quang-Nhan Nguyen, Wei Ai, Marine Carpuat

Main category: cs.CL

TL;DR: First expert-translated Vietnamese-English code-mixed parallel corpus (VietMix) with data augmentation pipeline improves MT performance for this challenging language pair.

Motivation: MT systems degrade with code-mixed text, especially for low-resource languages like Vietnamese-English that lack dedicated parallel corpora and face challenges like orthographic ambiguity and diacritic omission in informal text.

Method: Created VietMix, the first expert-translated naturally occurring parallel corpus of Vietnamese-English code-mixed text. Developed data augmentation pipeline using iterative fine-tuning and targeted filtering.
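
A hedged sketch of an iterative fine-tune-and-filter loop of the general kind described: translate unlabeled code-mixed sources with the current model, keep only high-scoring pairs (e.g., by an xCOMET-style metric), and retrain. Every component below is a placeholder, not the paper's pipeline.

```python
def augment_iteratively(model, seed_pairs, synth_sources, translate, quality,
                        n_rounds=3, keep=0.7):
    """Iterative self-training with targeted filtering (all callbacks assumed)."""
    train = list(seed_pairs)
    for _ in range(n_rounds):
        model.finetune(train)                                         # refit on current data
        candidates = [(src, translate(model, src)) for src in synth_sources]
        candidates.sort(key=lambda p: quality(*p), reverse=True)      # score (src, hyp) pairs
        train = list(seed_pairs) + candidates[: int(keep * len(candidates))]
    return model
```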

Result: Models augmented with VietMix data outperform strong back-translation baselines by up to +3.5 xCOMET points and improve zero-shot models by up to +11.9 points.

Conclusion: Provides foundational resource for Vietnamese-English code-mixed MT and offers transferable framework for building and augmenting corpora in other low-resource settings.

Abstract: Machine translation (MT) systems universally degrade when faced with code-mixed text. This problem is more acute for low-resource languages that lack dedicated parallel corpora. This work directly addresses this gap for Vietnamese-English, a language context characterized by challenges including orthographic ambiguity and the frequent omission of diacritics in informal text. We introduce VietMix, the first expert-translated, naturally occurring parallel corpus of Vietnamese-English code-mixed text. We establish VietMix’s utility by developing a data augmentation pipeline that leverages iterative fine-tuning and targeted filtering. Experiments show that models augmented with our data outperform strong back-translation baselines by up to +3.5 xCOMET points and improve zero-shot models by up to +11.9 points. Our work delivers a foundational resource for a challenging language pair and provides a validated, transferable framework for building and augmenting corpora in other low-resource settings.

[64] Guiding Generative Storytelling with Knowledge Graphs

Zhijun Pan, Antonios Andronis, Eva Hayek, Oscar AP Wilkinson, Ilya Lasy, Annette Parry, Guy Gadney, Tim J. Smith, Mick Grierson

Main category: cs.CL

TL;DR: KG-assisted LLM storytelling pipeline improves narrative quality and user control, especially for action-oriented stories, but not for introspective ones.

Motivation: LLMs struggle with long-form coherence and user-friendly control in story generation. While RAG reduces hallucinations and KG-driven storytelling has been explored, there's a need for KG-assisted long-form generation with editable KGs coupled with LLMs.

Method: Proposed a KG-assisted storytelling pipeline evaluated in a two-stage user study with 15 participants. Participants created prompts, generated stories, and edited KGs to shape narratives. Used quantitative and qualitative analysis.

Result: Improvements concentrated in action-oriented, structurally explicit narratives, but not for introspective stories. Participants reported strong sense of control when editing KGs, describing the experience as engaging, interactive, and playful.

Conclusion: KG-assisted LLM storytelling enhances narrative quality and user control for certain story types, with KG editing providing an engaging and interactive experience for users to shape their narratives.

Abstract: Large language models (LLMs) have shown great potential in story generation, but challenges remain in maintaining long-form coherence and effective, user-friendly control. Retrieval-augmented generation (RAG) has proven effective in reducing hallucinations in text generation; while knowledge-graph (KG)-driven storytelling has been explored in prior work, this work focuses on KG-assisted long-form generation, coupling an editable KG with LLM generation in a two-stage user study. We investigate how KGs can enhance LLM-based storytelling by improving narrative quality and enabling user-driven modifications, proposing a KG-assisted storytelling pipeline and evaluating it in a user study with 15 participants. Participants created prompts, generated stories, and edited KGs to shape their narratives. Quantitative and qualitative analysis finds improvements concentrated in action-oriented, structurally explicit narratives under our settings, but not for introspective stories. Participants reported a strong sense of control when editing the KG, describing the experience as engaging, interactive, and playful.

[65] Let’s Put Ourselves in Sally’s Shoes: Shoes-of-Others Prefilling Improves Theory of Mind in Large Language Models

Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Yoshihiro Yamazaki, Keita Suzuki, Hiroaki Sugiyama, Kuniko Saito

Main category: cs.CL

TL;DR: SoO prefilling improves Theory of Mind in LLMs by adding “Let’s put ourselves in A’s shoes” to prompt beginnings, working across diverse contexts without world state changes.

Motivation: Existing inference-time ToM methods are specialized for contexts with world state changes, lacking broader applicability. Fine-tuning degrades generalization, so a more general inference-time approach is needed.

Method: Shoes-of-Others (SoO) prefilling simply prepends “Let’s put ourselves in A’s shoes.” to LLM outputs, where A is the target character. This minimal prompt modification requires no fine-tuning.
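
A minimal sketch of output prefilling with a hypothetical OpenAI-style client; APIs that natively support assistant prefill (such as Anthropic's Messages API) can take the partial assistant turn directly, while other stacks would splice the sentence into the generation prompt.

```python
def soo_prefill(messages, character, client, model="any-chat-llm"):
    """Seed the assistant turn with the perspective-taking sentence so
    generation continues from it. `client` is an illustrative placeholder."""
    prefill = f"Let's put ourselves in {character}'s shoes."
    response = client.chat.completions.create(
        model=model,
        messages=messages + [{"role": "assistant", "content": prefill}],
    )
    return prefill + " " + response.choices[0].message.content
```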

Result: SoO prefilling consistently improves ToM across five mental state categories on conversational and narrative benchmarks without world state changes. Analysis shows it elicits faithful thoughts.

Conclusion: SoO prefilling is an effective, general inference-time method for enhancing ToM in LLMs across broader scenarios, requiring minimal assumptions about context structure.

Abstract: Recent studies have shown that Theory of Mind (ToM) in large language models (LLMs) has not reached human-level performance yet. Since fine-tuning LLMs on ToM datasets often degrades their generalization, several inference-time methods have been proposed to enhance ToM in LLMs. However, existing inference-time methods for ToM are specialized for inferring beliefs from contexts involving changes in the world state. In this study, we present a new inference-time method for ToM, Shoes-of-Others (SoO) prefilling, which makes fewer assumptions about contexts and is applicable to broader scenarios. SoO prefilling simply specifies the beginning of LLM outputs with “Let’s put ourselves in A’s shoes.”, where A denotes the target character’s name. We evaluate SoO prefilling on two benchmarks that assess ToM in conversational and narrative contexts without changes in the world state and find that it consistently improves ToM across five categories of mental states. Our analysis suggests that SoO prefilling elicits faithful thoughts, thereby improving the ToM performance.

[66] Bridging External and Parametric Knowledge: Mitigating Hallucination of LLMs with Shared-Private Semantic Synergy in Dual-Stream Knowledge

Yi Sui, Chaozhuo Li, Chen Zhang, Dawei song, Qiuchi Li

Main category: cs.CL

TL;DR: DSSP-RAG is a dual-stream framework that enhances RAG by distinguishing shared/private semantics, detecting hallucinations via cognitive uncertainty, and reducing noise with Energy Quotient metrics.

DetailsMotivation: Current RAG systems face challenges with noisy external knowledge that conflicts with LLMs' parametric knowledge, lacking mechanisms to resolve such conflicts, which degrades performance.

Method: Proposes DSSP-RAG with: 1) Mixed-attention mechanism separating shared/private semantics, 2) Unsupervised hallucination detection using LLMs’ cognitive uncertainty, 3) Energy Quotient (EQ) metric based on attention differences to reduce noise.

Result: Extensive experiments show DSSP-RAG achieves superior performance over strong baselines in retrieval-augmented generation tasks.

Conclusion: The framework effectively addresses knowledge conflicts and noise in RAG systems through semantic synergy and uncertainty-based control mechanisms.

Abstract: Retrieval-augmented generation (RAG) aims to mitigate the hallucination of Large Language Models (LLMs) by retrieving and incorporating relevant external knowledge into the generation process. However, the external knowledge may contain noise and conflict with the parametric knowledge of LLMs, leading to degraded performance. Current LLMs lack inherent mechanisms for resolving such conflicts. To fill this gap, we propose a Dual-Stream Knowledge-Augmented Framework for Shared-Private Semantic Synergy (DSSP-RAG). Central to it is the refinement of the traditional self-attention into a mixed-attention that distinguishes shared and private semantics for a controlled knowledge integration. An unsupervised hallucination detection method that captures the LLMs’ intrinsic cognitive uncertainty ensures that external knowledge is introduced only when necessary. To reduce noise in external knowledge, an Energy Quotient (EQ), defined by attention difference matrices between task-aligned and task-misaligned layers, is proposed. Extensive experiments show that DSSP-RAG achieves a superior performance over strong baselines.

[67] MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings

Jean-Philippe Corbeil, Minseon Kim, Maxime Griot, Sheela Agarwal, Alessandro Sordoni, Francois Beaulieu, Paul Vozila

Main category: cs.CL

TL;DR: MedRiskEval is a medical risk evaluation benchmark for LLMs that introduces PatientSafetyBench, a patient-oriented dataset with 466 samples across 5 risk categories, addressing gaps in existing clinician-focused safety evaluations.

DetailsMotivation: As LLMs are increasingly adopted in medical applications, their outputs can directly impact human health. Existing risk evaluations focus mainly on general safety benchmarks, but medical applications involve diverse users (patients, general users, clinicians) with varying expertise levels, creating serious safety concerns that need specialized evaluation.

Method: The authors introduce MedRiskEval, a medical risk evaluation benchmark tailored to the medical domain. They create PatientSafetyBench, a new patient-oriented dataset containing 466 samples across 5 critical risk categories. They then evaluate various open- and closed-source LLMs using this new benchmark alongside existing datasets.

Result: The work establishes an initial foundation for safer deployment of LLMs in healthcare by providing a specialized medical risk evaluation framework. The PatientSafetyBench dataset addresses the gap in previous benchmarks that only focused on clinician perspectives.

Conclusion: This research fills a critical gap in LLM safety evaluation for medical applications by creating a domain-specific benchmark that considers diverse user perspectives, particularly patients, and provides a foundation for safer healthcare deployment of language models.

Abstract: As the performance of large language models (LLMs) continues to advance, their adoption in the medical domain is increasing. However, most existing risk evaluations have largely focused on general safety benchmarks. In medical applications, LLMs may be used by a wide range of users, ranging from general users and patients to clinicians, with diverse levels of expertise, and the models’ outputs can have a direct impact on human health, which raises serious safety concerns. In this paper, we introduce MedRiskEval, a medical risk evaluation benchmark tailored to the medical domain. To fill the gap in previous benchmarks that only focused on the clinician perspective, we introduce a new patient-oriented dataset called PatientSafetyBench containing 466 samples across 5 critical risk categories. Leveraging our new benchmark alongside existing datasets, we evaluate a variety of open- and closed-source LLMs. To the best of our knowledge, this work establishes an initial foundation for safer deployment of LLMs in healthcare.

[68] Mechanistic Indicators of Understanding in Large Language Models

Pierre Beckmann, Matthieu Queloz

Main category: cs.CL

TL;DR: The paper argues that mechanistic interpretability findings challenge the view that LLMs merely imitate language without understanding, proposing a three-tier framework for AI understanding that integrates philosophical theory with empirical evidence.

DetailsMotivation: To move beyond the binary debate about whether LLMs truly understand language by integrating mechanistic interpretability findings with philosophical accounts of understanding, creating a more nuanced framework for evaluating AI cognition.

Method: Proposes a three-tier hierarchical framework for understanding in LLMs: conceptual understanding (feature formation), state-of-the-world understanding (factual connections), and principled understanding (compact circuits). Uses mechanistic interpretability evidence to support each tier.

Result: Mechanistic interpretability reveals internal organizations in LLMs that can support understanding-like unification across three tiers, though these mechanisms differ from human cognition in their parallel exploitation of heterogeneous mechanisms.

Conclusion: Integrating philosophical theory with mechanistic evidence allows us to transcend binary debates about AI understanding, enabling a comparative, mechanistically grounded epistemology that explores how AI understanding aligns with and diverges from human understanding.

Abstract: Large language models (LLMs) are often portrayed as merely imitating linguistic patterns without genuine understanding. We argue that recent findings in mechanistic interpretability (MI), the emerging field probing the inner workings of LLMs, render this picture increasingly untenable, but only once those findings are integrated within a theoretical account of understanding. We propose a tiered framework for thinking about understanding in LLMs and use it to synthesize the most relevant findings to date. The framework distinguishes three hierarchical varieties of understanding, each tied to a corresponding level of computational organization: conceptual understanding emerges when a model forms “features” as directions in latent space, learning connections between diverse manifestations of a single entity or property; state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world; principled understanding emerges when a model ceases to rely on memorized facts and discovers a compact “circuit” connecting these facts. Across these tiers, MI uncovers internal organizations that can underwrite understanding-like unification. However, these also diverge from human cognition in their parallel exploitation of heterogeneous mechanisms. Fusing philosophical theory with mechanistic evidence thus allows us to transcend binary debates over whether AI understands, paving the way for a comparative, mechanistically grounded epistemology that explores how AI understanding aligns with, and diverges from, our own.

[69] GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs

Sebastian Walter, Hannah Bast

Main category: cs.CL

TL;DR: Zero-shot SPARQL query generation from natural language using LLMs that strategically explore knowledge graphs without fine-tuning, achieving SOTA results on Wikidata and competitive performance on other KGs.

DetailsMotivation: Existing approaches for generating SPARQL queries from natural language often require fine-tuning or extensive training data. The authors aim to develop a zero-shot method that can work across different knowledge graphs and language models without requiring model adaptation.

Method: Uses large language models to strategically explore knowledge graphs by executing SPARQL queries to search for relevant IRIs and literals. The approach doesn’t fine-tune the LLM but instead leverages its reasoning capabilities to navigate the graph structure through iterative querying.
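
The core explore-then-generate loop can be sketched as follows; the endpoint, the prompt wording, and the `ask_llm` chat helper are illustrative assumptions, not GRASP's exact protocol.

```python
# Hedged sketch of GRASP-style iterative KG exploration: the LLM issues probe
# queries, sees their results, and finally emits an answer query.
import json, requests

ENDPOINT = "https://query.wikidata.org/sparql"  # any SPARQL endpoint works here

def run_sparql(query: str, limit_chars: int = 2000) -> str:
    r = requests.get(ENDPOINT, params={"query": query, "format": "json"},
                     headers={"User-Agent": "grasp-sketch/0.1"}, timeout=60)
    return json.dumps(r.json().get("results", {}))[:limit_chars]  # truncated for the prompt

def generate_sparql(ask_llm, question: str, max_steps: int = 5) -> str:
    history = f"Question: {question}\n"
    for _ in range(max_steps):
        step = ask_llm(history +
                       "Reply with either EXPLORE: <probe SPARQL to look up IRIs/literals> "
                       "or FINAL: <SPARQL query that answers the question>.")
        if step.startswith("FINAL:"):
            return step[len("FINAL:"):].strip()
        probe = step.removeprefix("EXPLORE:").strip()
        history += f"Probe: {probe}\nResults: {run_sparql(probe)}\n"
    return ""  # exploration budget exhausted
```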

Result: Achieves state-of-the-art results on multiple Wikidata benchmarks despite zero-shot setting. Performs close to best few-shot methods on Freebase. Shows strong overall performance on less commonly evaluated knowledge graphs and benchmarks. Additional studies explore different graph search strategies, feedback mechanisms, and few-shot examples.

Conclusion: The proposed zero-shot approach effectively generates SPARQL queries from natural language by strategically exploring knowledge graphs with LLMs, demonstrating strong performance across diverse benchmarks and knowledge graphs without requiring fine-tuning.

Abstract: We propose a new approach for generating SPARQL queries on RDF knowledge graphs from natural language questions or keyword queries, using a large language model. Our approach does not require fine-tuning. Instead, it uses the language model to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals. We evaluate our approach on a variety of benchmarks (for knowledge graphs of different kinds and sizes) and language models (of different scales and types, commercial as well as open-source) and compare it with existing approaches. On Wikidata we reach state-of-the-art results on multiple benchmarks, despite the zero-shot setting. On Freebase we come close to the best few-shot methods. On other, less commonly evaluated knowledge graphs and benchmarks our approach also performs well overall. We conduct several additional studies, like comparing different ways of searching the graphs, incorporating a feedback mechanism, or making use of few-shot examples.

[70] Learning to Extract Rational Evidence via Reinforcement Learning for Retrieval-Augmented Generation

Xinping Zhao, Shouzheng Huang, Yan Zhong, Xinshuo Hu, Meishan Zhang, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: EviOmni: A reasoning-first approach to extract rational evidence for RAG systems, using unified reasoning-extraction trajectory with knowledge masking and RL optimization.

DetailsMotivation: Retrieval noises in RAG systems undermine LLM generation quality, and previous evidence extraction methods without deep thinking risk filtering key clues and lack generalization.

Method: EviOmni integrates evidence reasoning and extraction into unified trajectory, uses knowledge token masking to prevent information leakage, and optimizes via on-policy reinforcement learning with verifiable rewards (answer, length, format).
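
A toy version of such a verifiable reward might look like the following; the tag names, weights, and scoring rules are assumptions, since the summary specifies only that answer, length, and format are rewarded.

```python
# Toy sketch of a verifiable reward with answer, length, and format terms.
import re

def reward(trajectory: str, gold_answer: str,
           w_ans=1.0, w_len=0.2, w_fmt=0.2, target_len=256) -> float:
    # Format term: the trajectory must contain reasoning and evidence tags.
    fmt_ok = bool(re.search(r"<think>.*</think>", trajectory, re.S)) and \
             bool(re.search(r"<evidence>.*</evidence>", trajectory, re.S))
    # Answer term: exact match against the gold answer (the verifiable signal).
    m = re.search(r"<answer>(.*?)</answer>", trajectory, re.S)
    ans_ok = bool(m) and m.group(1).strip().lower() == gold_answer.strip().lower()
    # Length term: prefer compact evidence near a target word budget.
    ev = re.search(r"<evidence>(.*?)</evidence>", trajectory, re.S)
    ev_len = len(ev.group(1).split()) if ev else 0
    len_score = max(0.0, 1.0 - abs(ev_len - target_len) / target_len)
    return w_ans * ans_ok + w_len * len_score + w_fmt * fmt_ok
```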

Result: Extensive experiments on five benchmark datasets show EviOmni provides compact, high-quality evidence, enhances downstream task accuracy, and supports both traditional and agentic RAG systems.

Conclusion: EviOmni’s reasoning-first approach effectively addresses retrieval noise in RAG systems, producing superior evidence extraction that improves overall system performance across different RAG architectures.

Abstract: Retrieval-Augmented Generation (RAG) effectively improves the accuracy of Large Language Models (LLMs). However, retrieval noises significantly undermine the quality of LLMs’ generation, necessitating the development of denoising mechanisms. Previous works extract evidence straightforwardly without deep thinking, which may risk filtering out key clues and struggle with generalization. To this end, we propose EviOmni, which learns to extract rational evidence via reasoning first and then extracting. Specifically, EviOmni integrates evidence reasoning and evidence extraction into one unified trajectory, followed by knowledge token masking to avoid information leakage, optimized via on-policy reinforcement learning with verifiable rewards in terms of answer, length, and format. Extensive experiments on five benchmark datasets show the superiority of EviOmni, which provides compact and high-quality evidence, enhances the accuracy of downstream tasks, and supports both traditional and agentic RAG systems.

[71] Reservoir Computing as a Language Model

Felix Köster, Atsushi Uchida

Main category: cs.CL

TL;DR: This paper compares reservoir computing (traditional and attention-enhanced) with transformer architectures for character-level language modeling, focusing on performance, computational cost, and efficiency trade-offs.

DetailsMotivation: LLMs have impressive performance but face bottlenecks in energy consumption and processing speed, limiting accessibility. Reservoir computing offers potential for fast, energy-efficient hardware implementations for natural language processing.

Method: The study compares three approaches: 1) traditional reservoir computing with static linear readout, 2) attention-enhanced reservoir computing with dynamic output weight adaptation, and 3) transformer-based architectures. All models are evaluated with equal numbers of trainable parameters using a consistent pipeline for character-level language modeling.
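
For intuition, a self-contained echo-state-network character model with a trainable linear readout fits in a few lines; all sizes and scalings below are illustrative, and the paper's attention-enhanced variant would replace this static readout with a dynamically adapted one.

```python
# Minimal reservoir-computing character LM: only the linear readout is
# trained (ridge regression); reservoir weights stay fixed.
import numpy as np

rng = np.random.default_rng(0)
text = "hello world " * 200
chars = sorted(set(text)); idx = {c: i for i, c in enumerate(chars)}
V, N = len(chars), 500                          # vocab size, reservoir size

W_in = rng.normal(0, 0.5, (N, V))               # fixed input weights
W = rng.normal(0, 1.0, (N, N))
W *= 0.9 / abs(np.linalg.eigvals(W)).max()      # spectral radius below 1

def onehot(c):
    v = np.zeros(V); v[idx[c]] = 1.0; return v

# Collect reservoir states for each input char; the target is the next char.
states, targets = [], []
x = np.zeros(N)
for cur, nxt in zip(text[:-1], text[1:]):
    x = np.tanh(W_in @ onehot(cur) + W @ x)
    states.append(x.copy()); targets.append(onehot(nxt))
S, Y = np.array(states), np.array(targets)

# Ridge-regression readout: the only trained parameters.
W_out = np.linalg.solve(S.T @ S + 1e-3 * np.eye(N), S.T @ Y)
print("next char prediction:", repr(chars[int(np.argmax(S[-1] @ W_out))]))
```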

Result: Transformers achieve superior prediction quality, while reservoir computers offer significantly higher training and inference efficiency. The attention-enhanced reservoir variant shows improved performance over traditional reservoir computing while maintaining efficiency advantages.

Conclusion: Reservoir computing provides a promising alternative to transformers for resource-constrained applications, offering substantial efficiency gains with acceptable performance trade-offs. The study provides guidelines for balancing resource constraints with performance requirements in language modeling tasks.

Abstract: Large Language Models (LLMs) have dominated the science and media landscape due to their impressive performance in processing large amounts of data and producing human-like text. Nevertheless, their huge energy demand and slow processing remain a bottleneck to further increasing quality while also making the models accessible to everyone. To address this bottleneck, we investigate how reservoir computing performs on natural text processing, which could enable fast and energy-efficient hardware implementations. Studies investigating the use of reservoir computing as a language model remain sparse. In this paper, we compare three distinct approaches for character-level language modeling: two different *reservoir computing* approaches, where only an output layer is trainable, and the well-known *transformer*-based architectures, which fully learn an attention-based sequence representation. We explore the performance, computational cost, and prediction accuracy of both paradigms by equally varying the number of trainable parameters for all models. Using a consistent pipeline for all three approaches, we demonstrate that transformers excel in prediction quality, whereas reservoir computers remain highly efficient, substantially reducing training and inference time. Furthermore, we investigate two types of reservoir computing: a *traditional reservoir* with a static linear readout, and an *attention-enhanced reservoir* that dynamically adapts its output weights via an attention mechanism. Our findings underline how these paradigms scale and offer guidelines for balancing resource constraints with performance.

[72] CliCARE: Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records

Dongchen Li, Jitao Liang, Wei Li, Xiaoyu Wang, Longbing Cao, Kun Yu

Main category: cs.CL

TL;DR: CliCARE is a framework that grounds LLMs in clinical guidelines for cancer EHR decision support by converting longitudinal EHRs into Temporal Knowledge Graphs and aligning them with guideline knowledge graphs.

DetailsMotivation: LLMs show promise for clinical decision support but face challenges with long, fragmented EHRs, clinical hallucination risks (RAG doesn't incorporate clinical guidelines), and unreliable evaluation metrics in oncology.

Method: Transforms unstructured longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, then grounds decision support by aligning patient trajectories with normative guideline knowledge graphs.
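
A minimal sketch of the TKG construction step, assuming event tuples have already been extracted from the EHR by an upstream information-extraction model (the tuples and relation names below are invented for illustration):

```python
# Hedged sketch: build a patient-specific temporal knowledge graph from
# extracted (time, subject, relation, object) events.
import networkx as nx

events = [  # illustrative only; produced upstream from the raw EHR
    ("2023-01-05", "patient", "diagnosed_with", "NSCLC stage II"),
    ("2023-02-10", "patient", "started", "cisplatin"),
    ("2023-05-02", "patient", "imaging_shows", "partial response"),
]

tkg = nx.MultiDiGraph()
for t, s, rel, o in sorted(events):           # chronological order
    tkg.add_edge(s, o, relation=rel, time=t)  # edges carry time stamps

# The patient "trajectory" is the time-ordered edge sequence, which can then
# be aligned against a normative guideline graph before prompting the LLM.
for s, o, d in sorted(tkg.edges(data=True), key=lambda e: e[2]["time"]):
    print(d["time"], s, d["relation"], o)
```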

Result: Outperforms baselines (including long-context LLMs and KG-enhanced RAG methods) on both private Chinese cancer dataset and public MIMIC-IV dataset, with high correlation with oncologist assessments.

Conclusion: CliCARE effectively addresses key challenges in LLM-based clinical decision support by grounding in clinical guidelines through knowledge graph alignment, providing evidence-based summaries and recommendations.

Abstract: Large Language Models (LLMs) hold significant promise for improving clinical decision support and reducing physician burnout by synthesizing complex, longitudinal cancer Electronic Health Records (EHRs). However, their implementation in this critical field faces three primary challenges: the inability to effectively process the extensive length and fragmented nature of patient records for accurate temporal analysis; a heightened risk of clinical hallucination, as conventional grounding techniques such as Retrieval-Augmented Generation (RAG) do not adequately incorporate process-oriented clinical guidelines; and unreliable evaluation metrics that hinder the validation of AI systems in oncology. To address these issues, we propose CliCARE, a framework for Grounding Large Language Models in Clinical Guidelines for Decision Support over Longitudinal Cancer Electronic Health Records. The framework operates by transforming unstructured, longitudinal EHRs into patient-specific Temporal Knowledge Graphs (TKGs) to capture long-range dependencies, and then grounding the decision support process by aligning these real-world patient trajectories with a normative guideline knowledge graph. This approach provides oncologists with evidence-grounded decision support by generating a high-fidelity clinical summary and an actionable recommendation. We validated our framework using large-scale, longitudinal data from a private Chinese cancer dataset and the public English MIMIC-IV dataset. In these settings, CliCARE significantly outperforms baselines, including leading long-context LLMs and Knowledge Graph-enhanced RAG methods. The clinical validity of our results is supported by a robust evaluation protocol, which demonstrates a high correlation with assessments made by oncologists.

[73] GREP: A Multi-Turn Evaluation Framework for Related Work Generation

Furkan Şahinuç, Subhabrata Dutta, Iryna Gurevych

Main category: cs.CL

TL;DR: GREP is a multi-turn evaluation framework for assessing AI-generated scientific writing quality, specifically for related work sections, by integrating domain-specific criteria with expert preferences through fine-grained dimensions and contrastive examples.

DetailsMotivation: Current evaluation metrics and LLM-as-judge systems fail to capture expert preferences and domain-specific quality standards needed for assessing AI-generated scientific writing, particularly for challenging tasks like related work generation.

Method: Proposed GREP framework: multi-turn evaluation that decomposes assessment into fine-grained dimensions based on classical related work evaluation criteria and expert preferences, augmented with contrastive examples for contextual guidance.

Result: GREP demonstrates more robust evaluation compared to standard LLM judges, reflects natural scientific writing scenarios, and shows strong correlation with human expert assessments. Also reveals that state-of-the-art LLM generations struggle to meet validation constraints for related work sections.

Conclusion: GREP addresses the critical gap in evaluating AI-generated scientific writing by providing a domain-aware, expert-aligned evaluation framework that enables more realistic human-AI collaborative writing, particularly for challenging tasks like related work generation.

Abstract: Expert domain writing, such as scientific writing, typically demands extensive domain knowledge. Although large language models (LLMs) show promising potential in this task, evaluating the quality of automatically generated scientific writing is a crucial open issue, as it requires knowledge of domain-specific criteria and the ability to discern expert preferences. Conventional task-agnostic automatic evaluation metrics and LLM-as-a-judge systems, primarily designed for mainstream NLP tasks, are insufficient to grasp expert preferences and domain-specific quality standards. To address this gap and support realistic human-AI collaborative writing, we focus on related work generation, one of the most challenging scientific tasks, as an exemplar. We propose GREP, a multi-turn evaluation framework that integrates classical related work evaluation criteria with expert-specific preferences. Our framework decomposes the evaluation into smaller fine-grained dimensions. This localized evaluation is further augmented with contrastive examples to provide detailed contextual guidance for the evaluation dimensions. Empirical investigation reveals that our framework is able to assess the quality of related work sections in a much more robust manner compared to standard LLM judges, reflects natural scenarios of scientific writing, and bears a strong correlation with the assessment of human experts. We also observe that generations from state-of-the-art (SoTA) LLMs struggle to satisfy validation constraints of a suitable related work section.

[74] MAGneT: Coordinated Multi-Agent Generation of Synthetic Multi-Turn Mental Health Counseling Sessions

Aishik Mandal, Tanmoy Chakraborty, Iryna Gurevych

Main category: cs.CL

TL;DR: MAGneT is a multi-agent framework for generating synthetic psychological counseling sessions that outperforms single-agent approaches through specialized LLM agents modeling different psychological techniques.

DetailsMotivation: There's growing demand for scalable psychological counseling but a scarcity of high-quality, privacy-compliant training data for AI systems.

Method: Multi-agent framework decomposing counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Also proposes unified evaluation framework with expanded expert assessment across nine counseling dimensions.

Result: MAGneT substantially outperforms existing methods: experts prefer MAGneT-generated sessions in 77.2% of cases, with 3.2% higher general counseling skills and 4.3% higher CBT-specific skills on CTRS. Fine-tuned Llama3-8B-Instruct model outperforms baseline models by 6.9% on average.

Conclusion: MAGneT effectively generates high-quality synthetic psychological counseling data that better captures real counseling structure and nuance, with comprehensive evaluation framework and public release of code and data.

Abstract: The growing demand for scalable psychological counseling highlights the need for high-quality, privacy-compliant data, yet such data remains scarce. Here we introduce MAGneT, a novel multi-agent framework for synthetic psychological counseling session generation that decomposes counselor response generation into coordinated sub-tasks handled by specialized LLM agents, each modeling a key psychological technique. Unlike prior single-agent approaches, MAGneT better captures the structure and nuance of real counseling. We further propose a unified evaluation framework that consolidates diverse automatic metrics and expands expert assessment from four to nine counseling dimensions, thus addressing inconsistencies in prior evaluation protocols. Empirically, MAGneT substantially outperforms existing methods: experts prefer MAGneT-generated sessions in 77.2% of cases, and sessions generated by MAGneT yield 3.2% higher general counseling skills and 4.3% higher CBT-specific skills on the cognitive therapy rating scale (CTRS). An open-source Llama3-8B-Instruct model fine-tuned on MAGneT-generated data also outperforms models fine-tuned on baseline synthetic datasets by 6.9% on average on CTRS. We also make our code and data public.

[75] Memorization in Large Language Models in Medicine: Prevalence, Characteristics, and Implications

Anran Li, Lingfei Qian, Mengmeng Du, Yu Yin, Yan Hu, Zihao Sun, Yihang Fu, Hyunjae Kim, Erica Stutz, Xuguang Ai, Qianqian Xie, Rui Zhu, Jimin Huang, Yifan Yang, Siru Liu, Yih-Chung Tham, Lucila Ohno-Machado, Hyunghoon Cho, Zhiyong Lu, Hua Xu, Qingyu Chen

Main category: cs.CL

TL;DR: LLMs in medicine show significant memorization of training data across adaptation scenarios, with higher rates than general domain, persistence through fine-tuning, and potential risks for clinical applications.

DetailsMotivation: To investigate the extent of memorization in medical LLMs, as memorization can both help retain valuable medical knowledge but also risks reproducing sensitive patient data, reducing generalizability, and causing misleading outputs that hinder clinical adoption.

Method: Systematic analysis of memorization across three adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data including over 13,000 inpatient records from Yale New Haven Health System.
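
A standard way to operationalize such a memorization check is greedy prefix-completion overlap; the probe below is a generic sketch of that technique (token budgets and the threshold are assumptions, and the paper's exact protocol may differ), assuming Hugging Face `model` and `tok` objects.

```python
# Hedged sketch of a prefix-completion memorization probe.
import torch

def memorized(model, tok, sample: str, prefix_tokens=50, suffix_tokens=50,
              threshold=0.9) -> bool:
    ids = tok(sample, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_tokens].unsqueeze(0)
    gold = ids[prefix_tokens:prefix_tokens + suffix_tokens]
    with torch.no_grad():  # greedy decoding from the training-sample prefix
        out = model.generate(prefix, max_new_tokens=suffix_tokens, do_sample=False)
    gen = out[0, prefix.shape[1]:]
    n = min(len(gen), len(gold))
    overlap = (gen[:n] == gold[:n]).float().mean().item() if n else 0.0
    return overlap >= threshold  # high greedy overlap suggests memorization
```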

Result: Memorization is prevalent across all adaptation scenarios, significantly higher than in general domain. Distinct characteristics observed between continued pre-training and fine-tuning. Memorization is persistent - up to 87% of content memorized during continued pre-training remains after fine-tuning on new medical tasks.

Conclusion: Medical LLMs exhibit substantial memorization that persists through adaptation, raising concerns about privacy, generalizability, and clinical safety that need to be addressed for responsible deployment in healthcare settings.

Abstract: Large Language Models (LLMs) have demonstrated significant potential in medicine, with many studies adapting them through continued pre-training or fine-tuning on medical data to enhance domain-specific accuracy and safety. However, a key open question remains: to what extent do LLMs memorize medical training data? Memorization can be beneficial when it enables LLMs to retain valuable medical knowledge during domain adaptation. Yet, it also raises concerns. LLMs may inadvertently reproduce sensitive clinical content (e.g., patient-specific details), and excessive memorization may reduce model generalizability, increasing risks of misdiagnosis and making unwarranted recommendations. These risks are further amplified by the generative nature of LLMs, which can not only surface memorized content but also produce overconfident, misleading outputs that may hinder clinical adoption. In this work, we present a study on memorization of LLMs in medicine, assessing its prevalence (how frequently it occurs), characteristics (what is memorized), volume (how much content is memorized), and potential downstream impacts (how memorization may affect medical applications). We systematically analyze common adaptation scenarios: (1) continued pretraining on medical corpora, (2) fine-tuning on standard medical benchmarks, and (3) fine-tuning on real-world clinical data, including over 13,000 unique inpatient records from Yale New Haven Health System. The results demonstrate that memorization is prevalent across all adaptation scenarios and significantly higher than that reported in the general domain. Moreover, memorization has distinct characteristics during continued pre-training and fine-tuning, and it is persistent: up to 87% of content memorized during continued pre-training remains after fine-tuning on new medical tasks.

[76] FS-DFM: Fast and Accurate Long Text Generation with Few-Step Diffusion Language Models

Amin Karimi Monsefi, Nikhil Bhendawade, Manuel Rafael Ciosici, Dominic Culver, Yizhe Zhang, Irina Belousova

Main category: cs.CL

TL;DR: FS-DFM is a discrete flow-matching model that enables high-quality language generation in just 8 steps instead of hundreds/thousands, achieving 128× faster sampling while maintaining perplexity parity.

DetailsMotivation: Autoregressive models are serial (one token per pass) causing throughput/latency issues, while diffusion models need hundreds/thousands of steps for quality. Need fast parallel generation without sacrificing quality.

Method: Few-Step Discrete Flow-Matching makes sampling steps an explicit parameter, trains for consistency across step budgets, uses reliable update rule to avoid overshooting, and applies strong teacher guidance distilled from long trajectories.
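
Conceptually, making the step budget an explicit parameter yields a sampler like the toy below, which at each step moves a fraction of the remaining positions toward the model's predicted target distribution; this is an illustrative caricature of few-step discrete sampling, not FS-DFM's actual update rule.

```python
# Conceptual toy of few-step discrete sampling with an explicit step budget.
import torch

def few_step_sample(model, x, num_steps: int):
    """x: (batch, seq) noisy/masked token ids; model(x, t) -> (batch, seq, vocab) logits."""
    for i in range(num_steps):
        t = torch.full((x.shape[0],), i / num_steps)    # time in [0, 1)
        probs = torch.softmax(model(x, t), dim=-1)      # predicted clean tokens
        frac = 1.0 / (num_steps - i)                    # share of remaining path
        resample = torch.rand(x.shape) < frac           # positions to move now
        proposal = torch.distributions.Categorical(probs).sample()
        x = torch.where(resample, proposal, x)          # step toward the target
    return x
```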

Result: FS-DFM with 8 steps matches perplexity of 1,024-step baseline for 1,024-token generation using similar-size model, achieving 128× faster sampling with corresponding latency/throughput gains.

Conclusion: FS-DFM enables efficient few-step discrete flow matching for language generation, providing stable, accurate, and controllable sampling with dramatic speed improvements while maintaining quality.

Abstract: Autoregressive language models (ARMs) deliver strong likelihoods, but are inherently serial: they generate one token per forward pass, which limits throughput and inflates latency for long sequences. Diffusion Language Models (DLMs) parallelize across positions and thus appear promising for language generation, yet standard discrete diffusion typically needs hundreds to thousands of model evaluations to reach high quality, trading serial depth for iterative breadth. We introduce FS-DFM, Few-Step Discrete Flow-Matching, a discrete flow-matching model designed for speed without sacrificing quality. The core idea is simple: make the number of sampling steps an explicit parameter and train the model to be consistent across step budgets, so one big move lands where many small moves would. We pair this with a reliable update rule that moves probability in the right direction without overshooting, and with strong teacher guidance distilled from long-run trajectories. Together, these choices make few-step sampling stable, accurate, and easy to control. On language modeling benchmarks, FS-DFM with 8 sampling steps achieves perplexity parity with a 1,024-step discrete-flow baseline for generating 1,024 tokens using a similar-size model, delivering up to 128 times faster sampling and corresponding latency/throughput gains.

[77] UPDESH: Synthesizing Grounded Instruction Tuning Data for 13 Indic Languages

Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram

Main category: cs.CL

TL;DR: Updesh: A 9.5M synthetic instruction dataset across 13 Indian languages and English, generated from Wikipedia content using large LLMs, showing significant improvements in multilingual AI performance through culturally-grounded data.

DetailsMotivation: Addressing the challenge of developing culturally grounded multilingual AI systems for low-resource languages, where synthetic data's effectiveness in multilingual/multicultural contexts is underexplored, and moving beyond dominant English-centric translation approaches.

Method: Bottom-up synthetic data generation using large open-source LLMs (>=235B parameters) grounded in language-specific Wikipedia content, creating Updesh dataset with 9.5M instruction-following data points across 13 Indian languages and English covering diverse reasoning and generative tasks.

Result: High-quality synthetic data confirmed through automated metrics and 10K human assessments; models fine-tuned on Updesh show consistent significant improvements on NLU and NLG evaluations across 13 diverse multilingual datasets; ablation studies demonstrate context-aware, culturally grounded data generation is essential.

Conclusion: Culturally grounded synthetic data generation from language-specific sources (like Wikipedia) is crucial for effective multilingual AI development, offering a promising alternative to top-down translation approaches and enabling better performance for low-resource languages.

Abstract: Developing culturally grounded multilingual AI systems remains challenging, particularly for low-resource languages. While synthetic data offers promise, its effectiveness in multilingual and multicultural contexts is underexplored. We investigate bottom-up synthetic data generation using large open-source LLMs (>= 235B parameters) grounded in language-specific Wikipedia content, complementing dominant top-down translation-based approaches from English. We introduce Updesh, a high-quality, large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages and English, encompassing diverse reasoning and generative tasks. Comprehensive evaluation using automated metrics and 10K human assessments confirms high data quality. Downstream evaluations, performed by fine-tuning models on various datasets and assessing performance across 13 diverse multilingual datasets as well as through comparative model evaluations, demonstrate that models trained on Updesh consistently obtain significant improvements on both NLU and NLG tasks. Finally, through ablation studies and cultural evaluations, we show that context-aware, culturally grounded data generation is essential for effective multilingual AI development.

[78] Fine-tuning Done Right in Model Editing

Wanli Yang, Fei Sun, Rui Tang, Hongyu Zang, Du Su, Qi Cao, Jingang Wang, Huawei Shen, Xueqi Cheng

Main category: cs.CL

TL;DR: Fine-tuning can be effective for model editing when using breadth-first (epoch-based) pipeline instead of depth-first sequential editing, and with localized tuning parameters.

DetailsMotivation: Challenge the belief that fine-tuning is ineffective for model editing, arguing that previous failures come from using suboptimal sequential editing pipelines rather than inherent limitations of fine-tuning.

Method: Restore fine-tuning to standard breadth-first (epoch-based) pipeline with mini-batch optimization instead of sequential depth-first editing. Develop LocFT-BF with systematic analysis of optimal tuning parameter locations for localized editing.
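
The pipeline distinction is easy to state in code; `loss_fn` is an assumed per-edit loss, and the optimizer and hyperparameters are illustrative.

```python
# Sketch contrasting the two editing pipelines discussed above.
import torch

def depth_first_editing(model, edits, loss_fn, steps_per_edit=50, lr=1e-4):
    """Prior practice: optimize each edit to convergence before the next."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for edit in edits:                    # sequential, single pass
        for _ in range(steps_per_edit):   # over-optimizes each sample
            opt.zero_grad(); loss_fn(model, edit).backward(); opt.step()

def breadth_first_editing(model, edits, loss_fn, epochs=5, batch_size=8, lr=1e-4):
    """Restored standard pipeline: epochs of mini-batches over all edits."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for i in range(0, len(edits), batch_size):
            batch = edits[i:i + batch_size]
            loss = sum(loss_fn(model, e) for e in batch) / len(batch)
            opt.zero_grad(); loss.backward(); opt.step()
```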

Result: LocFT-BF outperforms state-of-the-art methods by large margins, sustains 100K edits and 72B-parameter models (10x beyond prior practice) without sacrificing general capabilities.

Conclusion: Fine-tuning can be a leading method for model editing when properly implemented with breadth-first pipeline and localized tuning, correcting a long-standing misconception and establishing foundation for future research.

Abstract: Fine-tuning, a foundational method for adapting large language models, has long been considered ineffective for model editing. Here, we challenge this belief, arguing that the reported failure arises not from the inherent limitation of fine-tuning itself, but from adapting it to the sequential nature of the editing task, a single-pass depth-first pipeline that optimizes each sample to convergence before moving on. While intuitive, this depth-first pipeline coupled with sample-wise updating over-optimizes each edit and induces interference across edits. Our controlled experiments reveal that simply restoring fine-tuning to the standard breadth-first (i.e., epoch-based) pipeline with mini-batch optimization substantially improves its effectiveness for model editing. Moreover, fine-tuning in editing also suffers from suboptimal tuning parameter locations inherited from prior methods. Through systematic analysis of tuning locations, we derive LocFT-BF, a simple and effective localized editing method built on the restored fine-tuning framework. Extensive experiments across diverse LLMs and datasets demonstrate that LocFT-BF outperforms state-of-the-art methods by large margins. Notably, to our knowledge, it is the first to sustain 100K edits and 72B-parameter models, 10× beyond prior practice, without sacrificing general capabilities. By clarifying a long-standing misconception and introducing a principled localized tuning strategy, we advance fine-tuning from an underestimated baseline to a leading method for model editing, establishing a solid foundation for future research.

[79] Parallel Test-Time Scaling for Latent Reasoning Models

Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li

Main category: cs.CL

TL;DR: This paper enables parallel test-time scaling (TTS) for latent reasoning models by introducing stochastic sampling strategies (Monte Carlo Dropout and Additive Gaussian Noise) and a Latent Reward Model for trajectory aggregation.

DetailsMotivation: While parallel TTS has been successful for token-based Chain-of-Thought reasoning, it remains unclear whether latent reasoning models (which operate in continuous vector spaces) can similarly benefit from parallel TTS due to two key challenges: lack of sampling mechanisms in continuous space and absence of probabilistic signals for trajectory aggregation.

Method: 1) Introduces two uncertainty-inspired stochastic sampling strategies: Monte Carlo Dropout and Additive Gaussian Noise to enable sampling in continuous latent spaces. 2) Designs a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning trajectories for effective aggregation.
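
The two sampling strategies are straightforward to sketch on top of any latent reasoning step function; noise scales and where the perturbation is applied are assumptions.

```python
# Sketch of the two stochastic sampling strategies for latent reasoning.
import torch
import torch.nn.functional as F

def mc_dropout(h, p=0.1):
    """Monte Carlo Dropout: keep dropout active at inference time."""
    return F.dropout(h, p=p, training=True)

def gaussian_noise(h, sigma=0.05):
    """Additive Gaussian Noise on the latent state."""
    return h + sigma * torch.randn_like(h)

def sample_trajectories(step_fn, h0, perturb=gaussian_noise, n_samples=8, n_steps=6):
    trajs = []
    for _ in range(n_samples):            # parallel TTS: diverse rollouts
        h, states = h0, []
        for _ in range(n_steps):
            h = step_fn(perturb(h))       # stochasticity enables sampling
            states.append(h)
        trajs.append(states)
    return trajs                          # to be scored by a LatentRM downstream
```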

Result: Extensive experiments show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. The approach opens a new direction for scalable inference in continuous spaces.

Conclusion: The work successfully enables parallel TTS for latent reasoning models by addressing the key challenges of sampling in continuous spaces and trajectory aggregation, demonstrating that latent models can benefit from parallel scaling similar to token-based approaches.

Abstract: Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code and checkpoints released at https://github.com/ModalityDance/LatentTTS

[80] The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models

Sherzod Hakimov, Roland Bernard, Tim Leiber, Karl Osswald, Kristina Richert, Ruilin Yang, Raffaella Bernardi, David Schlangen

Main category: cs.CL

TL;DR: Reasoning training significantly improves LLM negotiation performance but at high computational cost, with multilingual reasoning differences between open-weight and commercial models.

DetailsMotivation: To systematically evaluate how explicit reasoning training affects negotiation abilities of LLMs, comparing commercial and open-weight models across languages, and analyzing performance-cost trade-offs.

Method: Self-play setup across three diverse dialogue games, evaluating reasoning-enabled vs vanilla models in three languages, analyzing performance, cost, language consistency, and strategic adaptation.

Result: Reasoning improves negotiation outcomes significantly (31.4% for GPT-5) but increases cost by nearly 400%. Open-weight models switch to English for internal reasoning even when negotiating in German/Italian, while commercial models maintain language consistency.

Conclusion: Scaling test-time compute through reasoning enhances collaboration and overcomes task complexity, but at substantial cost. Multilingual reasoning patterns differ between model types, potentially impacting explainability gains from reasoning traces.

Abstract: Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We present the first comprehensive study that systematically evaluates how explicit reasoning training affects the negotiation abilities of both commercial and open-weight large language models, comparing these models to their vanilla counterparts across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning – that is, scaling test-time compute – significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5’s performance by 31.4% while increasing its cost by nearly 400%. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while a leading commercial model maintains language consistency between reasoning and final output.

[81] ADVICE: Answer-Dependent Verbalized Confidence Estimation

Ki Jung Seo, Sehun Lim, Taeuk Kim

Main category: cs.CL

TL;DR: ADVICE framework improves LLM confidence calibration by making confidence estimates answer-dependent, reducing systematic overconfidence.

DetailsMotivation: LLMs can express confidence in natural language but often show systematic overconfidence, whose causes are poorly understood. The paper aims to address this reliability issue in verbalized confidence estimation.

Method: Introduces ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that promotes answer-grounded confidence estimation by conditioning confidence on the model’s own answer.
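
The answer-grounded elicitation format can be sketched as a two-stage prompt, with confidence requested after and conditioned on the model's own answer; the wording is an assumption, and ADVICE additionally fine-tunes the model on such data rather than relying on prompting alone.

```python
# Sketch of answer-dependent confidence elicitation. `ask_llm` is an assumed
# chat helper (prompt string -> response string).

def answer_then_confidence(ask_llm, question: str):
    answer = ask_llm(f"Question: {question}\nAnswer concisely.")
    conf = ask_llm(
        f"Question: {question}\n"
        f"Your answer was: {answer}\n"
        "How confident are you that THIS answer is correct? "
        "Reply with only a probability between 0 and 1.")
    # Assumes the reply is a bare number; real pipelines would parse robustly.
    return answer, float(conf.strip().rstrip("%"))
```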

Result: ADVICE substantially improves confidence calibration, shows strong generalization to unseen settings, and maintains task performance without degradation. Gains stem from enhanced answer dependence.

Conclusion: Answer-independence is a primary driver of LLM overconfidence. ADVICE enables trustworthy confidence verbalization by making confidence estimates answer-dependent, shedding light on overconfidence origins.

Abstract: Recent progress in large language models (LLMs) has enabled them to communicate their confidence in natural language, improving transparency and reliability. However, this expressiveness is often accompanied by systematic overconfidence, whose underlying causes remain poorly understood. In this work, we analyze the dynamics of verbalized confidence estimation and identify answer-independence – the failure to condition confidence on the model’s own answer – as a primary driver of this behavior. To address this, we introduce ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that promotes answer-grounded confidence estimation. Extensive experiments show that ADVICE substantially improves confidence calibration, while exhibiting strong generalization to unseen settings without degrading task performance. We further demonstrate that these gains stem from enhanced answer dependence, shedding light on the origins of overconfidence and enabling trustworthy confidence verbalization.

[82] KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification

Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han

Main category: cs.CL

TL;DR: KOTOX is a Korean toxic dataset for deobfuscation and detoxification that addresses the challenge of disguised toxic expressions in Korean’s agglutinative language structure.

DetailsMotivation: Existing toxic language detection research focuses on non-obfuscated text, but Korean's agglutinative characteristics allow users to easily disguise toxic expressions through obfuscation, which remains largely unexplored in Korean language research.

Method: Researchers categorized Korean obfuscation patterns into linguistically grounded classes, defined transformation rules from real-world examples, and created paired datasets with neutral sentences, toxic sentences, and their obfuscated counterparts.

Result: Models trained on KOTOX dataset better handle obfuscated toxic text without sacrificing performance on non-obfuscated text. This is the first Korean dataset supporting both deobfuscation and detoxification.

Conclusion: KOTOX facilitates better understanding and mitigation of obfuscated toxic content in Korean LLMs, with code and data publicly available for research use.

Abstract: Online communication increasingly amplifies toxic language, and recent research actively explores methods for detecting and rewriting such content. Existing studies primarily focus on non-obfuscated text, which limits robustness in situations where users intentionally disguise toxic expressions. In particular, Korean allows toxic expressions to be easily disguised through its agglutinative characteristics. However, obfuscation in Korean remains largely unexplored, which motivates us to introduce KOTOX, a Korean toxic dataset for deobfuscation and detoxification. We categorize Korean obfuscation patterns into linguistically grounded classes and define transformation rules derived from real-world examples. Using these rules, we provide paired neutral and toxic sentences alongside their obfuscated counterparts. Models trained on our dataset better handle obfuscated text without sacrificing performance on non-obfuscated text. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect it to facilitate better understanding and mitigation of obfuscated toxic content in LLMs for Korean. Our code and data are available at https://github.com/leeyejin1231/KOTOX.

[83] Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding

Sensen Gao, Shanshan Zhao, Xu Jiang, Lunhao Duan, Yong Xien Chng, Qing-Guo Chen, Weihua Luo, Kaifu Zhang, Jia-Wang Bian, Mingming Gong

Main category: cs.CL

TL;DR: This paper presents a systematic survey of Multimodal RAG for document understanding, proposing a taxonomy, reviewing advances, and highlighting open challenges.

DetailsMotivation: Current document understanding approaches have limitations: OCR-based pipelines lose structural detail, while native MLLMs struggle with context modeling. Documents' multimodal nature (text, tables, charts, layout) demands a more advanced paradigm for comprehensive document intelligence.

Method: The paper conducts a systematic survey of Multimodal RAG for document understanding, proposing a taxonomy based on domain, retrieval modality, and granularity. It reviews advances involving graph structures and agentic frameworks.

Result: The survey summarizes key datasets, benchmarks, applications, and industry deployment of Multimodal RAG. It provides a comprehensive overview of the field’s current state.

Conclusion: Multimodal RAG enables holistic retrieval and reasoning across all document modalities, unlocking comprehensive document intelligence. The paper highlights open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.

Abstract: Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations: the former loses structural detail, while the latter struggles with context modeling. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents’ multimodal nature, i.e., combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG. This approach enables holistic retrieval and reasoning across all modalities, unlocking comprehensive document intelligence. Recognizing its importance, this paper presents a systematic survey of Multimodal RAG for document understanding. We propose a taxonomy based on domain, retrieval modality, and granularity, and review advances involving graph structures and agentic frameworks. We also summarize key datasets, benchmarks, applications and industry deployment, and highlight open challenges in efficiency, fine-grained representation, and robustness, providing a roadmap for future progress in document AI.

[84] From Fact to Judgment: Investigating the Impact of Task Framing on LLM Conviction in Dialogue Systems

Parisa Rabbani, Nimet Beyza Bozdag, Dilek Hakkani-Tür

Main category: cs.CL

TL;DR: LLM judges show significant performance changes (9.24% average) when tasks shift from direct factual queries to conversational judgment tasks, with models exhibiting sycophantic or overly-critical tendencies under social framing.

DetailsMotivation: As LLMs are increasingly used as judges for social and conversational tasks, it's unclear whether they can reliably assess tasks requiring social judgment. The paper investigates how LLM conviction changes when tasks are reframed from direct factual queries to conversational judgment tasks.

Method: The evaluation framework contrasts model performance on direct factual queries with assessment of speaker correctness in minimal dialogues, shifting from “Is this statement correct?” to “Is this speaker correct?”. Pressure is applied via simple rebuttals to measure how firmly models maintain positions under conversational pressure.
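
A sketch of the two framings and the rebuttal probe follows; the quoted strings track the paper's description, while `ask_llm` and the follow-up wording are assumptions.

```python
# Sketch of task reframing from factual query to conversational judgment,
# plus the rebuttal perturbation used to measure conviction.

def factual_framing(statement: str) -> str:
    return f'Is this statement correct? "{statement}" Answer yes or no.'

def judgment_framing(speaker: str, statement: str) -> str:
    return (f'{speaker}: "{statement}"\n'
            "Is this speaker correct? Answer yes or no.")

def with_rebuttal(ask_llm, prompt: str):
    first = ask_llm(prompt)
    second = ask_llm(f"{prompt}\nYou said: {first}\n"
                     "The previous answer is incorrect.\n"
                     "Do you maintain your answer? Answer yes or no.")
    return first, second  # flip rate measures conviction under pressure
```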

Result: Models show different tendencies: GPT-4o-mini reveals sycophantic tendencies under social framing, while Llama-8B-Instruct becomes overly-critical. Average performance change of 9.24% across all models demonstrates that minimal dialogue context significantly alters model judgment.

Conclusion: Conversational framing is a key factor in LLM-based evaluation, and the proposed framework offers reproducible methodology for diagnosing model conviction, contributing to development of more trustworthy dialogue systems.

Abstract: LLMs are increasingly employed as judges across a variety of tasks, including those involving everyday social interactions. Yet, it remains unclear whether such LLM-judges can reliably assess tasks that require social or conversational judgment. We investigate how an LLM’s conviction is changed when a task is reframed from a direct factual query to a Conversational Judgment Task. Our evaluation framework contrasts the model’s performance on direct factual queries with its assessment of a speaker’s correctness when the same information is presented within a minimal dialogue, effectively shifting the query from “Is this statement correct?” to “Is this speaker correct?”. Furthermore, we apply pressure in the form of a simple rebuttal (“The previous answer is incorrect.”) to both conditions. This perturbation allows us to measure how firmly the model maintains its position under conversational pressure. Our findings show that while some models like GPT-4o-mini reveal sycophantic tendencies under social framing tasks, others like Llama-8B-Instruct become overly-critical. We observe an average performance change of 9.24% across all models, demonstrating that even minimal dialogue context can significantly alter model judgment, underscoring conversational framing as a key factor in LLM-based evaluation. The proposed framework offers a reproducible methodology for diagnosing model conviction and contributes to the development of more trustworthy dialogue systems.

[85] MajinBook: An open catalogue of digital world literature with likes

Antoine Mazières, Thierry Poibeau

Main category: cs.CL

TL;DR: MajinBook is an open catalogue linking shadow library metadata with Goodreads data to create a high-precision corpus of 539,000+ English books for computational social science research.

DetailsMotivation: To facilitate the use of shadow libraries (Library Genesis, Z-Library) for computational social science and cultural analytics by creating structured, enriched datasets that address biases in traditional corpora like HathiTrust.

Method: Links metadata from shadow libraries with structured bibliographic data from Goodreads, prioritizes natively digital EPUB files for machine-readability, includes secondary datasets for French, German, and Spanish, and evaluates linkage accuracy.

Result: Created a corpus of over 539,000 English-language book references spanning three centuries, enriched with publication dates, genres, ratings, and reviews, with all data released openly.

Conclusion: MajinBook provides a valuable resource for computational social science research, addresses legal permissibility concerns under EU and US text/data mining frameworks, and enables large-scale cultural analytics using shadow library content.

Abstract: This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries–such as Library Genesis and Z-Library–for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to English-language books spanning three centuries, enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritizes natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project’s legal permissibility under EU and US frameworks for text and data mining in research.

[86] Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor

Ivan Zakazov, Berke Argin, Oussama Gabouj, Kamel Charaf, Alexander Sharipov, Alexi Semiz, Lorenzo Drudi, Nicolas Baldwin, Robert West

Main category: cs.CL

TL;DR: Researchers introduce a novel LLM prompt compression paradigm using smaller LLMs to compress inputs for larger ones, develop a comprehensive benchmark, optimize compression performance, and create Cmprsr - a specialized compression model that outperforms existing methods across various compression rates and input lengths.

DetailsMotivation: High costs of using black-box Large Language Models (LLMs) motivate the need for efficient prompt compression techniques to reduce input token usage while maintaining performance.

Method: 1) Create comprehensive LLM-as-a-compressor benchmark with 25 models; 2) Use Textgrad-based meta-prompt optimization to improve gpt-4.1-mini; 3) Post-train Qwen3-4B with supervised fine-tuning and Group Relative Policy Optimization to create Cmprsr model; 4) Evaluate on MeetingBank, LongBench, and GSM8k datasets.
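
At deployment time, the compress-then-answer pipeline reduces to a sketch like the following; the compressor prompt and the CR-adherence check are illustrative, and `small_llm`/`large_llm` are assumed text-in/text-out callables.

```python
# Sketch of LLM-as-a-compressor: a small model compresses the prompt to a
# requested rate before it is sent to the large (expensive) model.

def compress(small_llm, text: str, rate: float) -> str:
    target = int(len(text.split()) * rate)
    return small_llm(
        f"Rewrite the following text in about {target} words, keeping all "
        f"information needed to answer questions about it:\n\n{text}")

def compressed_query(small_llm, large_llm, context: str, question: str,
                     rate: float = 0.25) -> str:
    short = compress(small_llm, context, rate)
    achieved = len(short.split()) / max(1, len(context.split()))
    print(f"requested CR={rate:.2f}, achieved CR={achieved:.2f}")  # CR adherence
    return large_llm(f"{short}\n\nQuestion: {question}")
```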

Result: Cmprsr demonstrates superiority over extractive and vanilla abstractive compression across all compression rates on lengthy inputs and short prompts, shows strong generalizability across varying input lengths and domains, and closely follows requested compression rates for fine cost-quality trade-off control.

Conclusion: The proposed LLM prompt compression paradigm effectively reduces costs while maintaining performance, with Cmprsr emerging as a highly effective specialized compression model that offers practical control over compression rates and generalizes well across different input types.

Abstract: Motivated by the high costs of using black-box Large Language Models (LLMs), we introduce a novel prompt compression paradigm, under which we use smaller LLMs to compress inputs for the larger ones. We present the first comprehensive LLM-as-a-compressor benchmark spanning 25 open- and closed-source models, which reveals significant disparity in models’ compression ability in terms of (i) preserving semantically important information and (ii) following the user-provided compression rate (CR). We further improve the performance of gpt-4.1-mini, the best overall vanilla compressor, with Textgrad-based compression meta-prompt optimization. We also identify the most promising open-source vanilla LLM, Qwen3-4B, and post-train it with a combination of supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO), pursuing the dual objective of CR adherence and maximizing downstream task performance. We call the resulting model Cmprsr and demonstrate its superiority over both extractive and vanilla abstractive compression across the entire range of compression rates on lengthy inputs from MeetingBank and LongBench as well as short prompts from GSM8k. The latter highlights Cmprsr’s generalizability across varying input lengths and domains. Moreover, Cmprsr closely follows the requested compression rate, offering fine control over the cost-quality trade-off.

[87] Liars’ Bench: Evaluating Lie Detectors for Language Models

Kieron Kretschmar, Walter Laurito, Sharan Maiya, Samuel Marks

Main category: cs.CL

TL;DR: LIARS’ BENCH is a comprehensive testbed with 72,863 examples of lies and honest responses from four LLMs across seven datasets, designed to evaluate lie detection techniques for diverse types of lies that existing methods systematically fail to detect.

DetailsMotivation: Existing lie detection techniques for LLMs are validated in narrow settings that don't capture the diverse lies LLMs can generate, creating a need for a more comprehensive evaluation framework.

Method: Created LIARS’ BENCH testbed with 72,863 examples of lies and honest responses from four open-weight models across seven datasets, capturing different types of lies varying along two dimensions: the model’s reason for lying and the object of belief targeted by the lie.

Result: Existing black- and white-box lie detection techniques systematically fail to identify certain types of lies, especially in settings where determining whether the model lied from the transcript alone is impossible.

Conclusion: LIARS’ BENCH reveals limitations in prior lie detection techniques and provides a practical testbed for guiding progress in detecting LLM lies across diverse scenarios.

Abstract: Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS’ BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model’s reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS’ BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it’s not possible to determine whether the model lied from the transcript alone. Overall, LIARS’ BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.

[88] A Lightweight Approach to Detection of AI-Generated Texts Using Stylometric Features

Sergey K. Aityan, William Claster, Karthik Sai Emani, Sohni Rais, Thy Tran

Main category: cs.CL

TL;DR: NEULIF is a lightweight AI-generated text detector using stylometric/readability features with CNN/RF classifiers, achieving 97% accuracy while being orders of magnitude smaller than transformer-based methods.

DetailsMotivation: Existing AI-generated text detection methods rely on computationally expensive transformer models or ensembles with limited generalization. Lightweight alternatives have significantly lower accuracy, creating a need for efficient yet accurate detection solutions.

Method: Texts are decomposed into stylometric and readability features, then classified using compact Convolutional Neural Network (CNN) or Random Forest (RF) models. The approach focuses on structural insights rather than complex deep learning architectures.
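
To make the feature-based pipeline concrete, here is a toy version with four stylometric/readability features and a Random Forest; the paper's actual feature set is richer, so these features are only representative of the family:

```python
import re
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def stylometric_features(text: str) -> np.ndarray:
    """A few illustrative stylometric/readability features."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(len(words), 1)
    n_sents = max(len(sentences), 1)
    avg_word_len = sum(len(w) for w in words) / n_words
    avg_sent_len = n_words / n_sents
    type_token_ratio = len({w.lower() for w in words}) / n_words
    comma_rate = text.count(",") / n_words
    return np.array([avg_word_len, avg_sent_len, type_token_ratio, comma_rate])

# texts/labels would come from a corpus such as the Kaggle AI-vs-Human set.
texts = ["The quick brown fox jumps over the lazy dog.",
         "In conclusion, the aforementioned considerations demonstrate this."]
labels = [0, 1]  # 0 = human, 1 = AI-generated (toy example)
X = np.stack([stylometric_features(t) for t in texts])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
```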

Result: Achieved 97% accuracy (~0.95 F1) with CNN and 95% accuracy (~0.94 F1) with RF on Kaggle AI vs. Human corpus. Models are extremely small (CNN: ~25 MB, RF: ~10.6 MB) with ROC-AUC scores of 99.5% and 95% respectively.

Conclusion: Simplicity guided by structural insights can rival complex approaches in AI-generated content detection. The lightweight models enable efficient deployment on standard CPU devices while maintaining high accuracy, with potential for broader applications across languages and domains.

Abstract: A growing number of AI-generated texts raise serious concerns. Most existing approaches to AI-generated text detection rely on fine-tuning large transformer models or building ensembles, which are computationally expensive and often provide limited generalization across domains. Existing lightweight alternatives achieve significantly lower accuracy on large datasets. We introduce NEULIF, a lightweight approach that achieves the best performance in the lightweight detector class, does not require extensive computational power, and provides high detection accuracy. In our approach, a text is first decomposed into stylometric and readability features, which are then used for classification by a compact Convolutional Neural Network (CNN) or Random Forest (RF). Evaluated on the Kaggle AI vs. Human corpus, our models achieve 97% accuracy (~ 0.95 F1) for the CNN and 95% accuracy (~ 0.94 F1) for the Random Forest, demonstrating high precision and recall, with ROC-AUC scores of 99.5% and 95%, respectively. The CNN (~ 25 MB) and Random Forest (~ 10.6 MB) models are orders of magnitude smaller than transformer-based ensembles and can be run efficiently on standard CPU devices, without sacrificing accuracy. This study also highlights the potential of such models for broader applications across languages, domains, and streaming contexts, showing that simplicity, when guided by structural insights, can rival complexity in AI-generated content detection.

[89] KBQA-R1: Reinforcing Large Language Models for Knowledge Base Question Answering

Xin Sun, Zhongqi Chen, Xing Zheng, Qiang Liu, Shu Wu, Bowen Song, Zilei Wang, Weiqiang Wang, Liang Wang

Main category: cs.CL

TL;DR: KBQA-R1: A reinforcement learning framework for KBQA that shifts from text imitation to interaction optimization, achieving SOTA performance by learning to navigate knowledge bases through execution feedback.

DetailsMotivation: Current LLM-based KBQA approaches suffer from two main failures: (1) generating hallucinated queries without verifying schema existence, and (2) rigid template-based reasoning that mimics synthesized traces without true comprehension. There's a need to move beyond text imitation to interaction optimization.

Method: KBQA-R1 treats KBQA as a multi-turn decision process using reinforcement learning. The framework learns to navigate knowledge bases using action lists, employing Group Relative Policy Optimization (GRPO) to refine strategies based on execution feedback. Also introduces Referenced Rejection Sampling (RRS) for data synthesis to align reasoning traces with ground-truth action sequences.
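
The group-relative advantage at the heart of GRPO is simple to state; the sketch below normalizes each rollout's reward against its group, with a hypothetical binary execution reward (the paper's actual reward design is based on knowledge-base execution feedback):

```python
import numpy as np

def grpo_advantages(rewards: list[float]) -> np.ndarray:
    """GRPO advantage: (r - group mean) / group std, computed per question
    over a group of sampled action sequences."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Eight rollouts for one question; 1.0 = the generated query executed
# and returned the gold answer, 0.0 otherwise (hypothetical reward).
print(grpo_advantages([1, 0, 0, 1, 1, 0, 0, 0]))
```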

Result: Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate state-of-the-art performance. The framework effectively grounds LLM reasoning in verifiable execution.

Conclusion: KBQA-R1 successfully addresses the limitations of current approaches by shifting from text imitation to interaction optimization via reinforcement learning, enabling more robust and verifiable KBQA through execution-grounded reasoning.

Abstract: Knowledge Base Question Answering (KBQA) challenges models to bridge the gap between natural language and strict knowledge graph schemas by generating executable logical forms. While Large Language Models (LLMs) have advanced this field, current approaches often struggle with a dichotomy of failure: they either generate hallucinated queries without verifying schema existence or exhibit rigid, template-based reasoning that mimics synthesized traces without true comprehension of the environment. To address these limitations, we present KBQA-R1, a framework that shifts the paradigm from text imitation to interaction optimization via Reinforcement Learning. Treating KBQA as a multi-turn decision process, our model learns to navigate the knowledge base using a list of actions, leveraging Group Relative Policy Optimization (GRPO) to refine its strategies based on concrete execution feedback rather than static supervision. Furthermore, we introduce Referenced Rejection Sampling (RRS), a data synthesis method that resolves cold-start challenges by strictly aligning reasoning traces with ground-truth action sequences. Extensive experiments on WebQSP, GrailQA, and GraphQuestions demonstrate that KBQA-R1 achieves state-of-the-art performance, effectively grounding LLM reasoning in verifiable execution.

[90] Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks

Xinhe Wang, Jin Huang, Xingjian Zhang, Tianhao Wang, Jiaqi W. Ma

Main category: cs.CL

TL;DR: ARC-style reasoning benchmarks may overstate machine reasoning deficiencies - the performance gap stems more from visual perception limitations than inductive reasoning shortcomings.

DetailsMotivation: To challenge the common interpretation that poor performance on ARC-style benchmarks indicates deficiencies in machine reasoning, and instead investigate whether visual perception limitations are the primary bottleneck.

Method: Introduce a two-stage experimental pipeline that separates perception and reasoning: 1) Perception stage: convert each image independently to natural-language descriptions, 2) Reasoning stage: induce and apply rules using these descriptions. Compare this against standard end-to-end evaluation across Mini-ARC, ACRE, and Bongard-LOGO datasets.
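
A compact sketch of the two-stage protocol; `vlm_describe` and `llm_reason` are hypothetical stand-ins for model calls:

```python
def vlm_describe(image) -> str:
    """Stand-in for a VLM captioning call (hypothetical)."""
    return f"<description of {image}>"

def llm_reason(instruction: str, content) -> str:
    """Stand-in for a text-only reasoning call (hypothetical)."""
    return f"<answer to: {instruction} | {content}>"

def two_stage_solve(train_pairs, test_input):
    # Perception stage: every image is described independently, so no
    # cross-image inductive signal can leak into the descriptions.
    described = [(vlm_describe(x), vlm_describe(y)) for x, y in train_pairs]
    test_desc = vlm_describe(test_input)
    # Reasoning stage: a text-only model induces the rule from the
    # descriptions, then applies it to the test description.
    rule = llm_reason("Induce the transformation rule.", described)
    return llm_reason(f"Apply this rule: {rule}", test_desc)
```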

Result: Perception capability is the dominant factor in the performance gap. Manual inspection shows ~80% of model failures stem from perception errors rather than reasoning failures.

Conclusion: ARC-style benchmarks conflate perceptual and reasoning challenges, potentially overstating machine reasoning deficiencies. Evaluation protocols should disentangle perception from reasoning when assessing machine intelligence progress.

Abstract: Reasoning benchmarks such as the Abstraction and Reasoning Corpus (ARC) and ARC-AGI are widely used to assess progress in artificial intelligence and are often interpreted as probes of core, so-called “fluid” reasoning abilities. Despite their apparent simplicity for humans, these tasks remain challenging for frontier vision-language models (VLMs), a gap commonly attributed to deficiencies in machine reasoning. We challenge this interpretation and hypothesize that the gap arises primarily from limitations in visual perception rather than from shortcomings in inductive reasoning. To verify this hypothesis, we introduce a two-stage experimental pipeline that explicitly separates perception and reasoning. In the perception stage, each image is independently converted into a natural-language description, while in the reasoning stage a model induces and applies rules using these descriptions. This design prevents leakage of cross-image inductive signals and isolates reasoning from perception bottlenecks. Across three ARC-style datasets, Mini-ARC, ACRE, and Bongard-LOGO, we show that perception capability is the dominant factor underlying the observed performance gap by comparing the two-stage pipeline against standard end-to-end one-stage evaluation. Manual inspection of reasoning traces in the VLM outputs further reveals that approximately 80 percent of model failures stem from perception errors. Together, these results demonstrate that ARC-style benchmarks conflate perceptual and reasoning challenges and that observed performance gaps may overstate deficiencies in machine reasoning. Our findings underscore the need for evaluation protocols that disentangle perception from reasoning when assessing progress in machine intelligence.

[91] K-EXAONE Technical Report

Eunbi Choi, Kibong Choi, Seokhee Hong, Junwon Hwang, Hyojin Jeon, Hyunjik Jo, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Haeju Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Heuiyeen Yeen, Hwan Chang, Stanley Jungkyu Choi, Yejin Choi, Jiwon Ham, Kijeong Jeon, Geunyeong Jeong, Gerrard Jeongwon Jo, Yonghwan Jo, Jiyeon Jung, Naeun Kang, Dohoon Kim, Euisoon Kim, Hayeon Kim, Hyosang Kim, Hyunseo Kim, Jieun Kim, Minu Kim, Myoungshin Kim, Unsol Kim, Youchul Kim, YoungJin Kim, Chaeeun Lee, Chaeyoon Lee, Changhun Lee, Dahm Lee, Edward Hwayoung Lee, Honglak Lee, Jinsang Lee, Jiyoung Lee, Sangeun Lee, Seungwon Lim, Solji Lim, Woohyung Lim, Chanwoo Moon, Jaewoo Park, Jinho Park, Yongmin Park, Hyerin Seo, Wooseok Seo, Yongwoo Song, Sejong Yang, Sihoon Yang, Chang En Yea, Sihyuk Yi, Chansik Yoon, Dongkeun Yoon, Sangyeon Yoon, Hyeongu Yun

Main category: cs.CL

TL;DR: K-EXAONE is a 236B parameter multilingual MoE model with 23B active parameters, supporting 6 languages and 256K context window, performing comparably to similar-sized open models.

DetailsMotivation: To develop a powerful proprietary AI foundation model for industrial and research applications that advances AI for a better life, with strong multilingual capabilities.

Method: Built on Mixture-of-Experts architecture with 236B total parameters (23B activated during inference), supporting 256K-token context window and six languages: Korean, English, Spanish, German, Japanese, and Vietnamese.

Result: Demonstrates performance comparable to open-weight models of similar size across comprehensive benchmarks spanning reasoning, agentic, general, Korean, and multilingual abilities.

Conclusion: K-EXAONE is positioned as a powerful proprietary AI foundation model suitable for a wide range of industrial and research applications, advancing AI for a better life.

Abstract: This technical report presents K-EXAONE, a large-scale multilingual language model developed by LG AI Research. K-EXAONE is built on a Mixture-of-Experts architecture with 236B total parameters, activating 23B parameters during inference. It supports a 256K-token context window and covers six languages: Korean, English, Spanish, German, Japanese, and Vietnamese. We evaluate K-EXAONE on a comprehensive benchmark suite spanning reasoning, agentic, general, Korean, and multilingual abilities. Across these evaluations, K-EXAONE demonstrates performance comparable to open-weight models of similar size. K-EXAONE, designed to advance AI for a better life, is positioned as a powerful proprietary AI foundation model for a wide range of industrial and research applications.

[92] Stable-RAG: Mitigating Retrieval-Permutation-Induced Hallucinations in Retrieval-Augmented Generation

Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Zhiming Zheng

Main category: cs.CL

TL;DR: Stable-RAG addresses LLM sensitivity to document order in retrieval-augmented generation by using permutation sensitivity estimation and cluster-based decoding to produce consistent, accurate answers.

DetailsMotivation: Current RAG systems show unexpected sensitivity to the order of retrieved documents, even when the gold document is included and fixed in position. While existing robust RAG methods focus on low-quality retrieval and positional bias, they don't address this permutation sensitivity problem.

Method: Stable-RAG runs the generator under multiple retrieval orders, clusters the hidden states, and decodes from cluster-center representations that capture dominant reasoning patterns. It then aligns hallucinated outputs toward correct answers using these reasoning results.
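
A simplified rendering of the clustering step, assuming we collect one hidden state per retrieval order and decode from the center of the dominant cluster; the paper's actual clustering and alignment details may differ:

```python
import numpy as np
from sklearn.cluster import KMeans

def dominant_center(hidden_states: np.ndarray, n_clusters: int = 2) -> np.ndarray:
    """hidden_states: (n_orders, d), one generator state per document order.
    Returns the center of the largest cluster as the dominant reasoning
    pattern to decode from."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(hidden_states)
    labels, counts = np.unique(km.labels_, return_counts=True)
    return km.cluster_centers_[labels[counts.argmax()]]

# Toy usage: 6 permutations of the Top-3 retrieved documents, hidden size 8.
states = np.random.default_rng(0).normal(size=(6, 8))
center = dominant_center(states)
```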

Result: Experiments on three QA datasets show Stable-RAG significantly improves answer accuracy, reasoning consistency, and robust generalization across datasets, retrievers, and input lengths compared to baselines.

Conclusion: Permutation sensitivity is a critical but overlooked issue in RAG systems, and Stable-RAG provides an effective solution by leveraging permutation sensitivity estimation to produce stable, accurate outputs across different document orders.

Abstract: Retrieval-Augmented Generation (RAG) has become a key paradigm for reducing factual hallucinations in large language models (LLMs), yet little is known about how the order of retrieved documents affects model behavior. We empirically show that under Top-5 retrieval with the gold document included, LLM answers vary substantially across permutations of the retrieved set, even when the gold document is fixed in the first position. This reveals a previously underexplored sensitivity to retrieval permutations. Although robust RAG methods primarily focus on enhancing LLM robustness to low-quality retrieval and mitigating positional bias to distribute attention fairly over long contexts, neither approach directly addresses permutation sensitivity. In this paper, we propose Stable-RAG, which exploits permutation sensitivity estimation to mitigate permutation-induced hallucinations. Stable-RAG runs the generator under multiple retrieval orders, clusters hidden states, and decodes from a cluster-center representation that captures the dominant reasoning pattern. It then uses these reasoning results to align hallucinated outputs toward the correct answer, encouraging the model to produce consistent and accurate predictions across document permutations. Experiments on three QA datasets show that Stable-RAG significantly improves answer accuracy, reasoning consistency and robust generalization across datasets, retrievers, and input lengths compared with baselines.

[93] e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou

Main category: cs.CL

TL;DR: E5-omni is a lightweight explicit alignment method that adapts pretrained vision-language models into robust omni-modal embedding models by addressing modality-dependent similarity scales, imbalanced negative hardness, and cross-modal statistical mismatches.

DetailsMotivation: Modern systems need omni-modal embeddings for heterogeneous modalities, but current approaches relying on implicit alignment from pretrained VLMs suffer from three issues: inconsistent similarity scales across modalities, ineffective in-batch negatives due to imbalanced hardness distribution, and mismatched statistical properties across modalities that destabilize rankings.

Method: Three components: (1) modality-aware temperature calibration to align similarity scales, (2) controllable negative curriculum with debiasing to focus on confusing negatives while reducing false negative impact, and (3) batch whitening with covariance regularization to match cross-modal geometry in embedding space.
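
Component (3) is the easiest to sketch; below is a minimal batch-whitening transform (zero mean, identity covariance) in PyTorch, without the paper's covariance regularization:

```python
import torch

def batch_whiten(x: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Whiten a batch of embeddings so first- and second-order statistics
    match across modalities: subtract the mean, then decorrelate via the
    inverse square root of the covariance."""
    x = x - x.mean(dim=0, keepdim=True)
    cov = (x.T @ x) / (x.shape[0] - 1)
    eigvals, eigvecs = torch.linalg.eigh(cov + eps * torch.eye(x.shape[1]))
    w = eigvecs @ torch.diag(eigvals.clamp_min(eps).rsqrt()) @ eigvecs.T
    return x @ w

emb = torch.randn(64, 16)                  # e.g., a mixed-modality batch
white = batch_whiten(emb)
print(torch.cov(white.T).diagonal()[:4])   # approximately all ones
```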

Result: Experiments on MMEB-V2 and AudioCaps show consistent improvements over strong bi-modal and omni-modal baselines. The recipe also transfers well to other VLM backbones.

Conclusion: E5-omni provides an effective explicit alignment approach that addresses key limitations of implicit alignment in omni-modal embeddings, achieving better performance and releasing the model checkpoint publicly.

Abstract: Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.

[94] All That Glisters Is Not Gold: A Benchmark for Reference-Free Counterfactual Financial Misinformation Detection

Yuechen Jiang, Zhiwei Liu, Yupeng Cao, Yueru He, Ziyang Xu, Chen Xu, Zhiyang Deng, Prayag Tiwari, Xi Chen, Alejandro Lopez-Lira, Jimin Huang, Junichi Tsujii, Sophia Ananiadou

Main category: cs.CL

TL;DR: RFC Bench is a benchmark for evaluating LLMs on financial misinformation detection using realistic news paragraphs, featuring two tasks: reference-free detection and comparison-based diagnosis with paired inputs.

DetailsMotivation: Financial misinformation detection is challenging because meaning emerges from dispersed cues in realistic news contexts. Current benchmarks don't adequately capture this contextual complexity at the paragraph level.

Method: Created RFC Bench benchmark with two complementary tasks: 1) Reference-free misinformation detection (single input), and 2) Comparison-based diagnosis using paired original-perturbed inputs to provide comparative context.

Result: Models perform substantially better with comparative context (paired inputs) than in reference-free settings. Reference-free detection shows significant weaknesses: unstable predictions and elevated invalid outputs, indicating models struggle to maintain coherent belief states without external grounding.

Conclusion: RFC Bench highlights a critical gap in current models’ reference-free reasoning capabilities and provides a structured testbed for advancing more reliable financial misinformation detection in real-world settings.

Abstract: We introduce RFC Bench, a benchmark for evaluating large language models on financial misinformation under realistic news. RFC Bench operates at the paragraph level and captures the contextual complexity of financial news, where meaning emerges from dispersed cues. The benchmark defines two complementary tasks: reference-free misinformation detection and comparison-based diagnosis using paired original-perturbed inputs. Experiments reveal a consistent pattern: performance is substantially stronger when comparative context is available, while reference-free settings expose significant weaknesses, including unstable predictions and elevated invalid outputs. These results indicate that current models struggle to maintain coherent belief states without external grounding. By highlighting this gap, RFC Bench provides a structured testbed for studying reference-free reasoning and advancing more reliable financial misinformation detection in real-world settings.

[95] Interpreting Transformers Through Attention Head Intervention

Mason Kadem, Rong Zheng

Main category: cs.CL

TL;DR: The paper traces the evolution of attention head intervention as a key method for causal interpretability of transformers, marking a paradigm shift from visualization to intervention for validating mechanistic hypotheses.

DetailsMotivation: Understanding neural mechanisms is crucial for (1) accountability and control in high-stakes domains, (2) studying digital brains and emergence of cognition, and (3) discovering new knowledge when AI systems outperform humans.

Method: Attention head intervention - a causal interpretability method for transformers that moves beyond visualization to directly intervene and validate mechanistic hypotheses.

Result: Head intervention studies revealed robust empirical findings while also highlighting limitations that complicate interpretation of neural mechanisms.

Conclusion: Attention head intervention represents a paradigm shift from observing correlations to causally validating mechanistic hypotheses, though limitations remain that complicate interpretation.

Abstract: Neural networks are growing more capable on their own, but we do not understand their neural mechanisms. Understanding these mechanisms’ decision-making processes, or mechanistic interpretability, enables (1) accountability and control in high-stakes domains, (2) the study of digital brains and the emergence of cognition, and (3) discovery of new knowledge when AI systems outperform humans. This paper traces how attention head intervention emerged as a key method for causal interpretability of transformers. The evolution from visualization to intervention represents a paradigm shift from observing correlations to causally validating mechanistic hypotheses through direct intervention. Head intervention studies revealed robust empirical findings while also highlighting limitations that complicate interpretation.

[96] Differential syntactic and semantic encoding in LLMs

Santiago Acevedo, Alessandro Laio, Marco Baroni

Main category: cs.CL

TL;DR: Researchers analyze how syntax and semantics are encoded in DeepSeek-V3’s inner layers, finding both are partially linearly encoded and can be decoupled through centroid subtraction.

DetailsMotivation: To understand how syntactic and semantic information is encoded in the internal representations of large language models, specifically examining whether these linguistic features are linearly separable and differentially encoded.

Method: Analyze DeepSeek-V3 by averaging hidden representations of sentences with shared syntactic structure or meaning to create “centroids,” then subtract these centroids from sentence vectors to measure impact on similarity with syntactically/semantically matched sentences.
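
The centroid-subtraction probe is easy to reproduce in miniature; here random vectors with a shared offset stand in for hidden states of sentences sharing syntax, and subtracting the centroid removes most of the shared signal:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
# 20 "sentences" sharing a syntactic signal, modeled as a common offset.
syntax_group = rng.normal(size=(20, 32)) + rng.normal(size=32)
c_syn = syntax_group.mean(axis=0)          # the syntactic "centroid"

s1, s2 = syntax_group[0], syntax_group[1]
print(cosine(s1, s2))                      # high: shared signal present
print(cosine(s1 - c_syn, s2 - c_syn))      # near zero once it is removed
```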

Result: Syntactic and semantic centroids capture significant information; subtracting them strongly affects similarity with matched sentences, suggesting linear encoding. Cross-layer encoding profiles differ, and the two signals can be partially decoupled.

Conclusion: Syntax and semantics are at least partially linearly encoded in LLM representations, with differential encoding patterns across layers, indicating separable linguistic information processing.

Abstract: We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic “centroids” from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.

[97] Learning from Mistakes: Negative Reasoning Samples Enhance Out-of-Domain Generalization

Xueyun Tian, Minghua Ma, Bingbing Xu, Nuoyan Lyu, Wei Li, Heng Dong, Zheng Chu, Yuanzhuo Wang, Huawei Shen

Main category: cs.CL

TL;DR: Incorporating negative reasoning trajectories (incorrect final answers) into supervised fine-tuning improves out-of-domain generalization over positive-only training, with a proposed adaptive loss weighting method (GLOW) that leverages training dynamics.

DetailsMotivation: Standard SFT on CoT trajectories only uses positive examples (correct final answers), discarding negative trajectories. This wastes supervision and causes overfitting, limiting OOD generalization. Negative trajectories often contain valid intermediate reasoning despite wrong final answers.

Method: 1) Analyze effects of including negative trajectories in SFT; 2) Identify 22 recurring patterns in negative chains; 3) Propose GLOW (Gain-based LOss Weighting) - adaptive sample-aware scheme that rescales per-sample loss based on inter-epoch progress to exploit distinctive training dynamics.
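
The summary does not spell out GLOW's exact weighting rule, so the following is only an illustrative guess at a gain-based scheme: samples whose loss improved more between epochs receive larger weight:

```python
import torch

def glow_like_weights(loss_prev: torch.Tensor, loss_now: torch.Tensor,
                      tau: float = 1.0) -> torch.Tensor:
    """Per-sample weights from inter-epoch progress ("gain"); softmax keeps
    them positive and rescaling keeps the mean weight at 1."""
    gain = loss_prev - loss_now
    return torch.softmax(gain / tau, dim=0) * gain.numel()

loss_prev = torch.tensor([2.0, 1.5, 3.0, 2.5])
loss_now = torch.tensor([1.0, 1.4, 1.5, 2.4])
w = glow_like_weights(loss_prev, loss_now)
weighted_loss = (w * loss_now).mean()      # replaces the unweighted mean
```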

Result: Negative trajectories yield substantial OOD generalization gains over positive-only training. They moderate loss descent to mitigate overfitting and boost policy entropy by 35.67% during inference. GLOW achieves 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosts MMLU from 72.82% to 76.47% as RL initialization.

Conclusion: Incorporating negative reasoning trajectories in SFT is beneficial for OOD generalization. Negative chains serve dual roles: mitigating overfitting during training and facilitating exploration during inference. GLOW effectively leverages unfiltered trajectories through adaptive loss weighting.

Abstract: Supervised fine-tuning (SFT) on chain-of-thought (CoT) trajectory demonstrations is a common approach for enabling reasoning in large language models. Standard practices typically only retain trajectories with correct final answers (positives) while ignoring the rest (negatives). We argue that this paradigm discards substantial supervision and exacerbates overfitting, limiting out-of-domain (OOD) generalization. Specifically, we surprisingly find that incorporating negative trajectories into SFT yields substantial OOD generalization gains over positive-only training, as these trajectories often retain valid intermediate reasoning despite incorrect final answers. To understand this effect in depth, we systematically analyze data, training dynamics, and inference behavior, identifying 22 recurring patterns in negative chains that serve a dual role: they moderate loss descent to mitigate overfitting during training and boost policy entropy by 35.67% during inference to facilitate exploration. Motivated by these observations, we further propose Gain-based LOss Weighting (GLOW), an adaptive, sample-aware scheme that exploits such distinctive training dynamics by rescaling per-sample loss based on inter-epoch progress. Empirically, GLOW efficiently leverages unfiltered trajectories, yielding a 5.51% OOD gain over positive-only SFT on Qwen2.5-7B and boosting MMLU from 72.82% to 76.47% as an RL initialization.

[98] Reverse-engineering NLI: A study of the meta-inferential properties of Natural Language Inference

Rasmus Blanck, Bill Noble, Stergios Chatzikyriakidis

Main category: cs.CL

TL;DR: The paper analyzes the logical properties of Natural Language Inference (NLI) tasks, examining three possible interpretations of NLI labels and evaluating model consistency on meta-inferential properties using SNLI data.

DetailsMotivation: NLI is important for evaluating language models' natural language understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding what kind of inference NLI actually captures is crucial for properly interpreting model performance on this benchmark task.

Method: The authors formulate three possible readings of the NLI label set and perform comprehensive analysis of their meta-inferential properties. They use SNLI dataset items with shared premises and items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency.

Result: The analysis provides insights into which reading of the logical relations is actually encoded by the SNLI dataset, revealing how models trained on SNLI handle different interpretations of inference relationships.

Conclusion: The study clarifies the logical foundations of NLI tasks, helping researchers better understand what NLI actually measures and how to properly interpret model performance on this important benchmark for natural language understanding.

Abstract: Natural Language Inference (NLI) has been an important task for evaluating language models for Natural Language Understanding, but the logical properties of the task are poorly understood and often mischaracterized. Understanding the notion of inference captured by NLI is key to interpreting model performance on the task. In this paper we formulate three possible readings of the NLI label set and perform a comprehensive analysis of the meta-inferential properties they entail. Focusing on the SNLI dataset, we exploit (1) NLI items with shared premises and (2) items generated by LLMs to evaluate models trained on SNLI for meta-inferential consistency and derive insights into which reading of the logical relations is encoded by the dataset.

cs.CV

[99] Sketch&Patch++: Efficient Structure-Aware 3D Gaussian Representation

Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi

Main category: cs.CV

TL;DR: A hybrid Gaussian representation for 3D scenes that separates Sketch Gaussians (high-frequency edges/contours) from Patch Gaussians (low-frequency smooth regions), enabling progressive streaming and efficient compression.

DetailsMotivation: Traditional 3D Gaussian Splatting (3DGS) representations lack semantic structure, making them inefficient for streaming and compression. The authors observe that Gaussians naturally exhibit different roles analogous to artistic techniques - some capture fine details like sketches while others cover broader areas like brush strokes.

Method: Proposes a hierarchical adaptive categorization framework using multi-criteria density-based clustering and adaptive quality-driven refinement to separate Gaussians into Sketch Gaussians (high-frequency features) and Patch Gaussians (low-frequency regions), eliminating dependency on external 3D line primitives.
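
As a toy proxy for the categorization, one can split Gaussians by anisotropy, since elongated Gaussians tend to trace edges and near-isotropic ones fill smooth regions; the paper's actual criterion is a multi-criteria density-based clustering, not this single threshold:

```python
import numpy as np

def split_gaussians(scales: np.ndarray, aniso_thresh: float = 4.0) -> np.ndarray:
    """scales: (N, 3) per-axis extents of each 3D Gaussian. Returns a mask
    that routes strongly elongated Gaussians to the 'sketch' layer."""
    anisotropy = scales.max(axis=1) / (scales.min(axis=1) + 1e-8)
    return anisotropy > aniso_thresh

scales = np.abs(np.random.default_rng(2).normal(size=(1000, 3)))
sketch_mask = split_gaussians(scales)
print(sketch_mask.mean())   # fraction streamed first as the structural skeleton
```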

Result: Achieves up to 1.74 dB PSNR improvement, 6.7% SSIM improvement, and 41.4% LPIPS improvement at equivalent model sizes compared to uniform pruning baselines. For indoor scenes, maintains visual quality with only 0.5% of original model size.

Conclusion: The structure-aware hybrid Gaussian representation enables efficient storage, adaptive streaming, and high-fidelity rendering across bandwidth-constrained networks and resource-limited devices, outperforming traditional uniform approaches.

Abstract: We observe that Gaussians exhibit distinct roles and characteristics analogous to traditional artistic techniques – like how artists first sketch outlines before filling in broader areas with color, some Gaussians capture high-frequency features such as edges and contours, while others represent broader, smoother regions analogous to brush strokes that add volume and depth. Based on this observation, we propose a hybrid representation that categorizes Gaussians into (i) Sketch Gaussians, which represent high-frequency, boundary-defining features, and (ii) Patch Gaussians, which cover low-frequency, smooth regions. This semantic separation naturally enables layered progressive streaming, where the compact Sketch Gaussians establish the structural skeleton before Patch Gaussians incrementally refine volumetric detail. In this work, we extend our previous method to arbitrary 3D scenes by proposing a novel hierarchical adaptive categorization framework that operates directly on the 3DGS representation. Our approach employs multi-criteria density-based clustering, combined with adaptive quality-driven refinement. This method eliminates dependency on external 3D line primitives while ensuring optimal parametric encoding effectiveness. Our comprehensive evaluation across diverse scenes, including both man-made and natural environments, demonstrates that our method achieves up to 1.74 dB improvement in PSNR, 6.7% in SSIM, and 41.4% in LPIPS at equivalent model sizes compared to uniform pruning baselines. For indoor scenes, our method can maintain visual quality with only 0.5% of the original model size. This structure-aware representation enables efficient storage, adaptive streaming, and rendering of high-fidelity 3D content across bandwidth-constrained networks and resource-limited devices.

[100] Bi-Orthogonal Factor Decomposition for Vision Transformers

Fenil R. Doshi, Thomas Fel, Talia Konkle, George Alvarez

Main category: cs.CV

TL;DR: BFD is a new analytical framework that disentangles attention mechanisms in Vision Transformers, revealing that content interactions dominate, attention heads specialize, and DINOv2 excels at holistic shape processing through balanced position-content coupling.

DetailsMotivation: Current attention maps show where attention focuses but don't reveal whether tokens exchange positional information, content information, or both. There's a lack of principled understanding of what information attention mechanisms actually exchange between tokens in Vision Transformers.

Method: Bi-orthogonal Factor Decomposition (BFD): 1) ANOVA-based decomposition disentangles token activations into orthogonal positional and content factors; 2) SVD of query-key interaction matrix QK^T exposes bi-orthogonal modes showing how these factors mediate communication.
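
A toy rendering of the two stages on a synthetic activation tensor; the estimators and the energy readout below are simplifications, with W_q and W_k standing in for one head's projections:

```python
import numpy as np

rng = np.random.default_rng(7)
# Toy activations: (n_images, n_positions, d_model).
A = rng.normal(size=(50, 16, 32))

# Stage 1: two-way ANOVA-style split into orthogonal main effects.
grand = A.mean(axis=(0, 1))                   # (d,)
position_factor = A.mean(axis=0) - grand      # (n_positions, d)
content_factor = A.mean(axis=1) - grand       # (n_images, d)

# Stage 2: SVD of the query-key interaction matrix W_q W_k^T exposes
# bi-orthogonal modes.
Wq = rng.normal(size=(32, 8))
Wk = rng.normal(size=(32, 8))
U, S, Vt = np.linalg.svd(Wq @ Wk.T)

# Example readout: energy the top modes assign to the position factor
# (one ingredient of the head-type analysis described above).
pp_energy = np.linalg.norm(position_factor @ U[:, :4]) ** 2
```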

Result: Three key findings: 1) Attention operates primarily through content (content-content interactions dominate, followed by content-position coupling); 2) Attention heads specialize into content-content, content-position, and position-position operators; 3) DINOv2’s superior holistic shape processing comes from intermediate layers preserving positional structure while enriching semantic content.

Conclusion: BFD provides a principled framework to understand how tokens interact through attention and which informational factors (positional vs. semantic) mediate their communication, offering practical insights into Vision Transformer mechanisms and explaining DINOv2’s superior performance.

Abstract: Self-attention is the central computational primitive of Vision Transformers, yet we lack a principled understanding of what information attention mechanisms exchange between tokens. Attention maps describe where weight mass concentrates; they do not reveal whether queries and keys trade position, content, or both. We introduce Bi-orthogonal Factor Decomposition (BFD), a two-stage analytical framework: first, an ANOVA-based decomposition statistically disentangles token activations into orthogonal positional and content factors; second, SVD of the query-key interaction matrix QK^T exposes bi-orthogonal modes that reveal how these factors mediate communication. After validating proper isolation of position and content, we apply BFD to state-of-the-art vision models and uncover three phenomena. (i) Attention operates primarily through content. Content-content interactions dominate attention energy, followed by content-position coupling. DINOv2 allocates more energy to content-position than supervised models and distributes computation across a richer mode spectrum. (ii) Attention mechanisms exhibit specialization: heads differentiate into content-content, content-position, and position-position operators, while singular modes within heads show analogous specialization. (iii) DINOv2’s superior holistic shape processing emerges from intermediate layers that simultaneously preserve positional structure while contextually enriching semantic content. Overall, BFD exposes how tokens interact through attention and which informational factors, positional or semantic, mediate their communication, yielding practical insights into vision transformer mechanisms.

[101] Coding the Visual World: From Image to Simulation Using Vision Language Models

Sagi Eppel

Main category: cs.CV

TL;DR: VLMs can understand and simulate complex systems from images using code generation, showing strong high-level understanding but limited fine-detail perception.

DetailsMotivation: To explore whether Vision Language Models (VLMs) can construct mental models of real-world systems depicted in images, similar to human understanding, by testing their ability to recognize, describe, and simulate these systems through code generation.

Method: Im2Sim methodology: VLMs are given natural images of real-world systems and tasked with describing the system and writing code that simulates and generates it. The generated code is executed to produce synthetic images, which are compared against the original. Tested on various complex emergent systems including physical systems (waves, lights, clouds), vegetation, cities, materials, and geological formations.

Result: Leading VLMs (GPT, Gemini) demonstrate capacity to understand and model complex, multi-component systems across multiple abstraction layers and diverse domains. However, they exhibit limited ability to replicate fine details and low-level pattern arrangements. Reveals an asymmetry: VLMs combine high-level, deep visual understanding with limited perception of fine details.

Conclusion: VLMs show promising ability to construct representative models of systems in images, demonstrating high-level understanding of complex systems, but current limitations in fine-detail perception suggest areas for future improvement in visual understanding capabilities.

Abstract: The ability to construct mental models of the world is a central aspect of understanding. Similarly, visual understanding can be viewed as the ability to construct a representative model of the system depicted in an image. This work explores the capacity of Vision Language Models (VLMs) to recognize and simulate the systems and mechanisms depicted in images using the Im2Sim methodology. The VLM is given a natural image of a real-world system (e.g., cities, clouds, vegetation) and is tasked with describing the system and writing code that simulates and generates it. This generative code is then executed to produce a synthetic image, which is compared against the original. This approach is tested on various complex emergent systems, ranging from physical systems (waves, lights, clouds) to vegetation, cities, materials, and geological formations. Through analysis of the models and images generated by the VLMs, we examine their understanding of the systems in images. The results show that leading VLMs (GPT, Gemini) demonstrate the capacity to understand and model complex, multi-component systems across multiple layers of abstraction and a wide range of domains. At the same time, the VLMs exhibit limited ability to replicate fine details and low-level arrangements of patterns in the image. These findings reveal an interesting asymmetry: VLMs combine high-level, deep visual understanding of images with limited perception of fine details.

[102] STResNet & STYOLO : A New Family of Compact Classification and Object Detection Models for MCUs

Sudhakar Sah, Ravish Kumar

Main category: cs.CV

TL;DR: The paper introduces STResNet for image classification and STYOLO for object detection, two lightweight neural network families optimized for accuracy, efficiency, and memory footprint on resource-constrained edge devices.

DetailsMotivation: Existing lightweight neural networks still trade accuracy for latency, limiting their applicability on microcontroller and neural processing unit based devices. There's a need for models that maintain accuracy while being efficient on resource-constrained platforms.

Method: Proposed two model families: STResNet series (Nano to Tiny variants) for image classification and STYOLO series (Micro and Milli variants) for object detection. Both are jointly optimized for accuracy, efficiency, and memory footprint.

Result: STResNetMilli achieves 70.0% Top-1 accuracy on ImageNet with only 3M parameters, outperforming MobileNetV1 and ShuffleNetV2. STYOLOMicro and STYOLOMilli achieve 30.5% and 33.6% mAP on MS COCO respectively, surpassing YOLOv5n and YOLOX Nano in both accuracy and efficiency.

Conclusion: The proposed STResNet and STYOLO families demonstrate superior performance-efficiency trade-offs for edge deployment, offering competitive accuracy within strict parameter budgets while being optimized for resource-constrained hardware platforms.

Abstract: Recent advancements in lightweight neural networks have significantly improved the efficiency of deploying deep learning models on edge hardware. However, most existing architectures still trade accuracy for latency, which limits their applicability on microcontroller and neural processing unit based devices. In this work, we introduce two new model families, STResNet for image classification and STYOLO for object detection, jointly optimized for accuracy, efficiency, and memory footprint on resource-constrained platforms. The proposed STResNet series, ranging from Nano to Tiny variants, achieves competitive ImageNet-1K accuracy within a four-million-parameter budget. Specifically, STResNetMilli attains 70.0 percent Top-1 accuracy with only three million parameters, outperforming MobileNetV1 and ShuffleNetV2 at comparable computational complexity. For object detection, STYOLOMicro and STYOLOMilli achieve 30.5 percent and 33.6 percent mean average precision, respectively, on the MS COCO dataset, surpassing YOLOv5n and YOLOX Nano in both accuracy and efficiency. Furthermore, when STResNetMilli is used as a backbone with the Ultralytics training environment.

[103] MOSAIC-GS: Monocular Scene Reconstruction via Advanced Initialization for Complex Dynamic Environments

Svitlana Morkva, Maximum Wilder-Smith, Michael Oechsle, Alessio Tonioni, Marco Hutter, Vaishakh Patil

Main category: cs.CV

TL;DR: MOSAIC-GS is a fast, explicit method for reconstructing dynamic 3D scenes from monocular videos using Gaussian Splatting, leveraging multiple geometric cues and efficient motion encoding.

DetailsMotivation: Monocular dynamic scene reconstruction is challenging due to insufficient multiview constraints and ambiguity in recovering geometry and temporal coherence from appearance alone.

Method: Uses multiple geometric cues (depth, optical flow, segmentation, point tracking) with rigidity constraints to estimate preliminary 3D dynamics. Decomposes scene into static/dynamic components, with dynamic Gaussians using time-dependent Poly-Fourier curves for efficient motion encoding.
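
The Poly-Fourier trajectory is the most codeable piece; assuming the name means a polynomial plus a truncated Fourier series per coordinate (the exact basis is not given in this summary), one Gaussian's position could be evaluated as:

```python
import torch

def poly_fourier_position(t: torch.Tensor, poly: torch.Tensor,
                          amp_sin: torch.Tensor, amp_cos: torch.Tensor) -> torch.Tensor:
    """Evaluate x(t) = sum_k poly[k] * t^k
                     + sum_m (a_m sin(2*pi*m*t) + b_m cos(2*pi*m*t))
    per coordinate. Shapes: poly (K, 3), amp_sin/amp_cos (M, 3); t in [0, 1].
    Parameterization assumed from the name; the actual basis may differ."""
    K, M = poly.shape[0], amp_sin.shape[0]
    powers = torch.stack([t ** k for k in range(K)])       # (K,)
    freqs = 2 * torch.pi * torch.arange(1, M + 1) * t      # (M,)
    return powers @ poly + torch.sin(freqs) @ amp_sin + torch.cos(freqs) @ amp_cos

t = torch.tensor(0.3)   # normalized time of the queried frame
pos = poly_fourier_position(t, torch.randn(3, 3), torch.randn(2, 3), torch.randn(2, 3))
```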

Result: Achieves substantially faster optimization and rendering compared to existing methods while maintaining reconstruction quality on par with state-of-the-art approaches on standard benchmarks.

Conclusion: MOSAIC-GS provides an efficient, high-fidelity solution for monocular dynamic scene reconstruction by leveraging geometric cues and parameter-efficient motion representation.

Abstract: We present MOSAIC-GS, a novel, fully explicit, and computationally efficient approach for high-fidelity dynamic scene reconstruction from monocular videos using Gaussian Splatting. Monocular reconstruction is inherently ill-posed due to the lack of sufficient multiview constraints, making accurate recovery of object geometry and temporal coherence particularly challenging. To address this, we leverage multiple geometric cues, such as depth, optical flow, dynamic object segmentation, and point tracking. Combined with rigidity-based motion constraints, these cues allow us to estimate preliminary 3D scene dynamics during an initialization stage. Recovering scene dynamics prior to the photometric optimization reduces reliance on motion inference from visual appearance alone, which is often ambiguous in monocular settings. To enable compact representations, fast training, and real-time rendering while supporting non-rigid deformations, the scene is decomposed into static and dynamic components. Each Gaussian in the dynamic part of the scene is assigned a trajectory represented as time-dependent Poly-Fourier curve for parameter-efficient motion encoding. We demonstrate that MOSAIC-GS achieves substantially faster optimization and rendering compared to existing methods, while maintaining reconstruction quality on par with state-of-the-art approaches across standard monocular dynamic scene benchmarks.

[104] Subject-driven Video Generation via Disentangled Identity and Motion

Daneul Kim, Jingxu Zhang, Wonjoon Jin, Sunghyun Cho, Qi Dai, Jaesik Park, Chong Luo

Main category: cs.CV

TL;DR: Zero-shot subject-driven video customization without tuning by decoupling subject learning from temporal dynamics using image datasets and unannotated videos.

DetailsMotivation: Traditional video customization methods require large annotated video datasets which are computationally expensive and need extensive annotation. The paper aims to create a more efficient approach that avoids these limitations.

Method: Factorizes video customization into: 1) identity injection using image customization datasets, and 2) temporal modeling preservation with unannotated videos via image-to-video training. Uses random image token dropping with randomized initialization to prevent copy-paste issues, and stochastic switching during joint optimization to mitigate catastrophic forgetting.
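
The token-dropping trick is straightforward to sketch; the drop probability and schedule here are hypothetical, as the summary gives no specifics:

```python
import torch

def drop_image_tokens(tokens: torch.Tensor, drop_prob: float = 0.3) -> torch.Tensor:
    """Randomly drop reference-image tokens during image-to-video fine-tuning
    so the model cannot simply copy-paste the subject into every frame."""
    keep = torch.rand(tokens.shape[0]) > drop_prob
    return tokens[keep]

ref_tokens = torch.randn(256, 768)   # e.g., 256 image tokens of width 768
kept = drop_image_tokens(ref_tokens)
```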

Result: Achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings, demonstrating framework effectiveness.

Conclusion: Proposed method successfully enables subject-driven video customization without additional tuning by decoupling subject-specific learning from temporal dynamics, using only image datasets and minimal unannotated videos, offering an efficient alternative to traditional approaches.

Abstract: We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in zero-shot fashion without additional tuning. Traditional tuning-free methods for video customization often rely on large, annotated video datasets, which are computationally expensive and require extensive annotation. In contrast to the previous approach, we introduce the use of an image customization dataset directly for training video customization models, factorizing video customization into two parts: (1) identity injection through an image customization dataset and (2) temporal modeling preservation with a small set of unannotated videos through the image-to-video training method. Additionally, we employ random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste issue. To further enhance learning, we introduce stochastic switching during joint optimization of subject-specific and temporal features, mitigating catastrophic forgetting. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings, demonstrating the effectiveness of our framework.

[105] Ensemble of radiomics and ConvNeXt for breast cancer diagnosis

Jorge Alberto Garza-Abdala, Gerardo Alejandro Fumagal-González, Beatriz A. Bosques-Palomo, Mario Alexis Monsivais Molina, Daly Avedano, Servando Cardona-Huerta, José Gerardo Tamez-Pena

Main category: cs.CV

TL;DR: Ensemble methods combining deep learning and radiomics outperform individual approaches for breast cancer detection in mammograms, achieving AUC of 0.87.

DetailsMotivation: Early breast cancer diagnosis improves survival rates, and radiomics/deep learning show potential for assisting radiologists in detection from screening mammograms.

Method: Used two datasets (RSNA 2023 with 11,913 patients and Mexican TecSalud with 19,400 patients). Trained the ConvNeXtV1-small DL model on RSNA and validated it on TecSalud, while radiomics models used TecSalud with leave-one-year-out validation. An ensemble method combined and calibrated the predictions of both approaches using a consistent methodology.
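
Score-level fusion with per-model calibration can be sketched as below; the paper's exact calibration procedure is not specified in this summary, so logistic calibration and equal weighting are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibrate(scores: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Map raw model scores to calibrated probabilities (Platt scaling)."""
    return LogisticRegression().fit(scores.reshape(-1, 1), labels)

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, 500)
dl_scores = labels + rng.normal(0, 0.8, 500)    # toy DL outputs
rad_scores = labels + rng.normal(0, 1.0, 500)   # toy radiomics outputs
p_dl = calibrate(dl_scores, labels).predict_proba(dl_scores.reshape(-1, 1))[:, 1]
p_rad = calibrate(rad_scores, labels).predict_proba(rad_scores.reshape(-1, 1))[:, 1]
ensemble = 0.5 * (p_dl + p_rad)                 # averaged calibrated scores
```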

Result: Ensemble approach achieved highest AUC of 0.87, outperforming ConvNeXtV1-small (AUC 0.83) and radiomics alone (AUC 0.80).

Conclusion: Ensemble methods combining DL and radiomics predictions significantly enhance breast cancer diagnosis from mammograms.

Abstract: Early diagnosis of breast cancer is crucial for improving survival rates. Radiomics and deep learning (DL) have shown significant potential in assisting radiologists with early cancer detection. This paper aims to critically assess the performance of radiomics, DL, and ensemble techniques in detecting cancer from screening mammograms. Two independent datasets were used: the RSNA 2023 Breast Cancer Detection Challenge (11,913 patients) and a Mexican cohort from the TecSalud dataset (19,400 patients). The ConvNeXtV1-small DL model was trained on the RSNA dataset and validated on the TecSalud dataset, while radiomics models were developed using the TecSalud dataset and validated with a leave-one-year-out approach. The ensemble method consistently combined and calibrated predictions using the same methodology. Results showed that the ensemble approach achieved the highest area under the curve (AUC) of 0.87, compared to 0.83 for ConvNeXtV1-small and 0.80 for radiomics. In conclusion, ensemble methods combining DL and radiomics predictions significantly enhance breast cancer diagnosis from mammograms.

[106] Dense 3D Displacement Estimation for Landslide Monitoring via Fusion of TLS Point Clouds and Embedded RGB Images

Zhaoyi Wang, Jemil Avers Butt, Shengyu Huang, Tomislav Medic, Andreas Wieser

Main category: cs.CV

TL;DR: A hierarchical coarse-to-fine method that fuses 3D point clouds and RGB images to estimate dense 3D landslide displacement vectors, achieving high spatial coverage and sub-resolution accuracy.

DetailsMotivation: Existing point cloud-based landslide monitoring methods are limited by using either geometric or radiometric information alone, resulting in sparse or non-3D displacement estimates that don't provide comprehensive monitoring coverage.

Method: Hierarchical partitioning-based coarse-to-fine approach integrating 3D point clouds and co-registered RGB images. Uses patch-level matches combining 3D geometry and 2D image features, refined via geometric consistency checks, followed by rigid transformation estimation per match.
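
The per-match rigid transformation step is classically solved with the Kabsch algorithm; a minimal version (one standard realization, not necessarily the authors' implementation) is:

```python
import numpy as np

def rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares rotation R and translation t with dst_i ~ R src_i + t
    (Kabsch algorithm), applied here to matched patch points from two epochs."""
    c_src, c_dst = src.mean(0), dst.mean(0)
    H = (src - c_src).T @ (dst - c_dst)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = c_dst - R @ c_src
    return R, t

src = np.random.default_rng(4).normal(size=(100, 3))
R, t = rigid_transform(src, src + np.array([0.10, 0.00, -0.05]))
print(t)   # recovers the simulated patch displacement vector
```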

Result: Achieves high spatial coverage (79% and 97%) with displacement magnitude deviations of 0.15m/0.25m compared to external measurements and 0.07m/0.20m compared to manual references, all below mean scan resolutions. Outperforms state-of-the-art F2S3 in coverage while maintaining accuracy.

Conclusion: The method provides a practical, adaptable solution for TLS-based landslide monitoring that can be extended to other point cloud types and monitoring tasks, with publicly available data and code.

Abstract: Landslide monitoring is essential for understanding geohazards and mitigating associated risks. Existing point cloud-based methods, however, typically rely on either geometric or radiometric information and often yield sparse or non-3D displacement estimates. In this paper, we propose a hierarchical partitioning-based coarse-to-fine approach that integrates 3D point clouds and co-registered RGB images to estimate dense 3D displacement vector fields. Patch-level matches are constructed using both 3D geometry and 2D image features, refined via geometric consistency checks, and followed by rigid transformation estimation per match. Experimental results on two real-world landslide datasets demonstrate that the proposed method produces 3D displacement estimates with high spatial coverage (79% and 97%) and accuracy. Deviations in displacement magnitude with respect to external measurements (total station or GNSS observations) are 0.15 m and 0.25 m on the two datasets, respectively, and only 0.07 m and 0.20 m compared to manually derived references, all below the mean scan resolutions (0.08 m and 0.30 m). Compared with the state-of-the-art method F2S3, the proposed approach improves spatial coverage while maintaining comparable accuracy. The proposed approach offers a practical and adaptable solution for TLS-based landslide monitoring and is extensible to other types of point clouds and monitoring tasks. The example data and source code are publicly available at https://github.com/gseg-ethz/fusion4landslide.

[107] EdgeLDR: Quaternion Low-Displacement Rank Neural Networks for Edge-Efficient Deep Learning

Vladimir Frants, Sos Agaian, Karen Panetta

Main category: cs.CV

TL;DR: EdgeLDR combines quaternion neural networks with block-circulant structure for efficient edge deployment, using FFT-based computation to achieve significant compression with competitive accuracy.

DetailsMotivation: Deploying deep neural networks on edge devices is limited by memory traffic and compute costs. While quaternion networks improve parameter efficiency and structured matrices enable fast computation, existing approaches don't combine quaternion channel mixing with structured matrices effectively.

Method: Introduces EdgeLDR framework with quaternion block-circulant linear and convolutional layers that combine quaternion channel mixing with block-circulant parameter structure. Uses complex adjoint representation to enable FFT-based evaluation, with reference implementations comparing FFT-based computation against naive spatial-domain realization.
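
Quaternion and block structure aside, the core identity behind the reported speedups is that a circulant matrix-vector product is a circular convolution, computable in O(n log n) via the FFT; the paper's layers apply this blockwise through the complex adjoint representation. A minimal real-valued sketch of the identity, with a naive reference for comparison:

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """y = C x, where C is the circulant matrix with first column c.
    Equivalent to circular convolution of c and x, evaluated via FFT."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

def circulant_matvec_naive(c, x):
    """O(n^2) reference: build C explicitly (column k is c cyclically shifted by k)."""
    n = len(c)
    C = np.array([np.roll(c, k) for k in range(n)]).T
    return C @ x
```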

Result: FFT evaluation yields large empirical speedups over naive implementation and maintains stable latency as block size increases. EdgeLDR layers integrated into CNN and Transformer backbones show significant compression with competitive accuracy on 32x32 RGB classification (CIFAR-10/100, SVHN) and hyperspectral image classification tasks.

Conclusion: EdgeLDR provides a practical framework for efficient edge deployment by combining quaternion channel mixing with block-circulant structure, enabling computationally viable larger compression factors through FFT-based evaluation while maintaining competitive accuracy.

Abstract: Deploying deep neural networks on edge devices is often limited by the memory traffic and compute cost of dense linear operators. While quaternion neural networks improve parameter efficiency by coupling multiple channels through Hamilton products, they typically retain unstructured dense weights; conversely, structured matrices enable fast computation but are usually applied in the real domain. This paper introduces EdgeLDR, a practical framework for quaternion block-circulant linear and convolutional layers that combines quaternion channel mixing with block-circulant parameter structure and enables FFT-based evaluation through the complex adjoint representation. We present reference implementations of EdgeLDR layers and compare FFT-based computation against a naive spatial-domain realization of quaternion circulant products. FFT evaluation yields large empirical speedups over the naive implementation and keeps latency stable as block size increases, making larger compression factors computationally viable. We further integrate EdgeLDR layers into compact CNN and Transformer backbones and evaluate accuracy-compression trade-offs on 32x32 RGB classification (CIFAR-10/100, SVHN) and hyperspectral image classification (Houston 2013, Pavia University), reporting parameter counts and CPU/GPU latency. The results show that EdgeLDR layers provide significant compression with competitive accuracy.

[108] Multi-task Cross-modal Learning for Chest X-ray Image Retrieval

Zhaohui Liang, Sivaramakrishnan Rajaraman, Niccolo Marini, Zhiyun Xue, Sameer Antani

Main category: cs.CV

TL;DR: Fine-tuning BiomedCLIP with multi-task learning improves chest X-ray image-text retrieval performance for clinically relevant medical applications.

DetailsMotivation: CLIP and BiomedCLIP provide strong cross-modal embeddings but are not optimized for fine-grained medical retrieval tasks like retrieving clinically relevant radiology reports using chest X-ray image queries.

Method: Multi-task learning framework fine-tuning BiomedCLIP with a lightweight MLP projector head trained with composite loss: binary cross-entropy for normal/abnormal classification, supervised contrastive loss for intra-class consistency, and CLIP loss for cross-modal alignment.
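
A hedged sketch of the three-term composite loss in PyTorch; the loss weights, temperatures, and the exact supervised-contrastive formulation are assumptions, not the authors' settings:

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over matching image-report pairs in a batch."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / tau
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def supcon_loss(emb, labels, tau=0.1):
    """Supervised contrastive loss: pull same-class embeddings together."""
    emb = F.normalize(emb, dim=-1)
    sim = emb @ emb.t() / tau
    mask_pos = (labels[:, None] == labels[None, :]).float()
    mask_pos.fill_diagonal_(0)                                    # exclude self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()   # numerical stability
    exp = torch.exp(logits) * (1 - torch.eye(len(emb), device=emb.device))
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True))
    denom = mask_pos.sum(dim=1).clamp(min=1)
    return -(mask_pos * log_prob).sum(dim=1).div(denom).mean()

def composite_loss(logit_abn, y, img_emb, txt_emb, w=(1.0, 1.0, 1.0)):
    """BCE (normal/abnormal) + SupCon (intra-class) + CLIP (cross-modal)."""
    return (w[0] * F.binary_cross_entropy_with_logits(logit_abn, y.float())
            + w[1] * supcon_loss(img_emb, y)
            + w[2] * clip_loss(img_emb, txt_emb))
```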

Result: Fine-tuned model achieves more balanced and clinically meaningful performance across both image-to-text and text-to-image retrieval tasks compared to pretrained BiomedCLIP and general-purpose CLIP. t-SNE visualizations show clearer semantic clustering of normal/abnormal cases.

Conclusion: Domain-adaptive, multi-task learning enhances cross-modal retrieval in biomedical applications, demonstrating improved diagnostic sensitivity for chest X-ray image-text retrieval.

Abstract: CLIP and BiomedCLIP are examples of vision-language foundation models and offer strong cross-modal embeddings; however, they are not optimized for fine-grained medical retrieval tasks, such as retrieving clinically relevant radiology reports using chest X-ray (CXR) image queries. To address this shortcoming, we propose a multi-task learning framework to fine-tune BiomedCLIP and evaluate improvements to CXR image-text retrieval. Using BiomedCLIP as the backbone, we incorporate a lightweight MLP projector head trained with a multi-task composite loss function that includes: (1) a binary cross-entropy loss to distinguish normal from abnormal CXR studies, (2) a supervised contrastive loss to reinforce intra-class consistency, and (3) a CLIP loss to maintain cross-modal alignment. Experimental results demonstrate that the fine-tuned model achieves more balanced and clinically meaningful performance across both image-to-text and text-to-image retrieval tasks compared to the pretrained BiomedCLIP and general-purpose CLIP models. Furthermore, t-SNE visualizations reveal clearer semantic clustering of normal and abnormal cases, demonstrating the model’s enhanced diagnostic sensitivity. These findings highlight the value of domain-adaptive, multi-task learning for advancing cross-modal retrieval in biomedical applications.

[109] Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

Yuxiang Ji, Yong Wang, Ziyu Ma, Yiming Hu, Hailang Huang, Xuecai Hu, Guanhua Chen, Liaoni Wu, Xiangxiang Chu

Main category: cs.CV

TL;DR: This paper introduces a novel approach to image geolocalization by equipping models with “Thinking with Map” ability through an agent-in-the-map loop, using a two-stage optimization scheme (agentic RL + parallel test-time scaling) and introducing MAPBench for evaluation.

DetailsMotivation: Existing large vision-language models for image geolocalization overlook a crucial human strategy: using maps. While current approaches leverage world knowledge and reasoning capabilities, they fail to incorporate map-based reasoning, which is fundamental to human geolocalization.

Method: The method introduces “Thinking with Map” ability formulated as an agent-in-the-map loop. It uses a two-stage optimization: 1) Agentic reinforcement learning to strengthen agentic capabilities and improve sampling efficiency, and 2) Parallel test-time scaling that enables exploration of multiple candidate paths before final prediction. The approach is evaluated on MAPBench, a new benchmark of real-world images.
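
The summary does not describe how the parallel candidates are aggregated; one plausible sketch of parallel test-time scaling picks the medoid of K sampled (lat, lon) predictions under great-circle distance (`run_agent_in_the_map_loop` is a hypothetical sampler):

```python
import math

def haversine_km(a, b):
    """Great-circle distance between (lat, lon) pairs in degrees, returned in km."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def aggregate_predictions(candidates):
    """Medoid vote: the candidate closest on average to all the others."""
    return min(candidates,
               key=lambda c: sum(haversine_km(c, o) for o in candidates))

# candidates = [run_agent_in_the_map_loop(image) for _ in range(K)]  # hypothetical
# final_prediction = aggregate_predictions(candidates)
```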

Result: The method significantly outperforms existing open- and closed-source models on most metrics. Most notably, it improves Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode, representing a substantial performance gain.

Conclusion: Incorporating map-based reasoning through the agent-in-the-map loop with two-stage optimization (agentic RL + parallel TTS) provides a powerful approach to image geolocalization, demonstrating that map utilization is a critical capability that existing LVLM approaches have overlooked.

Abstract: The image geolocalization task aims to predict the location where an image was taken anywhere on Earth using visual clues. Existing large vision-language model (LVLM) approaches leverage world knowledge, chain-of-thought reasoning, and agentic capabilities, but overlook a common strategy used by humans: using maps. In this work, we first equip the model with the “Thinking with Map” ability and formulate it as an agent-in-the-map loop. We develop a two-stage optimization scheme for it, including agentic reinforcement learning (RL) followed by parallel test-time scaling (TTS). The RL stage strengthens the agentic capability of the model to improve sampling efficiency, and the parallel TTS enables the model to explore multiple candidate paths before making the final prediction, which is crucial for geolocalization. To evaluate our method on up-to-date and in-the-wild images, we further present MAPBench, a comprehensive geolocalization training and evaluation benchmark composed entirely of real-world images. Experimental results show that our method outperforms existing open- and closed-source models on most metrics, specifically improving Acc@500m from 8.0% to 22.1% compared to Gemini-3-Pro with Google Search/Map grounded mode.

[110] TAPM-Net: Trajectory-Aware Perturbation Modeling for Infrared Small Target Detection

Hongyang Xie, Hongyang He, Victor Sanchez

Main category: cs.CV

TL;DR: TAPM-Net: A trajectory-aware Mamba propagation network for infrared small target detection that models spatial diffusion of target-induced feature disturbances.

DetailsMotivation: Current CNN and ViT models lack mechanisms to trace how small targets trigger directional, layer-wise perturbations in feature space, which is essential for distinguishing signals from structured noise in infrared scenes with weak contrast and cluttered backgrounds.

Method: Proposes TAPM-Net with two novel components: 1) Perturbation-guided Path Module (PGM) that constructs perturbation energy fields from multi-level features and extracts gradient-following feature trajectories, and 2) Trajectory-Aware State Block (TASB), a Mamba-based state-space unit that models dynamic propagation along trajectories with velocity-constrained diffusion and semantically aligned feature fusion.
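
The PGM's gradient-following trajectory extraction can be illustrated as greedy hill climbing on a 2D perturbation-energy map; this is a simplified stand-in for the published module, with the energy map assumed to be a per-pixel feature-response magnitude:

```python
import numpy as np

def follow_energy_gradient(energy, start, max_steps=64):
    """Greedy gradient-following path on a 2D perturbation-energy map.
    energy: (H, W) array; start: (row, col) seed; returns a list of positions."""
    H, W = energy.shape
    path, pos = [start], start
    for _ in range(max_steps):
        r, c = pos
        # 8-connected neighbors that stay inside the map
        nbrs = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0) and 0 <= r + dr < H and 0 <= c + dc < W]
        best = max(nbrs, key=lambda p: energy[p])
        if energy[best] <= energy[pos]:  # local maximum reached; stop
            break
        path.append(best)
        pos = best
    return path
```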

Result: Achieves state-of-the-art performance on NUAA-SIRST and IRSTD-1K datasets for infrared small target detection.

Conclusion: TAPM-Net enables anisotropic, context-sensitive state transitions along spatial trajectories while maintaining global coherence at low computational cost, outperforming existing attention-based methods for infrared small target detection.

Abstract: Infrared small target detection (ISTD) remains a long-standing challenge due to weak signal contrast, limited spatial extent, and cluttered backgrounds. Despite performance improvements from convolutional neural networks (CNNs) and Vision Transformers (ViTs), current models lack a mechanism to trace how small targets trigger directional, layer-wise perturbations in the feature space, which is an essential cue for distinguishing signal from structured noise in infrared scenes. To address this limitation, we propose the Trajectory-Aware Mamba Propagation Network (TAPM-Net), which explicitly models the spatial diffusion behavior of target-induced feature disturbances. TAPM-Net is built upon two novel components: a Perturbation-guided Path Module (PGM) and a Trajectory-Aware State Block (TASB). The PGM constructs perturbation energy fields from multi-level features and extracts gradient-following feature trajectories that reflect the directionality of local responses. The resulting feature trajectories are fed into the TASB, a Mamba-based state-space unit that models dynamic propagation along each trajectory while incorporating velocity-constrained diffusion and semantically aligned feature fusion from word-level and sentence-level embeddings. Unlike existing attention-based methods, TAPM-Net enables anisotropic, context-sensitive state transitions along spatial trajectories while maintaining global coherence at low computational cost. Experiments on NUAA-SIRST and IRSTD-1K demonstrate that TAPM-Net achieves state-of-the-art performance in ISTD.

[111] ROAP: A Reading-Order and Attention-Prior Pipeline for Optimizing Layout Transformers in Key Information Extraction

Tingwei Xie, Jinxin He, Yonghong Song

Main category: cs.CV

TL;DR: ROAP is a lightweight pipeline that improves Layout Transformers for document understanding by explicitly modeling reading order and reducing visual noise interference.

DetailsMotivation: Multimodal Transformers for document understanding have two key limitations: lack of explicit reading order modeling and visual token interference that dilutes textual attention.

Method: ROAP uses Adaptive-XY-Gap tree to extract hierarchical reading sequences, integrates them via Reading-Order-Aware Relative Position Bias, and applies Textual-Token Sub-block Attention Prior to suppress visual noise.
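
The Adaptive-XY-Gap tree itself is not spelled out in the summary; as a reference point, a classic recursive XY-cut over word boxes, which AXG-Tree presumably adapts with adaptive rather than fixed gap thresholds, looks like this:

```python
def xy_cut(boxes, min_gap=10):
    """Recursive XY-cut over word boxes (x0, y0, x1, y1) -> reading order.
    Simplified stand-in for the paper's Adaptive-XY-Gap tree (fixed gap here)."""
    if len(boxes) <= 1:
        return list(boxes)

    def split(axis):
        lo, hi = axis, axis + 2                       # (x0, x1) or (y0, y1)
        order = sorted(boxes, key=lambda b: b[lo])
        groups, cur, reach = [], [order[0]], order[0][hi]
        for b in order[1:]:
            if b[lo] - reach > min_gap:               # whitespace gap found: cut here
                groups.append(cur); cur = [b]
            else:
                cur.append(b)
            reach = max(reach, b[hi])
        groups.append(cur)
        return groups

    for axis in (1, 0):                               # horizontal cuts first, then vertical
        groups = split(axis)
        if len(groups) > 1:
            return [b for g in groups for b in xy_cut(g, min_gap)]
    return sorted(boxes, key=lambda b: (b[1], b[0]))  # fallback: top-to-bottom, left-right
```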

Result: ROAP consistently improves performance of LayoutLMv3 and GeoLayoutLM on FUNSD and CORD benchmarks without altering pre-trained backbones.

Conclusion: Explicit reading order modeling and modality interference regulation are critical for robust document understanding, offering a scalable solution for complex layout analysis.

Abstract: The efficacy of Multimodal Transformers in visually-rich document understanding (VrDU) is critically constrained by two inherent limitations: the lack of explicit modeling for logical reading order and the interference of visual tokens that dilutes attention on textual semantics. To address these challenges, this paper presents ROAP, a lightweight and architecture-agnostic pipeline designed to optimize attention distributions in Layout Transformers without altering their pre-trained backbones. The proposed pipeline first employs an Adaptive-XY-Gap (AXG-Tree) to robustly extract hierarchical reading sequences from complex layouts. These sequences are then integrated into the attention mechanism via a Reading-Order-Aware Relative Position Bias (RO-RPB). Furthermore, a Textual-Token Sub-block Attention Prior (TT-Prior) is introduced to adaptively suppress visual noise and enhance fine-grained text-text interactions. Extensive experiments on the FUNSD and CORD benchmarks demonstrate that ROAP consistently improves the performance of representative backbones, including LayoutLMv3 and GeoLayoutLM. These findings confirm that explicitly modeling reading logic and regulating modality interference are critical for robust document understanding, offering a scalable solution for complex layout analysis. The implementation code will be released at https://github.com/KevinYuLei/ROAP.

[112] Multi-Image Super Resolution Framework for Detection and Analysis of Plant Roots

Shubham Agarwal, Ofek Nourian, Michael Sidorov, Sharon Chemweno, Ofer Hadar, Naftali Lazarovitch, Jhonathan E. Ephrath

Main category: cs.CV

TL;DR: A deep learning-based Multi-Image Super Resolution (MISR) framework for enhancing underground plant root imaging by leveraging multiple overlapping views to overcome occlusion, soil moisture, and low contrast challenges.

DetailsMotivation: Accurate imaging of plant root systems is critical for soil-plant interaction research, but conventional vision approaches fail in subterranean environments due to occlusion, varying soil moisture, and low contrast conditions.

Method: Proposes an underground imaging system capturing multiple overlapping views of roots, integrated with a deep learning-based MISR framework that leverages spatial redundancy across views to reconstruct high-resolution images. Uses a synthetic dataset simulating realistic underground imaging scenarios with environmental factors.

Result: Outperforms state-of-the-art super resolution baselines with a 2.3% reduction in BRISQUE (indicating improved image quality) while maintaining the same CLIP-IQA score. Enables enhanced phenotypic analysis and accurate estimation of root traits such as root hair count and root hair density.

Conclusion: The framework presents a promising direction for robust automatic underground plant root imaging and trait quantification for agricultural and ecological research, addressing persistent challenges in root system analysis.

Abstract: Understanding plant root systems is critical for advancing research in soil-plant interactions, nutrient uptake, and overall plant health. However, accurate imaging of roots in subterranean environments remains a persistent challenge due to adverse conditions such as occlusion, varying soil moisture, and inherently low contrast, which limit the effectiveness of conventional vision-based approaches. In this work, we propose a novel underground imaging system that captures multiple overlapping views of plant roots and integrates a deep learning-based Multi-Image Super Resolution (MISR) framework designed to enhance root visibility and detail. To train and evaluate our approach, we construct a synthetic dataset that simulates realistic underground imaging scenarios, incorporating key environmental factors that affect image quality. Our proposed MISR algorithm leverages spatial redundancy across views to reconstruct high-resolution images with improved structural fidelity and visual clarity. Quantitative evaluations show that our approach outperforms state-of-the-art super resolution baselines, achieving a 2.3 percent reduction in BRISQUE, indicating improved image quality with the same CLIP-IQA score, thereby enabling enhanced phenotypic analysis of root systems. This, in turn, facilitates accurate estimation of critical root traits, including root hair count and root hair density. The proposed framework presents a promising direction for robust automatic underground plant root imaging and trait quantification for agricultural and ecological research.

[113] Hippocampal Atrophy Patterns Across the Alzheimer’s Disease Spectrum: A Voxel-Based Morphometry Analysis

Trishna Niraula

Main category: cs.CV

TL;DR: Study finds significant hippocampal atrophy in Alzheimer’s disease using voxel-based morphometry, with moderate predictive value for MCI-to-AD conversion but no significant APOE4 genetic effects on hippocampal volume.

DetailsMotivation: To investigate gray matter loss patterns in Alzheimer's disease and mild cognitive impairment, particularly in medial temporal structures, and to examine the predictive value of hippocampal volume for disease progression and genetic influences.

Method: Used CAT12/SPM12 voxel-based morphometry on baseline T1-weighted MRI scans from 249 ADNI participants (90 CN, 129 MCI, 30 AD). Analyzed gray matter volume with general linear model, diagnostic group as primary predictor, age and total intracranial volume as covariates. Statistical maps thresholded at p < 0.001 (voxelwise) with FWE cluster-level correction (p < 0.05).

Result: Significant hippocampal atrophy in AD vs CN (Cohen’s d = 2.03) and AD vs MCI (Cohen’s d = 1.61). Hippocampal volume showed moderate predictive value for MCI-to-AD conversion (AUC = 0.66). Stratification by APOE4 status revealed no significant genetic effects on cross-sectional hippocampal volume.
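
The two headline statistics are standard and easy to reproduce; a short sketch of how each is computed, with the hippocampal-volume arrays hypothetical:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def cohens_d(x, y):
    """Effect size between two groups using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1)
                  + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# vol_cn, vol_ad: per-group hippocampal GM volumes (hypothetical arrays)
# d_ad_vs_cn = cohens_d(vol_cn, vol_ad)       # reported: 2.03

# converted: 1 if an MCI subject later converted to AD; smaller volume -> higher risk,
# so the negated volume serves as the risk score
# auc = roc_auc_score(converted, -vol_mci)    # reported: 0.66
```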

Conclusion: Medial temporal degeneration is a key feature of AD progression. Hippocampal volume has moderate predictive value for disease conversion, but APOE4 status does not significantly affect cross-sectional hippocampal volume measurements.

Abstract: Alzheimer’s disease (AD) and mild cognitive impairment (MCI) are associated with progressive gray matter loss, particularly in medial temporal structures. In this study, CAT12/SPM12 voxel-based morphometry was applied to baseline T1-weighted MRI scans from 249 ADNI participants (CN = 90, MCI = 129, AD = 30). Gray matter volume was analyzed using a general linear model, with the diagnostic group as primary predictor and age and total intracranial volume as covariates. Statistical maps were thresholded at p < 0.001 (voxelwise) and corrected for multiple comparisons at the cluster level using family-wise error (FWE) correction (p < 0.05). Significant hippocampal atrophy was observed in AD relative to CN and MCI (Cohen’s d = 2.03 and 1.61, respectively). Hippocampal volume demonstrated moderate predictive value for conversion from MCI to AD (AUC = 0.66). Stratification by APOE4 status did not reveal significant genetic effects on cross-sectional hippocampal volume. These results support medial temporal degeneration as a key feature of AD progression and provide insights into predictive biomarkers and genetic influences.

[114] MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding

Zizhong Li, Haopeng Zhang, Jiawei Zhang

Main category: cs.CV

TL;DR: MMViR introduces a multi-modal, multi-grained structured representation for long video understanding that segments videos at key turning points and creates three-level descriptions, achieving significant performance improvements and reduced latency.

DetailsMotivation: Current MLLMs struggle with long videos (minutes to hours) due to computational expense of direct encoding and redundancy/fragmentation from simple video-to-text conversion, necessitating a better approach for handling complex events, diverse scenes, and long-range dependencies.

Method: MMViR creates a multi-modal, multi-grained structured representation by identifying key turning points to segment videos and constructing three-level descriptions that couple global narratives with fine-grained visual details, supporting efficient query-based retrieval.
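
The turning-point detector is not published in this summary; a plausible minimal sketch flags boundaries where the cosine similarity between consecutive frame embeddings dips below a threshold, then converts boundaries to segments:

```python
import numpy as np

def find_turning_points(frame_embs, thresh=0.75):
    """Segment boundaries where adjacent frame embeddings diverge.
    frame_embs: (T, D) array of per-frame features; returns boundary indices."""
    e = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    sim = (e[:-1] * e[1:]).sum(axis=1)      # cosine similarity of consecutive frames
    return [i + 1 for i, s in enumerate(sim) if s < thresh]

def to_segments(boundaries, n_frames):
    """[(start, end), ...] per segment, ready for per-segment description."""
    cuts = [0] + boundaries + [n_frames]
    return list(zip(cuts[:-1], cuts[1:]))
```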

Result: Extensive evaluations across QA, summarization, and retrieval tasks show MMViR outperforms prior strongest methods with 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.

Conclusion: MMViR provides an effective structured representation approach for long video understanding that generalizes well across scenarios, balancing computational efficiency with comprehensive video analysis through multi-grained descriptions.

Abstract: Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Direct encoding of such videos is computationally too expensive, while simple video-to-text conversion often results in redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-grained structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across various scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the prior strongest method, achieving a 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.

[115] Prompt-Free SAM-Based Multi-Task Framework for Breast Ultrasound Lesion Segmentation and Classification

Samuel E. Johnny, Bernes L. Atabonfack, Israel Alagbe, Assane Gueye

Main category: cs.CV

TL;DR: Multi-task deep learning framework using SAM vision encoder features for joint breast ultrasound lesion segmentation and classification, achieving state-of-the-art performance on PRECISE 2025 dataset.

DetailsMotivation: Breast ultrasound imaging presents challenges for tumor analysis due to low contrast, speckle noise, and diverse lesion morphology, requiring improved methods for accurate segmentation and classification.

Method: Prompt-free adaptation of SAM vision encoder features with two decoding options: lightweight convolutional head or UNet-inspired decoder for segmentation, plus mask-guided attention for classification to focus on lesion-relevant features.
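
A minimal sketch of mask-guided pooling for the classification branch, assuming the predicted mask logits gate a spatial average of SAM encoder features; shapes and names are assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def mask_guided_pool(feats, mask_logits):
    """feats: (B, C, H, W) SAM encoder features; mask_logits: (B, 1, h, w) mask head output.
    Returns a (B, C) lesion-focused descriptor for the classification head."""
    attn = torch.sigmoid(mask_logits)
    attn = F.interpolate(attn, size=feats.shape[-2:], mode="bilinear", align_corners=False)
    weighted = (feats * attn).flatten(2).sum(dim=2)           # attend to lesion pixels
    return weighted / attn.flatten(2).sum(dim=2).clamp(min=1e-6)
```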

Result: Achieved Dice Similarity Coefficient of 0.887 and accuracy of 92.3% on PRECISE 2025 dataset, ranking among top entries on the challenge leaderboard.

Conclusion: SAM-based representations combined with segmentation-guided learning significantly improve both lesion delineation and diagnostic prediction in breast ultrasound imaging.

Abstract: Accurate tumor segmentation and classification in breast ultrasound (BUS) imaging remain challenging due to low contrast, speckle noise, and diverse lesion morphology. This study presents a multi-task deep learning framework that jointly performs lesion segmentation and diagnostic classification using embeddings from the Segment Anything Model (SAM) vision encoder. Unlike prompt-based SAM variants, our approach employs a prompt-free, fully supervised adaptation where high-dimensional SAM features are decoded through either a lightweight convolutional head or a UNet-inspired decoder for pixel-wise segmentation. The classification branch is enhanced via mask-guided attention, allowing the model to focus on lesion-relevant features while suppressing background artifacts. Experiments on the PRECISE 2025 breast ultrasound dataset, split per class into 80 percent training and 20 percent testing, show that the proposed method achieves a Dice Similarity Coefficient (DSC) of 0.887 and an accuracy of 92.3 percent, ranking among the top entries on the PRECISE challenge leaderboard. These results demonstrate that SAM-based representations, when coupled with segmentation-guided learning, significantly improve both lesion delineation and diagnostic prediction in breast ultrasound imaging.

[116] Enabling Stroke-Level Structural Analysis of Hieroglyphic Scripts without Language-Specific Priors

Fuwen Luo, Zihao Wan, Ziyue Wang, Yaluo Liu, Pau Tong Lin Xu, Xuanjia Qiao, Xiaolong Wang, Peng Li, Yang Liu

Main category: cs.CV

TL;DR: HieroSA is a framework that enables MLLMs to automatically extract stroke-level structures from character images without manual annotation, providing interpretable representations for hieroglyphic script analysis.

DetailsMotivation: Current LLMs and MLLMs fail to capture the structural composition of hieroglyphs and logographic writing systems, treating them either as textual tokens or raw pixels without understanding stroke-level logic. Existing structural analysis methods are script-specific and require labor-intensive manual work.

Method: HieroSA transforms character images into explicit, interpretable line-segment representations in normalized coordinate space. It automatically derives stroke-level structures from character bitmaps without handcrafted data, enabling cross-lingual generalization.

Result: Extensive experiments show HieroSA effectively captures character-internal structures and semantics without needing language-specific priors. The framework demonstrates potential as a graphematics analysis tool for deeper understanding of hieroglyphic scripts.

Conclusion: HieroSA provides a novel, generalizable framework for structural analysis of hieroglyphic writing systems, enabling MLLMs to understand stroke-level composition and offering interpretable representations for cross-lingual applications in graphematics research.

Abstract: Hieroglyphs, as logographic writing systems, encode rich semantic and cultural information within their internal structural composition. Yet, current advanced Large Language Models (LLMs) and Multimodal LLMs (MLLMs) usually remain structurally blind to this information. LLMs process characters as textual tokens, while MLLMs additionally view them as raw pixel grids. Both fall short to model the underlying logic of character strokes. Furthermore, existing structural analysis methods are often script-specific and labor-intensive. In this paper, we propose Hieroglyphic Stroke Analyzer (HieroSA), a novel and generalizable framework that enables MLLMs to automatically derive stroke-level structures from character bitmaps without handcrafted data. It transforms modern logographic and ancient hieroglyphs character images into explicit, interpretable line-segment representations in a normalized coordinate space, allowing for cross-lingual generalization. Extensive experiments demonstrate that HieroSA effectively captures character-internal structures and semantics, bypassing the need for language-specific priors. Experimental results highlight the potential of our work as a graphematics analysis tool for a deeper understanding of hieroglyphic scripts. View our code at https://github.com/THUNLP-MT/HieroSA.

[117] GaussianSwap: Animatable Video Face Swapping with 3D Gaussian Splatting

Xuan Cheng, Jiahao Rao, Chengyang Li, Wenhao Wang, Weilin Chen, Lvqing Yang

Main category: cs.CV

TL;DR: GaussianSwap: A 3D Gaussian Splatting-based video face swapping framework that creates animatable avatars from target videos while transferring identity from source images.

DetailsMotivation: Conventional video face swapping frameworks are limited to pixel-based representations that lack animation or interactive manipulation capabilities. There's a need for a paradigm shift from unstructured pixel generation to creating high-fidelity, controllable avatars with swapped faces.

Method: 1) Preprocess target video to extract FLAME parameters, camera poses, and segmentation masks. 2) Rig 3D Gaussian splats to the FLAME model across frames for dynamic facial control. 3) Use compound identity embedding from three state-of-the-art face recognition models for avatar finetuning to preserve identity. 4) Render face-swapped avatar on background frames to produce final video.
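
A hedged sketch of one way to build the compound identity embedding and an identity-preservation loss from it; the three encoders and the concatenation scheme are assumptions about how the embeddings might be combined:

```python
import torch
import torch.nn.functional as F

def compound_identity_embedding(face_img, encoders):
    """Concatenate L2-normalized embeddings from several face recognition models.
    encoders: list of three pretrained face-ID networks (hypothetical callables)."""
    embs = [F.normalize(enc(face_img), dim=-1) for enc in encoders]
    return F.normalize(torch.cat(embs, dim=-1), dim=-1)

def identity_loss(emb_swapped, emb_source):
    """Finetuning objective: pull the rendered avatar toward the source identity."""
    return 1 - F.cosine_similarity(emb_swapped, emb_source, dim=-1).mean()
```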

Result: GaussianSwap achieves superior identity preservation, visual clarity, and temporal consistency compared to conventional methods, while enabling previously unattainable interactive applications.

Conclusion: The framework represents a paradigm shift from pixel-based video generation to creating high-fidelity, animatable avatars with swapped faces, opening new possibilities for interactive manipulation and animation of swapped identities.

Abstract: We introduce GaussianSwap, a novel video face swapping framework that constructs a 3D Gaussian Splatting based face avatar from a target video while transferring identity from a source image to the avatar. Conventional video swapping frameworks are limited to generating facial representations in pixel-based formats. The resulting swapped faces exist merely as a set of unstructured pixels without any capacity for animation or interactive manipulation. Our work introduces a paradigm shift from conventional pixel-based video generation to the creation of high-fidelity avatars with swapped faces. The framework first preprocesses the target video to extract FLAME parameters, camera poses, and segmentation masks, and then rigs 3D Gaussian splats to the FLAME model across frames, enabling dynamic facial control. To ensure identity preservation, we propose a compound identity embedding constructed from three state-of-the-art face recognition models for avatar finetuning. Finally, we render the face-swapped avatar on the background frames to obtain the face-swapped video. Experimental results demonstrate that GaussianSwap achieves superior identity preservation, visual clarity, and temporal consistency, while enabling previously unattainable interactive applications.

[118] SAS-VPReID: A Scale-Adaptive Framework with Shape Priors for Video-based Person Re-Identification at Extreme Far Distances

Qiwei Yang, Pingping Zhang, Yuhao Wang, Zijing Gong

Main category: cs.CV

TL;DR: SAS-VPReID: A scale-adaptive framework with shape priors for video-based person re-identification at extreme far distances, achieving state-of-the-art performance on VReID-XFD benchmark.

DetailsMotivation: Video-based Person Re-ID at extreme far distances is challenging due to severe resolution degradation, drastic viewpoint variation, and inevitable appearance noise.

Method: Three complementary modules: 1) Memory-Enhanced Visual Backbone (MEVB) using CLIP vision encoder and multi-proxy memory, 2) Multi-Granularity Temporal Modeling (MGTM) for adaptive motion cue emphasis across scales, 3) Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics.

Result: Demonstrates effectiveness of each module and ranks first on the VReID-XFD challenge leaderboard.

Conclusion: The proposed SAS-VPReID framework effectively addresses extreme far-distance VPReID challenges and achieves state-of-the-art performance through scale adaptation and shape priors.

Abstract: Video-based Person Re-IDentification (VPReID) aims to retrieve the same person from videos captured by non-overlapping cameras. At extreme far distances, VPReID is highly challenging due to severe resolution degradation, drastic viewpoint variation, and inevitable appearance noise. To address these issues, we propose a Scale-Adaptive framework with Shape Priors for VPReID, named SAS-VPReID. The framework is built upon three complementary modules. First, we deploy a Memory-Enhanced Visual Backbone (MEVB) to extract discriminative feature representations, which leverages the CLIP vision encoder and multi-proxy memory. Second, we propose Multi-Granularity Temporal Modeling (MGTM) to construct sequences at multiple temporal granularities and adaptively emphasize motion cues across scales. Third, we incorporate Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics. With these modules, our framework obtains more discriminative feature representations. Experiments on the VReID-XFD benchmark demonstrate the effectiveness of each module, and our final framework ranks first on the VReID-XFD challenge leaderboard. The source code is available at https://github.com/YangQiWei3/SAS-VPReID.

[119] DIFF-MF: A Difference-Driven Channel-Spatial State Space Model for Multi-Modal Image Fusion

Yiming Sun, Zifan Ye, Qinghua Hu, Pengfei Zhu

Main category: cs.CV

TL;DR: DIFF-MF is a novel difference-driven channel-spatial state space model for multi-modal image fusion that addresses the trade-off between preserving infrared intensity and visible details by leveraging feature discrepancy maps to guide fusion across both channel and spatial dimensions.

DetailsMotivation: Existing state space model approaches for multi-modal image fusion tend to either over-prioritize infrared intensity at the cost of visible details, or preserve visible structure while diminishing thermal target salience. There's a need for a method that can effectively balance and integrate complementary information from multiple modalities.

Method: DIFF-MF uses feature discrepancy maps between modalities to guide feature extraction. It employs: 1) a channel-exchange module with cross-attention dual state space modeling for adaptive feature reweighting, and 2) a spatial-exchange module with cross-modal state space scanning for comprehensive spatial fusion. The approach maintains linear computational complexity while capturing global dependencies.
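
The dual state-space modules are beyond a short sketch, but the difference-driven channel-reweighting idea can be illustrated with a simple gate; this is an illustrative stand-in, not the paper's channel-exchange module:

```python
import torch
import torch.nn as nn

class DiffChannelGate(nn.Module):
    """Reweight each modality's channels by the inter-modal feature discrepancy."""
    def __init__(self, channels):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, f_ir, f_vis):
        diff = (f_ir - f_vis).abs()                  # (B, C, H, W) discrepancy map
        gate = self.mlp(diff.mean(dim=(2, 3)))       # (B, C) channel weights
        g = gate[:, :, None, None]
        # emphasize channels where the modalities disagree (complementary content)
        return f_ir * g + f_vis * (1 - g)
```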

Result: Experimental results on driving scenarios and low-altitude UAV datasets demonstrate that DIFF-MF outperforms existing approaches in both visual quality and quantitative evaluation metrics.

Conclusion: DIFF-MF effectively integrates complementary multi-modal features by addressing the limitations of existing state space models through a difference-driven approach that balances channel and spatial fusion, achieving superior performance in multi-modal image fusion tasks.

Abstract: Multi-modal image fusion aims to integrate complementary information from multiple source images to produce high-quality fused images with enriched content. Although existing approaches based on state space models have achieved satisfactory performance with high computational efficiency, they tend to either over-prioritize infrared intensity at the cost of visible details, or conversely, preserve visible structure while diminishing thermal target salience. To overcome these challenges, we propose DIFF-MF, a novel difference-driven channel-spatial state space model for multi-modal image fusion. Our approach leverages feature discrepancy maps between modalities to guide feature extraction, followed by a fusion process across both channel and spatial dimensions. In the channel dimension, a channel-exchange module enhances channel-wise interaction through cross-attention dual state space modeling, enabling adaptive feature reweighting. In the spatial dimension, a spatial-exchange module employs cross-modal state space scanning to achieve comprehensive spatial fusion. By efficiently capturing global dependencies while maintaining linear computational complexity, DIFF-MF effectively integrates complementary multi-modal features. Experimental results on driving-scenario and low-altitude UAV datasets demonstrate that our method outperforms existing approaches in both visual quality and quantitative evaluation.

[120] MoGen: A Unified Collaborative Framework for Controllable Multi-Object Image Generation

Yanfeng Li, Yue Sun, Keren Fu, Sio-Kei Im, Xiaoming Liu, Guangtao Zhai, Xiaohong Liu, Tao Tan

Main category: cs.CV

TL;DR: MoGen: A user-friendly multi-object image generation method that achieves precise semantic-region alignment and adaptive multi-modal control without rigid external constraints.

DetailsMotivation: Existing multi-object image generation methods struggle with precise alignment between image regions and language semantics, leading to inconsistent object quantities and attribute aliasing. Current approaches rely heavily on external control signals, making input formats rigid and incompatible with diverse user needs and resource conditions.

Method: Two key modules: 1) Regional Semantic Anchor (RSA) module that precisely anchors phrase units to corresponding image regions during generation, ensuring quantity consistency. 2) Adaptive Multi-modal Guidance (AMG) module that adaptively parses and integrates various multi-source control signals to formulate structured intent for selective constraints on layouts and attributes.

Result: MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control. It demonstrates superior accessibility and control flexibility compared to previous approaches.

Conclusion: MoGen provides a user-friendly solution for multi-object image generation that achieves precise semantic-region alignment and adaptive control without rigid external constraints, offering better accessibility and flexibility for diverse user needs.

Abstract: Existing multi-object image generation methods face difficulties in achieving precise alignment between localized image generation regions and their corresponding semantics based on language descriptions, frequently resulting in inconsistent object quantities and attribute aliasing. To mitigate this limitation, mainstream approaches typically rely on external control signals to explicitly constrain the spatial layout, local semantic and visual attributes of images. However, this strong dependency makes the input format rigid, rendering it incompatible with the heterogeneous resource conditions of users and diverse constraint requirements. To address these challenges, we propose MoGen, a user-friendly multi-object image generation method. First, we design a Regional Semantic Anchor (RSA) module that precisely anchors phrase units in language descriptions to their corresponding image regions during the generation process, enabling text-to-image generation that follows quantity specifications for multiple objects. Building upon this foundation, we further introduce an Adaptive Multi-modal Guidance (AMG) module, which adaptively parses and integrates various combinations of multi-source control signals to formulate corresponding structured intent. This intent subsequently guides selective constraints on scene layouts and object attributes, achieving dynamic fine-grained control. Experimental results demonstrate that MoGen significantly outperforms existing methods in generation quality, quantity consistency, and fine-grained control, while exhibiting superior accessibility and control flexibility. Code is available at: https://github.com/Tear-kitty/MoGen/tree/master.

[121] VIB-Probe: Detecting and Mitigating Hallucinations in Vision-Language Models via Variational Information Bottleneck

Feiran Zhang, Yixin Wu, Zhenghua Wang, Xiaohua Wang, Changze Lv, Xuanjing Huang, Xiaoqing Zheng

Main category: cs.CV

TL;DR: VIB-Probe: A novel framework using Variational Information Bottleneck theory to detect and mitigate hallucinations in Vision-Language Models by analyzing internal attention heads and filtering out semantic noise.

DetailsMotivation: Vision-Language Models suffer from hallucinations where generated text deviates from visual content. Existing detection methods rely on output logits or external tools, overlooking internal mechanisms. The authors hypothesize that specific attention heads carry signals for truthful generation, but directly probing them is challenging due to entangled visual-linguistic syntax and noise.

Method: Proposes VIB-Probe framework based on Variational Information Bottleneck theory. Extracts discriminative patterns across layers and attention heads while filtering semantic nuisances through information bottleneck principle. Uses gradients from VIB probe to identify attention heads with causal influence on hallucinations and introduces inference-time intervention strategy for mitigation.
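
The probe follows the standard VIB recipe; a minimal sketch assuming a Gaussian encoder over concatenated attention-head features and a two-class (truthful vs. hallucinated) readout, with dimensions and the trade-off weight chosen arbitrarily:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBProbe(nn.Module):
    """VIB probe: compress head features h into z ~ q(z|h), classify from z,
    and penalize KL(q(z|h) || N(0, I)) to filter out semantic nuisances."""
    def __init__(self, in_dim, z_dim=64):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * z_dim)  # outputs mean and log-variance
        self.cls = nn.Linear(z_dim, 2)

    def forward(self, h):
        mu, logvar = self.enc(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1 - logvar).sum(dim=-1)
        return self.cls(z), kl

def vib_loss(logits, kl, y, beta=1e-3):
    """Classification term plus beta-weighted information bottleneck penalty."""
    return F.cross_entropy(logits, y) + beta * kl.mean()
```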

Result: Extensive experiments across diverse benchmarks show VIB-Probe significantly outperforms existing baselines in both hallucination detection and mitigation settings.

Conclusion: VIB-Probe effectively addresses hallucination issues in VLMs by leveraging internal attention mechanisms through information bottleneck theory, offering both detection and mitigation capabilities with superior performance over existing methods.

Abstract: Vision-Language Models (VLMs) have demonstrated remarkable progress in multimodal tasks, but remain susceptible to hallucinations, where generated text deviates from the underlying visual content. Existing hallucination detection methods primarily rely on output logits or external verification tools, often overlooking their internal mechanisms. In this work, we investigate the outputs of internal attention heads, postulating that specific heads carry the primary signals for truthful generation. However, directly probing these high-dimensional states is challenging due to the entanglement of visual-linguistic syntax and noise. To address this, we propose VIB-Probe, a novel hallucination detection and mitigation framework leveraging the Variational Information Bottleneck (VIB) theory. Our method extracts discriminative patterns across layers and heads while filtering out semantic nuisances through the information bottleneck principle. Furthermore, by leveraging the gradients of our VIB probe, we identify attention heads with strong causal influence on hallucinations and introduce an inference-time intervention strategy for hallucination mitigation. Extensive experiments across diverse benchmarks demonstrate that VIB-Probe significantly outperforms existing baselines in both settings. Our code will be made publicly available.

[122] One Language-Free Foundation Model Is Enough for Universal Vision Anomaly Detection

Bin-Bin Gao, Chengjie Wang

Main category: cs.CV

TL;DR: UniADet is a simple, parameter-efficient framework for universal visual anomaly detection that decouples classification and segmentation tasks, requiring only 0.002M learnable parameters and outperforming state-of-the-art methods across 14 benchmarks.

DetailsMotivation: Current visual-language model approaches for universal anomaly detection suffer from complex prompt engineering, elaborate adaptation modules, and challenging training strategies, limiting their flexibility and generality. The authors aim to simplify the fundamental mechanism behind visual-language models for AD.

Method: The authors demonstrate that language encoders are unnecessary for universal AD and propose decoupling classification and segmentation tasks, as well as decoupling cross-level features. They learn independent weights for different tasks and hierarchical features, creating an embarrassingly simple framework with minimal learnable parameters.

Result: UniADet surpasses state-of-the-art zero-shot and few-shot methods by a large margin, and even outperforms full-shot AD methods for the first time across 14 real-world AD benchmarks covering industrial and medical domains.

Conclusion: UniADet provides a highly simple, parameter-efficient, general, and effective framework for universal visual anomaly detection that eliminates the need for complex prompt engineering and adaptation modules while achieving superior performance across diverse domains.

Abstract: Universal visual anomaly detection (AD) aims to identify anomaly images and segment anomaly regions in open and dynamic scenarios, following zero- and few-shot paradigms without any dataset-specific fine-tuning. Recent approaches have made significant progress through the widespread use of visual-language foundation models. However, current methods often struggle with complex prompt engineering, elaborate adaptation modules, and challenging training strategies, ultimately limiting their flexibility and generality. To address these issues, this paper rethinks the fundamental mechanism behind visual-language models for AD and presents an embarrassingly simple, general, and effective framework for Universal vision Anomaly Detection (UniADet). Specifically, we first find that the language encoder is used to derive decision weights for anomaly classification and segmentation, and then demonstrate that it is unnecessary for universal AD. Second, we propose an embarrassingly simple method to completely decouple classification and segmentation, and to decouple cross-level features, i.e., learning independent weights for different tasks and hierarchical features. UniADet is highly simple (learning only decoupled weights), parameter-efficient (only 0.002M learnable parameters), general (adapting to a variety of foundation models), and effective (surpassing state-of-the-art zero-/few-shot methods by a large margin and, for the first time, even full-shot AD methods) on 14 real-world AD benchmarks covering both industrial and medical domains. We will make the code and model of UniADet available at https://github.com/gaobb/UniADet.

[123] Semi-Supervised Facial Expression Recognition based on Dynamic Threshold and Negative Learning

Zhongpeng Cai, Jun Yu, Wei Xu, Tianyu Liu, Jianqing Sun, Jiaen Liang

Main category: cs.CV

TL;DR: Proposed semi-supervised facial expression recognition method using Dynamic Threshold Adjustment and Selective Negative Learning to leverage both labeled and unlabeled data effectively.

DetailsMotivation: Facial expression recognition is important for human-computer interaction, but labeled data is expensive to acquire. Need semi-supervised methods that can effectively use both labeled and unlabeled data.

Method: Combines local attention enhancement and random feature-map dropout during feature extraction, plus Dynamic Threshold Adjustment for the semi-supervised framework and Selective Negative Learning to utilize low-confidence unlabeled samples through complementary labels (see the sketch below).
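
Selective negative learning can be sketched with the standard complementary-label loss, -log(1 - p_k) for classes the model confidently rules out; the selection rule and thresholds below are placeholders, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def negative_learning_loss(logits, conf_threshold=0.95, neg_threshold=0.05):
    """For low-confidence unlabeled samples, pick classes the model is confident
    the sample is NOT (complementary labels) and push their probability down."""
    probs = F.softmax(logits, dim=-1)
    max_p, _ = probs.max(dim=-1)
    low_conf = max_p < conf_threshold            # high-confidence samples get pseudo-labels instead
    neg_mask = (probs < neg_threshold).float()   # complementary labels: "not this class"
    nl = -(neg_mask * torch.log(1 - probs + 1e-7)).sum(dim=-1)
    return (nl * low_conf.float()).mean()
```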

Result: Achieved state-of-the-art performance on RAF-DB and AffectNet datasets. Surpassed fully supervised methods even without using entire dataset.

Conclusion: The proposed semi-supervised approach with DTA and SNL is effective for facial expression recognition, demonstrating superior performance while reducing reliance on labeled data.

Abstract: Facial expression recognition is a key task in human-computer interaction and affective computing. However, acquiring a large amount of labeled facial expression data is often costly. Therefore, it is particularly important to design a semi-supervised facial expression recognition algorithm that makes full use of both labeled and unlabeled data. In this paper, we propose a semi-supervised facial expression recognition algorithm based on Dynamic Threshold Adjustment (DTA) and Selective Negative Learning (SNL). Initially, we designed strategies for local attention enhancement and random dropout of feature maps during feature extraction, which strengthen the representation of local features while ensuring the model does not overfit to any specific local area. Furthermore, this study introduces a dynamic thresholding method to adapt to the requirements of the semi-supervised learning framework for facial expression recognition tasks, and through a selective negative learning strategy, it fully utilizes unlabeled samples with low confidence by mining useful expression information from complementary labels, achieving impressive results. We have achieved state-of-the-art performance on the RAF-DB and AffectNet datasets. Our method surpasses fully supervised methods even without using the entire dataset, which proves the effectiveness of our approach.

[124] What’s Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews

Fanxiao Li, Jiaying Wu, Tingchao Fu, Dayang Li, Herun Wan, Wei Zhou, Min-Yen Kan

Main category: cs.CV

TL;DR: The paper addresses interpretation drift in social media news previews where image-headline pairs omit crucial context, leading readers to form judgments diverging from full articles. It introduces a pipeline to simulate preview vs. context understanding, creates the MM-Misleading benchmark, evaluates LVLMs’ blind spots, and proposes OMGuard with fine-tuning and rationale-guided correction.

DetailsMotivation: Social media news previews (image-headline pairs) can induce interpretation drift by selectively omitting crucial context, leading readers to form judgments that diverge from what the full article conveys. This covert harm is harder to detect than explicit misinformation and remains underexplored.

Method: Developed a multi-stage pipeline that disentangles and simulates preview-based versus context-based understanding to construct the MM-Misleading benchmark. Proposed OMGuard with two components: (1) Interpretation-Aware Fine-Tuning to improve multimodal misleadingness detection, and (2) Rationale-Guided Misleading Content Correction that uses explicit rationales to guide headline rewriting and reduce misleading impressions.

Result: OMGuard lifts an 8B model’s detection accuracy to match a 235B LVLM and delivers markedly stronger end-to-end correction. Analysis reveals that misleadingness typically stems from local narrative shifts (e.g., missing background) rather than global frame changes, and identifies image-driven scenarios where text-only correction fails, highlighting the necessity of visual interventions.

Conclusion: The paper addresses the underexplored problem of interpretation drift in social media news previews, provides a systematic evaluation framework through MM-Misleading benchmark, and demonstrates that OMGuard effectively improves detection and correction of omission-based misleadingness, with insights into the nature of misleading content and the importance of multimodal approaches.

Abstract: Even when factually correct, social-media news previews (image-headline pairs) can induce interpretation drift: by selectively omitting crucial context, they lead readers to form judgments that diverge from what the full article conveys. This covert harm is harder to detect than explicit misinformation yet remains underexplored. To address this gap, we develop a multi-stage pipeline that disentangles and simulates preview-based versus context-based understanding, enabling construction of the MM-Misleading benchmark. Using this benchmark, we systematically evaluate open-source LVLMs and uncover pronounced blind spots in omission-based misleadingness detection. We further propose OMGuard, which integrates (1) Interpretation-Aware Fine-Tuning, which improves multimodal misleadingness detection, and (2) Rationale-Guided Misleading Content Correction, which uses explicit rationales to guide headline rewriting and reduce misleading impressions. Experiments show that OMGuard lifts an 8B model’s detection accuracy to match a 235B LVLM and delivers markedly stronger end-to-end correction. Further analysis reveals that misleadingness typically stems from local narrative shifts (e.g., missing background) rather than global frame changes, and identifies image-driven scenarios where text-only correction fails, highlighting the necessity of visual interventions.

[125] Towards Generalized Multi-Image Editing for Unified Multimodal Models

Pengcheng Xu, Peng Tang, Donghao Luo, Xiaobin Hu, Weichu Cui, Qingdong He, Zhennan Chen, Jiangning Zhang, Charles Ling, Boyu Wang

Main category: cs.CV

TL;DR: A scalable multi-image editing framework for Unified Multimodal Models that maintains visual consistency and disambiguates visual cues across multiple input images using learnable latent separators and sinusoidal index encoding.

DetailsMotivation: Current Unified Multimodal Models (UMMs) are limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images, creating a need for better multi-image editing capabilities.

Method: Two key innovations: 1) Learnable latent separators that explicitly differentiate each reference image in latent space for accurate disentangled conditioning, and 2) Sinusoidal index encoding that assigns visual tokens from the same image continuous sinusoidal embeddings to provide explicit image identity while allowing generalization to variable input counts. Also uses inverse dataset construction methodology for training.
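
A minimal sketch of sinusoidal index encoding, reusing the Transformer positional-encoding form so that unseen image counts extrapolate naturally; the dimensions and the additive combination with the visual tokens are assumptions:

```python
import torch

def sinusoidal_index_embedding(index, dim):
    """Continuous sinusoidal embedding of an image index (same functional form as
    Transformer positional encodings), so index N+1 extrapolates smoothly."""
    i = torch.arange(dim // 2, dtype=torch.float32)
    freq = 10000.0 ** (-2 * i / dim)
    angles = index * freq
    return torch.cat([torch.sin(angles), torch.cos(angles)])

def tag_visual_tokens(token_lists, dim):
    """token_lists: list over images of (T_k, dim) token tensors.
    Adds each image's index embedding to all of its tokens (broadcast over T_k)."""
    return [toks + sinusoidal_index_embedding(k, dim)
            for k, toks in enumerate(token_lists)]
```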

Result: Clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines on diverse multi-image editing tasks, demonstrating advantages in consistency and generalization ability.

Conclusion: The proposed scalable multi-image editing framework effectively addresses limitations of current UMMs by enabling explicit image identity distinction and generalization to variable input counts, validated through comprehensive experiments.

Abstract: Unified Multimodal Models (UMMs) integrate multimodal understanding and generation, yet they are limited in maintaining visual consistency and disambiguating visual cues when referencing details across multiple input images. In this work, we propose a scalable multi-image editing framework for UMMs that explicitly distinguishes image identities and generalizes to variable input counts. Algorithmically, we introduce two innovations: 1) Learnable latent separators explicitly differentiate each reference image in the latent space, enabling accurate and disentangled conditioning. 2) A sinusoidal index encoding assigns visual tokens from the same image a continuous sinusoidal index embedding, which provides explicit image identity while allowing generalization and extrapolation to a variable number of inputs. To facilitate training and evaluation, we establish a high-fidelity benchmark using an inverse dataset construction methodology to guarantee artifact-free, achievable outputs. Experiments show clear improvements in semantic consistency, visual fidelity, and cross-image integration over prior baselines on diverse multi-image editing tasks, validating our advantages in consistency and generalization ability.

[126] Orient Anything V2: Unifying Orientation and Rotation Understanding

Zehan Wang, Ziang Zhang, Jiayang Xu, Jialei Wang, Tianyu Pang, Chao Du, HengShuang Zhao, Zhou Zhao

Main category: cs.CV

TL;DR: Orient Anything V2 is an enhanced foundation model for unified 3D orientation and rotation understanding from images, improving upon V1 to handle diverse rotational symmetries and estimate relative rotations through four key innovations.

DetailsMotivation: The paper aims to address limitations in existing orientation estimation models that struggle with objects having diverse rotational symmetries and cannot directly estimate relative rotations between objects. Current approaches often fail to properly model rotational symmetry and lack generalization across different object categories.

Method: Four key innovations: 1) Scalable 3D assets synthesized by generative models for broad category coverage, 2) Model-in-the-loop annotation system to identify 0-N valid front faces per object, 3) Symmetry-aware periodic distribution fitting objective to capture plausible front-facing orientations, 4) Multi-frame architecture for direct relative rotation prediction.
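
The symmetry-aware objective can be sketched as fitting a periodic target distribution over azimuth with one mode per valid front face; the von Mises form and the concentration value are assumptions about how such a target might be built:

```python
import numpy as np

def periodic_target(azimuth_deg, n_fold, n_bins=360, kappa=50.0):
    """Von Mises mixture over azimuth bins with one mode per valid front face
    (period 360/n_fold degrees), normalized to a probability distribution.
    A model fit against this target (e.g., by cross-entropy) treats all
    symmetry-equivalent orientations as equally correct."""
    bins = np.arange(n_bins) * (360.0 / n_bins)
    p = np.zeros(n_bins)
    for k in range(n_fold):
        mode = (azimuth_deg + k * 360.0 / n_fold) % 360.0
        diff = np.deg2rad(bins - mode)
        p += np.exp(kappa * (np.cos(diff) - 1))  # unnormalized von Mises bump
    return p / p.sum()
```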

Result: Achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. Demonstrates strong generalization and broad applicability in diverse downstream tasks.

Conclusion: Orient Anything V2 significantly advances orientation estimation capabilities by handling diverse rotational symmetries and enabling relative rotation estimation, broadening the applicability of orientation understanding in various computer vision tasks.

Abstract: This work presents Orient Anything V2, an enhanced foundation model for unified understanding of object 3D orientation and rotation from single or paired images. Building upon Orient Anything V1, which defines orientation via a single unique front face, V2 extends this capability to handle objects with diverse rotational symmetries and directly estimate relative rotations. These improvements are enabled by four key innovations: 1) Scalable 3D assets synthesized by generative models, ensuring broad category coverage and balanced data distribution; 2) An efficient, model-in-the-loop annotation system that robustly identifies 0 to N valid front faces for each object; 3) A symmetry-aware, periodic distribution fitting objective that captures all plausible front-facing orientations, effectively modeling object rotational symmetry; 4) A multi-frame architecture that directly predicts relative object rotations. Extensive experiments show that Orient Anything V2 achieves state-of-the-art zero-shot performance on orientation estimation, 6DoF pose estimation, and object symmetry recognition across 11 widely used benchmarks. The model demonstrates strong generalization, significantly broadening the applicability of orientation estimation in diverse downstream tasks.

[127] Generalizable and Adaptive Continual Learning Framework for AI-generated Image Detection

Hanyi Wang, Jun Lan, Yaoyu Kang, Huijia Zhu, Weiqiang Wang, Zhuosheng Zhang, Shilin Wang

Main category: cs.CV

TL;DR: A three-stage domain continual learning framework for AI-generated image detection that adapts to evolving generative models through parameter-efficient fine-tuning, continual learning with data augmentation, and linear interpolation based on mode connectivity.

DetailsMotivation: AI-generated images threaten online information authenticity; current detection methods struggle to generalize to unseen generative models and adapt to rapidly evolving generation techniques, risking ineffectiveness in real-world applications.

Method: Three-stage framework: 1) Parameter-efficient fine-tuning for transferable offline detection model; 2) Continual learning with data augmentation chain and K-FAC method to prevent catastrophic forgetting; 3) Linear interpolation based on Linear Mode Connectivity to capture commonalities across models.
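
The third stage reduces to averaging the weights of two checkpoints assumed to lie in a linearly connected low-loss region. A minimal sketch (the merge ratio and which checkpoints are merged are assumptions for illustration):

```python
import copy

def interpolate_checkpoints(model_a, model_b, alpha: float = 0.5) -> dict:
    """Linearly interpolate two checkpoints, relying on Linear Mode
    Connectivity: if both sit in one connected low-loss basin, points on
    the segment between them also perform well."""
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    merged = copy.deepcopy(sd_a)
    for key in merged:
        if merged[key].is_floating_point():      # skip integer buffers
            merged[key] = (1.0 - alpha) * sd_a[key] + alpha * sd_b[key]
    return merged

# Usage: detector.load_state_dict(interpolate_checkpoints(old_detector, adapted_detector))
```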

Result: Offline detectors surpass leading baseline by +5.51% mean average precision; continual learning achieves 92.20% average accuracy, outperforming state-of-the-art methods on benchmark of 27 generative models (GANs, deepfakes, diffusion models).

Conclusion: The proposed framework effectively addresses generalization and adaptability challenges in AI-generated image detection, providing robust performance against evolving generative techniques through continual learning and model adaptation strategies.

Abstract: The malicious misuse and widespread dissemination of AI-generated images pose a significant threat to the authenticity of online information. Current detection methods often struggle to generalize to unseen generative models, and the rapid evolution of generative techniques continuously exacerbates this challenge. Without adaptability, detection models risk becoming ineffective in real-world applications. To address this critical issue, we propose a novel three-stage domain continual learning framework designed for continuous adaptation to evolving generative models. In the first stage, we employ a strategic parameter-efficient fine-tuning approach to develop a transferable offline detection model with strong generalization capabilities. Building upon this foundation, the second stage integrates unseen data streams into a continual learning process. To efficiently learn from limited samples of novel generated models and mitigate overfitting, we design a data augmentation chain with progressively increasing complexity. Furthermore, we leverage the Kronecker-Factored Approximate Curvature (K-FAC) method to approximate the Hessian and alleviate catastrophic forgetting. Finally, the third stage utilizes a linear interpolation strategy based on Linear Mode Connectivity, effectively capturing commonalities across diverse generative models and further enhancing overall performance. We establish a comprehensive benchmark of 27 generative models, including GANs, deepfakes, and diffusion models, chronologically structured up to August 2024 to simulate real-world scenarios. Extensive experiments demonstrate that our initial offline detectors surpass the leading baseline by +5.51% in terms of mean average precision. Our continual learning strategy achieves an average accuracy of 92.20%, outperforming state-of-the-art methods.

[128] GS-DMSR: Dynamic Sensitive Multi-scale Manifold Enhancement for Accelerated High-Quality 3D Gaussian Splatting

Nengbo Lu, Minghua Pan, Shaohua Sun, Yizhou Liang

Main category: cs.CV

TL;DR: GS-DMSR: A method for 3D dynamic scene reconstruction that balances convergence speed and rendering quality using adaptive gradient focusing and multi-scale manifold enhancement.

DetailsMotivation: The paper addresses the challenge of balancing model convergence rate and rendering quality in 3D dynamic scene reconstruction, particularly for scenes with complex dynamic motions where high-precision modeling is needed.

Method: Proposes GS-DMSR with two key components: 1) Adaptive gradient focusing mechanism that analyzes dynamic evolution of Gaussian attributes to identify motion state differences and apply differentiated optimization strategies, 2) Multi-scale manifold enhancement module using collaborative optimization of implicit nonlinear decoder and explicit deformation field for complex deformation scenes.

Result: Achieves up to 96 FPS on synthetic datasets while reducing both storage overhead and training time. The method improves model convergence rate significantly.

Conclusion: GS-DMSR effectively addresses the convergence-speed vs. quality trade-off in 3D dynamic scene reconstruction, enabling faster training and rendering while maintaining high precision for complex dynamic motions.

Abstract: In the field of 3D dynamic scene reconstruction, how to balance model convergence rate and rendering quality has long been a critical challenge that urgently needs to be addressed, particularly in high-precision modeling of scenes with complex dynamic motions. To tackle this issue, this study proposes the GS-DMSR method. By quantitatively analyzing the dynamic evolution of Gaussian attributes, its adaptive gradient focusing mechanism dynamically identifies significant differences in the motion states of Gaussian models, then applies differentiated optimization strategies to Gaussians with varying degrees of significance, thereby significantly improving the model convergence rate. Additionally, this research integrates a multi-scale manifold enhancement module, which leverages the collaborative optimization of an implicit nonlinear decoder and an explicit deformation field to enhance the modeling efficiency for complex deformation scenes. Experimental results demonstrate that this method achieves a frame rate of up to 96 FPS on synthetic datasets, while effectively reducing both storage overhead and training time. Our code and data are available at https://anonymous.4open.science/r/GS-DMSR-2212.

[129] Quantifying and Inducing Shape Bias in CNNs via Max-Pool Dilation

Takito Sawada, Akinori Iwata, Masahiro Okuda

Main category: cs.CV

TL;DR: Proposes a data-driven metric to quantify shape-texture balance in datasets and introduces an efficient adaptation method using modified max-pooling dilation to improve CNN performance on shape-dominant data.

DetailsMotivation: CNNs have inherent texture bias that works well for natural images but degrades performance on shape-dominant data like illustrations and sketches. Existing shape-biased models lack quantitative metrics to identify which datasets would benefit from such modifications.

Method: 1) Proposes a metric using SSIM between image luminance channels and L0-smoothed counterparts to quantify shape-texture balance. 2) Introduces an efficient adaptation method that modifies dilation of max-pooling operations while keeping convolutional weights frozen, requiring only final classification layer training.
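
Both ingredients are compact enough to sketch. The metric below uses skimage's SSIM and OpenCV-contrib's l0Smooth (the lambda value and the luminance conversion are assumed settings), and the architectural change amounts to a max-pooling layer with increased dilation:

```python
import cv2
import numpy as np
import torch.nn as nn
from skimage.metrics import structural_similarity

def shape_texture_score(image_bgr: np.ndarray) -> float:
    """SSIM between the luminance channel and its L0-smoothed counterpart.

    L0 smoothing removes fine texture while keeping strong edges, so a high
    SSIM means the image was already shape-dominant, while a low SSIM
    indicates texture-rich content. l0Smooth requires the opencv-contrib
    build, and lambda_=0.02 is an assumed setting.
    """
    luma = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)[..., 0]
    smoothed = cv2.ximgproc.l0Smooth(image_bgr, lambda_=0.02)
    smoothed_luma = cv2.cvtColor(smoothed, cv2.COLOR_BGR2YCrCb)[..., 0]
    return structural_similarity(luma, smoothed_luma)

# The adaptation itself: widen max-pooling windows via dilation while all
# convolutional weights stay frozen; only the classifier head is retrained.
shape_biased_pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=2)
```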

Result: The approach consistently improves classification accuracy on shape-dominant datasets, especially in low-data regimes where full fine-tuning is impractical.

Conclusion: Provides a practical solution for adapting CNNs to shape-dominant data through a quantitative dataset analysis metric and computationally efficient architectural modification that maintains frozen convolutional weights.

Abstract: Convolutional Neural Networks (CNNs) are known to exhibit a strong texture bias, favoring local patterns over global shape information, a tendency inherent to their convolutional architecture. While this bias is beneficial for texture-rich natural images, it often degrades performance on shape-dominant data such as illustrations and sketches. Although prior work has proposed shape-biased models to mitigate this issue, these approaches lack a quantitative metric for identifying which datasets would actually benefit from such modifications. To address this gap, we propose a data-driven metric that quantifies the shape-texture balance of a dataset by computing the Structural Similarity Index (SSIM) between each image’s luminance channel and its L0-smoothed counterpart. Building on this metric, we further introduce a computationally efficient adaptation method that promotes shape bias by modifying the dilation of max-pooling operations while keeping convolutional weights frozen. Experimental results show that this approach, which requires training only the final classification layer, consistently improves classification accuracy on shape-dominant datasets, particularly in low-data regimes where full fine-tuning is impractical.

[130] SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes

Chuhan Wang, Xintong Li, Jennifer Yuntong Zhang, Junda Wu, Chengkai Huang, Lina Yao, Julian McAuley, Jingbo Shang

Main category: cs.CV

TL;DR: SceneAlign improves multimodal reasoning faithfulness by using scene graphs to create hard negative examples that force models to properly ground reasoning in visual information, preventing language prior exploitation.

DetailsMotivation: Multimodal LLMs struggle with faithful reasoning in complex visual scenes, exhibiting hallucinations, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches fail because models can exploit language priors to bypass visual grounding.

Method: SceneAlign uses scene graphs as structured visual information to perform controllable structural interventions. It identifies reasoning-critical nodes and perturbs them through four targeted strategies mimicking typical grounding failures, creating hard negative rationales that are linguistically plausible but visually inaccurate. These contrastive pairs are used in Direct Preference Optimization.
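
A toy sketch of one perturbation strategy (entity swapping) on a triple-based scene graph; the distractor pool and node-selection rule are placeholders, and the real system applies four such strategies plus rationale regeneration:

```python
import random

# Toy scene graph as (subject, relation, object) triples.
scene_graph = [("man", "holding", "umbrella"),
               ("umbrella", "above", "dog"),
               ("dog", "next to", "bench")]

def perturb_entity(graph, distractors=("cat", "chair", "ball")):
    """Swap a reasoning-critical entity for a plausible distractor, yielding a
    rationale that stays linguistically fluent but is no longer grounded in
    the image. Distractor pool and selection rule are illustrative only."""
    i = random.randrange(len(graph))
    s, r, o = graph[i]
    negative = list(graph)
    negative[i] = (s, r, random.choice(distractors))
    return negative

hard_negative_graph = perturb_entity(scene_graph)
# Rationales re-generated from `hard_negative_graph` would form the rejected
# side of a DPO preference pair.
```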

Result: Across seven visual reasoning benchmarks, SceneAlign consistently improves both answer accuracy and reasoning faithfulness, demonstrating effectiveness of grounding-aware alignment for multimodal reasoning.

Conclusion: SceneAlign effectively addresses multimodal reasoning unfaithfulness by forcing models to properly ground reasoning in visual information through structured interventions and contrastive learning, outperforming existing preference-based approaches.

Abstract: Multimodal large language models often struggle with faithful reasoning in complex visual scenes, where intricate entities and relations require precise visual grounding at each step. This reasoning unfaithfulness frequently manifests as hallucinated entities, mis-grounded relations, skipped steps, and over-specified reasoning. Existing preference-based approaches, typically relying on textual perturbations or answer-conditioned rationales, fail to address this challenge as they allow models to exploit language priors to bypass visual grounding. To address this, we propose SceneAlign, a framework that leverages scene graphs as structured visual information to perform controllable structural interventions. By identifying reasoning-critical nodes and perturbing them through four targeted strategies that mimic typical grounding failures, SceneAlign constructs hard negative rationales that remain linguistically plausible but are grounded in inaccurate visual facts. These contrastive pairs are used in Direct Preference Optimization to steer models toward fine-grained, structure-faithful reasoning. Across seven visual reasoning benchmarks, SceneAlign consistently improves answer accuracy and reasoning faithfulness, highlighting the effectiveness of grounding-aware alignment for multimodal reasoning.

[131] Learning Geometric Invariance for Gait Recognition

Zengbin Wang, Junjie Li, Saihui Hou, Xu Liu, Chunshui Cao, Yongzhen Huang, Muyi Sun, Siye Wang, Man Zhang

Main category: cs.CV

TL;DR: RRS-Gait proposes a geometric transformation invariance framework for gait recognition, treating variations as combinations of Reflect, Rotate, and Scale transformations to achieve identity invariance.

DetailsMotivation: Most gait recognition models implicitly learn common traits across different conditions, but few explicitly explore inherent relations between gait conditions. The authors aim to establish connections among different gait conditions by viewing variations as geometric transformations.

Method: Proposes RRS-Gait framework that explores three geometric transformations (Reflect, Rotate, Scale). It adjusts convolution kernels based on specific transformations to achieve approximate feature equivariance, then feeds equivariant-aware features into global pooling for invariance-aware learning.
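
A minimal sketch of the kernel-adjustment idea for the Reflect case: convolving with a horizontally flipped copy of the same kernel yields an approximately reflect-equivariant feature pair (Rotate and Scale would adjust the kernel analogously; this is an illustration, not the paper's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReflectAwareConv(nn.Module):
    """Shares one set of weights between a standard convolution and a
    horizontally flipped copy, so the two outputs relate (approximately)
    the way features of an image and its mirror image do. Both would then
    be fed to global pooling for invariance-aware learning."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x):
        y = self.conv(x)
        y_reflect = F.conv2d(x, self.conv.weight.flip(-1), self.conv.bias,
                             padding=self.conv.padding)
        return y, y_reflect

feat, feat_reflect = ReflectAwareConv(3, 16)(torch.randn(2, 3, 64, 64))
```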

Result: Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.

Conclusion: Viewing gait variations as geometric transformations provides a new perspective for gait recognition, where achieving geometric invariance naturally leads to identity invariance, demonstrated through the effective RRS-Gait framework.

Abstract: The goal of gait recognition is to extract identity-invariant features of an individual under various gait conditions, e.g., cross-view and cross-clothing. Most gait models strive to implicitly learn the common traits across different gait conditions in a data-driven manner to pull different gait conditions closer for recognition. However, relatively few studies have explicitly explored the inherent relations between different gait conditions. For this purpose, we attempt to establish connections among different gait conditions and propose a new perspective to achieve gait recognition: variations in different gait conditions can be approximately viewed as a combination of geometric transformations. In this case, all we need is to determine the types of geometric transformations and achieve geometric invariance, then identity invariance naturally follows. As an initial attempt, we explore three common geometric transformations (i.e., Reflect, Rotate, and Scale) and design a Reflect-Rotate-Scale invariance learning framework, named RRS-Gait. Specifically, it first flexibly adjusts the convolution kernel based on the specific geometric transformations to achieve approximate feature equivariance. Then these three equivariant-aware features are respectively fed into a global pooling operation for final invariance-aware learning. Extensive experiments on four popular gait datasets (Gait3D, GREW, CCPG, SUSTech1K) show superior performance across various gait conditions.

[132] LatentVLA: Efficient Vision-Language Models for Autonomous Driving via Latent Action Prediction

Chengen Xie, Bin Sun, Tianyu Li, Junjie Wu, Zhihui Hao, XianPeng Lang, Hongyang Li

Main category: cs.CV

TL;DR: LatentVLA: A novel Vision-Language-Action framework that uses self-supervised latent action prediction to train without language annotations, eliminating linguistic bias while achieving state-of-the-art autonomous driving performance with real-time efficiency.

DetailsMotivation: Current end-to-end autonomous driving models struggle with rare, long-tail scenarios due to limited dataset diversity. Existing VLA models have three critical issues: numerical imprecision in trajectory prediction from discrete tokenization, heavy reliance on language annotations that introduce linguistic bias and annotation burden, and computational inefficiency from multi-step reasoning that hinders real-time deployment.

Method: LatentVLA employs self-supervised latent action prediction to train VLA models without language annotations, learning rich driving representations from unlabeled trajectory data. It uses knowledge distillation to transfer generalization capabilities from VLA models to efficient vision-based networks.

Result: LatentVLA achieves state-of-the-art on NAVSIM benchmark with PDMS score of 92.4 and demonstrates strong zero-shot generalization on nuScenes benchmark. The framework achieves both robust performance and real-time efficiency.

Conclusion: LatentVLA successfully addresses key limitations of existing VLA models by eliminating linguistic bias through self-supervised learning, improving computational efficiency via knowledge distillation, and achieving superior performance on autonomous driving benchmarks while maintaining real-time capabilities.

Abstract: End-to-end autonomous driving models trained on large-scale datasets perform well in common scenarios but struggle with rare, long-tail situations due to limited scenario diversity. Recent Vision-Language-Action (VLA) models leverage broad knowledge from pre-trained vision-language models to address this limitation, yet face critical challenges: (1) numerical imprecision in trajectory prediction due to discrete tokenization, (2) heavy reliance on language annotations that introduce linguistic bias and annotation burden, and (3) computational inefficiency from multi-step chain-of-thought reasoning that hinders real-time deployment. We propose LatentVLA, a novel framework that employs self-supervised latent action prediction to train VLA models without language annotations, eliminating linguistic bias while learning rich driving representations from unlabeled trajectory data. Through knowledge distillation, LatentVLA transfers the generalization capabilities of VLA models to efficient vision-based networks, achieving both robust performance and real-time efficiency. LatentVLA establishes a new state-of-the-art on the NAVSIM benchmark with a PDMS score of 92.4 and demonstrates strong zero-shot generalization on the nuScenes benchmark.

[133] Compressing image encoders via latent distillation

Caroline Mazini Rodrigues, Nicolas Keriven, Thomas Maugey

Main category: cs.CV

TL;DR: A knowledge distillation method to compress deep learning image compression models by reducing encoder size while preserving quality, making them practical for hardware-constrained applications.

DetailsMotivation: Deep learning models for image compression are typically complex, heavyweight, and require substantial training data and computational resources, making them impractical for hardware-constrained applications.

Method: Proposes a simplified knowledge distillation strategy to approximate the latent space of original heavyweight models, creating lightweight encoders with less data and shorter training time.
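
The core training step is easy to sketch: the lightweight student is fit to reproduce the frozen heavyweight encoder's latents directly, instead of being trained with the original rate-distortion loss. The MSE objective below is an assumption; the summary only says the student approximates the teacher's latent space:

```python
import torch
import torch.nn.functional as F

def latent_distillation_step(light_encoder, heavy_encoder, images, optimizer):
    """One distillation step: regress the student's latents onto the frozen
    teacher's latents. Works with any pair of encoders mapping images to
    latents of the same shape."""
    with torch.no_grad():
        target = heavy_encoder(images)       # frozen teacher latents
    pred = light_encoder(images)             # lightweight student latents
    loss = F.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```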

Result: The method preserves reconstruction quality and statistical fidelity better than training lightweight encoders with the original loss, evaluated across two different architectures on image compression tasks.

Conclusion: The approach yields practical lightweight encoders suitable for resource-limited environments while maintaining performance comparable to original heavyweight models.

Abstract: Deep learning models for image compression often face practical limitations in hardware-constrained applications. Although these models achieve high-quality reconstructions, they are typically complex, heavyweight, and require substantial training data and computational resources. We propose a methodology to partially compress these networks by reducing the size of their encoders. Our approach uses a simplified knowledge distillation strategy to approximate the latent space of the original models with less data and shorter training, yielding lightweight encoders from heavyweight ones. We evaluate the resulting lightweight encoders across two different architectures on the image compression task. Experiments show that our method preserves reconstruction quality and statistical fidelity better than training lightweight encoders with the original loss, making it practical for resource-limited environments.

[134] SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving

Jingyu Li, Junjie Wu, Dongnan Hu, Xiangkai Huang, Bin Sun, Zhihui Hao, Xianpeng Lang, Xiatian Zhu, Li Zhang

Main category: cs.CV

TL;DR: SGDrive structures VLM representations with driving-specific hierarchies (scene-agent-goal) to improve autonomous driving planning, achieving SOTA on NAVSIM benchmark.

DetailsMotivation: Generalist Vision-Language Models lack specialized understanding of driving-specific 3D spatial-temporal reasoning needed for safe trajectory planning in autonomous driving.

Method: Proposes SGDrive framework that structures VLM representation learning around a scene-agent-goal hierarchy mirroring human driving cognition, decomposing driving understanding into environmental perception, agent attention, and goal formulation.

Result: Achieves state-of-the-art performance among camera-only methods on NAVSIM benchmark (both PDMS and EPDMS metrics), validating hierarchical knowledge structuring effectiveness.

Conclusion: Explicitly structuring VLM representations with driving-specific knowledge hierarchies effectively adapts generalist models to autonomous driving by providing the structured spatial-temporal representations they inherently lack.

Abstract: Recent end-to-end autonomous driving approaches have leveraged Vision-Language Models (VLMs) to enhance planning capabilities in complex driving scenarios. However, VLMs are inherently trained as generalist models, lacking specialized understanding of driving-specific reasoning in 3D space and time. When applied to autonomous driving, these models struggle to establish structured spatial-temporal representations that capture geometric relationships, scene context, and motion patterns critical for safe trajectory planning. To address these limitations, we propose SGDrive, a novel framework that explicitly structures the VLM’s representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition: drivers first perceive the overall environment (scene context), then attend to safety-critical agents and their behaviors, and finally formulate short-term goals before executing actions. This hierarchical decomposition provides the structured spatial-temporal representation that generalist VLMs lack, integrating multi-level information into a compact yet comprehensive format for trajectory planning. Extensive experiments on the NAVSIM benchmark demonstrate that SGDrive achieves state-of-the-art performance among camera-only methods on both PDMS and EPDMS, validating the effectiveness of hierarchical knowledge structuring for adapting generalist VLMs to autonomous driving.

[135] SketchVL: Policy Optimization via Fine-Grained Credit Assignment for Chart Understanding and More

Muye Huang, Lingling Zhang, Yifei Li, Yaqiang Wu, Jun Liu

Main category: cs.CV

TL;DR: SketchVL is a multimodal LLM that uses a novel RL algorithm (FinePO) with fine-grained credit assignment for chart understanding, achieving 7.23% performance gain over base models.

DetailsMotivation: Existing MLLMs struggle with chart understanding due to complex visual reasoning requirements. RL-trained MLLMs face credit assignment problems: they cannot distinguish correct from incorrect reasoning steps within a single response.

Method: SketchVL draws intermediate reasoning steps as markers on images and feeds annotated images back to itself. It uses FinePO algorithm with Fine-grained Process Reward Model (FinePRM) to score each drawing action within trajectories for precise credit assignment.
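
A rough sketch of the fine-grained credit assignment: per-step scores from a process reward model, signed by trajectory-level success, weight each action's log-probability. The centering scheme below is an assumption; FinePO's exact weighting is not given in the summary:

```python
import torch

def finepo_style_loss(step_logprobs, step_scores, trajectory_reward):
    """step_logprobs:     (T,) log-probabilities of each drawing action
    step_scores:       (T,) process-reward-model scores in [0, 1]
    trajectory_reward: +1 for a globally successful trajectory, -1 otherwise

    Good steps in successful trajectories get positive advantages (reinforced
    more strongly); bad steps in failed trajectories get large negative ones."""
    advantages = trajectory_reward * (step_scores - 0.5)  # per-step credit
    return -(advantages.detach() * step_logprobs).mean()

loss = finepo_style_loss(torch.randn(5, requires_grad=True),
                         torch.tensor([0.9, 0.8, 0.2, 0.7, 0.6]), +1.0)
```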

Result: SketchVL achieves average 7.23% performance gain over base model across chart datasets, natural image datasets, and mathematics. It learns to align step-level behavior with FinePRM.

Conclusion: SketchVL with FinePO provides promising direction for training powerful reasoning models through fine-grained reinforcement signals and multi-step visual reasoning.

Abstract: Charts are high-density visual carriers of complex data and a medium for information extraction and analysis. Due to the need for precise and complex visual reasoning, automated chart understanding poses a significant challenge to existing Multimodal Large Language Models (MLLMs). Many MLLMs trained with reinforcement learning (RL) face the challenge of credit assignment. Their advantage estimation, typically performed at the trajectory level, cannot distinguish between correct and incorrect reasoning steps within a single generated response. To address this limitation, we introduce SketchVL, a novel MLLM optimized with FinePO, a new RL algorithm designed for fine-grained credit assignment within each trajectory. SketchVL’s methodology involves drawing its intermediate reasoning steps as markers on the image and feeding the annotated image back to itself, creating a robust, multi-step reasoning process. During training, the FinePO algorithm leverages a Fine-grained Process Reward Model (FinePRM) to score each drawing action within a trajectory, thereby precisely assigning credit for each step. This mechanism allows FinePO to more strongly reward correct tokens when a trajectory is globally successful, and more heavily penalize incorrect tokens when the trajectory is globally suboptimal, thus achieving fine-grained reinforcement signals. Experiments show that SketchVL learns to align its step-level behavior with the FinePRM, achieving an average performance gain of 7.23% over its base model across chart datasets, natural image datasets, and mathematics, providing a promising new direction for training powerful reasoning models.

[136] Rotate Your Character: Revisiting Video Diffusion Models for High-Quality 3D Character Generation

Jin Wang, Jianxiang Lu, Comi Chen, Guangzheng Xu, Haoyu Yang, Peng Chen, Na Zhang, Yifan Xu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo

Main category: cs.CV

TL;DR: RCM is an image-to-video diffusion framework for high-quality 3D character generation from single images, featuring canonical pose transfer, 1024x1024 resolution orbital videos, controllable camera positions, and multi-view conditioning.

DetailsMotivation: Generating high-quality 3D characters from single images is challenging due to complex body poses and self-occlusion, creating a need for better solutions in digital content creation.

Method: RCM uses an advanced image-to-video diffusion framework that transfers characters with complex poses into canonical poses, enabling consistent novel view synthesis across the entire viewing orbit with high-resolution 1024x1024 video generation and controllable camera positions.

Result: Extensive experiments show RCM outperforms state-of-the-art methods in both novel view synthesis and 3D generation quality.

Conclusion: RCM provides a superior solution for high-quality 3D character generation from single images with its advanced diffusion framework, canonical pose handling, and multi-view capabilities.

Abstract: Generating high-quality 3D characters from single images remains a significant challenge in digital content creation, particularly due to complex body poses and self-occlusion. In this paper, we present RCM (Rotate your Character Model), an advanced image-to-video diffusion framework tailored for high-quality novel view synthesis (NVS) and 3D character generation. Compared to existing diffusion-based approaches, RCM offers several key advantages: (1) transferring characters with any complex poses into a canonical pose, enabling consistent novel view synthesis across the entire viewing orbit, (2) high-resolution orbital video generation at 1024x1024 resolution, (3) controllable observation positions given different initial camera poses, and (4) multi-view conditioning supporting up to 4 input images, accommodating diverse user scenarios. Extensive experiments demonstrate that RCM outperforms state-of-the-art methods in both novel view synthesis and 3D generation quality.

[137] TAGRPO: Boosting GRPO on Image-to-Video Generation with Direct Trajectory Alignment

Jin Wang, Jianxiang Lu, Guangzheng Xu, Comi Chen, Haoyu Yang, Linqing Wang, Peng Chen, Mingtao Chen, Zhichao Hu, Longhuang Wu, Shuai Shao, Qinglin Lu, Ping Luo

Main category: cs.CV

TL;DR: TAGRPO is a robust post-training framework for image-to-video models that improves upon existing GRPO techniques by using contrastive learning on intermediate latents from identical initial noise conditions.

DetailsMotivation: Direct application of Group Relative Policy Optimization (GRPO) techniques from text-to-image/video generation to image-to-video (I2V) models fails to yield consistent reward improvements, necessitating a specialized approach for I2V optimization.

Method: TAGRPO uses a novel GRPO loss applied to intermediate latents, leveraging rollout videos generated from identical initial noise for better optimization guidance. It incorporates a memory bank for rollout videos to enhance diversity and reduce computational overhead.
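
A sketch of the trajectory-alignment idea on intermediate latents: pull toward the mean latent of high-reward rollouts and push away from low-reward ones, where all rollouts in the group share the same initial noise. Plain squared distances are used for illustration; the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def trajectory_alignment_loss(latent, pos_latents, neg_latents):
    """latent:      (B, D) current intermediate denoising latents
    pos_latents: (K, B, D) latents of high-reward rollouts (same init noise)
    neg_latents: (K, B, D) latents of low-reward rollouts (same init noise)"""
    pull = F.mse_loss(latent, pos_latents.mean(dim=0))  # align to high reward
    push = F.mse_loss(latent, neg_latents.mean(dim=0))  # repel from low reward
    return pull - push

z = torch.randn(4, 16, requires_grad=True)
loss = trajectory_alignment_loss(z, torch.randn(3, 4, 16), torch.randn(3, 4, 16))
```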

Result: TAGRPO achieves significant improvements over DanceGRPO in I2V generation despite its simplicity, demonstrating better performance in video generation tasks.

Conclusion: The proposed TAGRPO framework provides an effective post-training solution for I2V models by addressing the limitations of existing GRPO techniques through contrastive learning principles and optimized guidance from identical initial noise conditions.

Abstract: Recent studies have demonstrated the efficacy of integrating Group Relative Policy Optimization (GRPO) into flow matching models, particularly for text-to-image and text-to-video generation. However, we find that directly applying these techniques to image-to-video (I2V) models often fails to yield consistent reward improvements. To address this limitation, we present TAGRPO, a robust post-training framework for I2V models inspired by contrastive learning. Our approach is grounded in the observation that rollout videos generated from identical initial noise provide superior guidance for optimization. Leveraging this insight, we propose a novel GRPO loss applied to intermediate latents, encouraging direct alignment with high-reward trajectories while maximizing distance from low-reward counterparts. Furthermore, we introduce a memory bank for rollout videos to enhance diversity and reduce computational overhead. Despite its simplicity, TAGRPO achieves significant improvements over DanceGRPO in I2V generation.

[138] FeatureSLAM: Feature-enriched 3D gaussian splatting SLAM in real time

Christopher Thirgood, Oscar Mendez, Erin Ling, Jon Storey, Simon Hadfield

Main category: cs.CV

TL;DR: Real-time SLAM system combining camera tracking with photorealistic 3D Gaussian Splatting mapping, integrating dense feature rasterization aligned with visual foundation models for open-set semantic capabilities.

DetailsMotivation: To create a real-time SLAM system that goes beyond basic RGB-D input by incorporating semantic features from visual foundation models, enabling both improved tracking/mapping accuracy and new downstream applications like open-set segmentation.

Method: Unifies camera tracking with 3D Gaussian Splatting mapping, integrates dense feature rasterization into novel-view synthesis aligned with visual foundation models, enabling feature-enriched mapping beyond pre-defined class labels.

Result: Achieves real-time tracking with 9% lower pose error and 8% higher mapping accuracy compared to fixed-set SLAM baselines, provides semantic/language masking on par with offline 3DGS models, and maintains state-of-the-art tracking, depth, and RGB rendering.

Conclusion: Feature-embedded SLAM not only enables new downstream applications like free-viewpoint open-set segmentation but also improves underlying tracking and mapping performance, demonstrating the value of integrating visual foundation model features into real-time SLAM systems.

Abstract: We present a real-time tracking SLAM system that unifies efficient camera tracking with photorealistic feature-enriched mapping using 3D Gaussian Splatting (3DGS). Our main contribution is integrating dense feature rasterization into the novel-view synthesis, aligned with a visual foundation model. This yields strong semantics, going beyond basic RGB-D input, aiding both tracking and mapping accuracy. Unlike previous semantic SLAM approaches (which embed pre-defined class labels), FeatureSLAM enables entirely new downstream tasks via free-viewpoint, open-set segmentation. Across standard benchmarks, our method achieves real-time tracking on par with state-of-the-art systems while improving tracking stability and map fidelity without prohibitive compute. Quantitatively, we obtain 9% lower pose error and 8% higher mapping accuracy compared to recent fixed-set SLAM baselines. Our results confirm that real-time feature-embedded SLAM is not only valuable for enabling new downstream applications; it also improves the performance of the underlying tracking and mapping subsystems, providing semantic and language masking results that are on par with offline 3DGS models, alongside state-of-the-art tracking, depth, and RGB rendering.

[139] ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers

Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Jan Niklas Kolf, Marco Huber, Naser Damer, Fadi Boutros

Main category: cs.CV

TL;DR: ViTNT-FIQA is a training-free face image quality assessment method that measures patch embedding stability across Vision Transformer blocks to predict image quality with just one forward pass.

DetailsMotivation: Current FIQA methods either use only final-layer representations or require multiple forward passes/backpropagation. There's a need for efficient training-free methods that can leverage intermediate features without computational overhead.

Method: Measures stability of patch embedding evolution across intermediate ViT blocks by computing Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks, then aggregates into image-level quality scores.
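
The scoring rule is concrete enough to sketch end to end; the sign convention (negated mean drift) and the mean aggregation are one plausible reading of the method:

```python
import torch

@torch.no_grad()
def vitnt_quality_score(block_embeddings):
    """Training-free quality score from the stability of patch embeddings
    across consecutive ViT blocks: stable refinement suggests high quality,
    erratic jumps suggest degradation.

    block_embeddings: list of (num_patches, dim) tensors, one per block,
    e.g. collected with forward hooks on a frozen face-recognition ViT."""
    distances = []
    for prev, curr in zip(block_embeddings[:-1], block_embeddings[1:]):
        prev = torch.nn.functional.normalize(prev, dim=-1)   # L2-normalize
        curr = torch.nn.functional.normalize(curr, dim=-1)
        distances.append((curr - prev).norm(dim=-1).mean())  # per-block drift
    return -torch.stack(distances).mean()  # less drift -> higher quality

blocks = [torch.randn(196, 768) for _ in range(12)]  # stand-in for hooked outputs
score = vitnt_quality_score(blocks)
```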

Result: Achieves competitive performance on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C) while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.

Conclusion: ViTNT-FIQA provides an effective training-free FIQA solution that leverages intermediate feature stability, requires only single forward pass, and works with existing ViT models without modifications.

Abstract: Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.

[140] FlyPose: Towards Robust Human Pose Estimation From Aerial Views

Hassaan Farooq, Marvin Brenner, Peter Stütz

Main category: cs.CV

TL;DR: FlyPose is a lightweight human pose estimation pipeline for UAVs that improves detection accuracy by 6.8 mAP and pose estimation by 16.3 mAP while running at ~20ms inference time on edge hardware.

DetailsMotivation: UAVs operating near humans need accurate real-time human pose estimation from aerial views, which is challenging due to low resolution, steep angles, and occlusions.

Method: Developed FlyPose, a lightweight top-down human pose estimation pipeline trained on multiple datasets, deployed on Jetson Orin AGX hardware onboard UAVs.

Result: Achieved 6.8 mAP improvement in person detection across multiple datasets and 16.3 mAP improvement in 2D pose estimation on UAV-Human dataset, with ~20ms inference latency.

Conclusion: FlyPose enables real-time human pose estimation for UAV applications and contributes FlyPose-104 dataset for challenging aerial perspectives.

Abstract: Unmanned Aerial Vehicles (UAVs) are increasingly deployed in close proximity to humans for applications such as parcel delivery, traffic monitoring, disaster response and infrastructure inspections. Ensuring safe and reliable operation in these human-populated environments demands accurate perception of human poses and actions from an aerial viewpoint. This perspective challenges existing methods with low resolution, steep viewing angles and (self-)occlusion, especially if the application demands real-time feasible models. We train and deploy FlyPose, a lightweight top-down human pose estimation pipeline for aerial imagery. Through multi-dataset training, we achieve an average improvement of 6.8 mAP in person detection across the test sets of Manipal-UAV, VisDrone, HIT-UAV as well as our custom dataset. For 2D human pose estimation we report an improvement of 16.3 mAP on the challenging UAV-Human dataset. FlyPose runs with an inference latency of ~20 milliseconds including preprocessing on a Jetson Orin AGX Developer Kit and is deployed onboard a quadrotor UAV during flight experiments. We also publish FlyPose-104, a small but challenging aerial human pose estimation dataset that includes manual annotations from difficult aerial perspectives: https://github.com/farooqhassaan/FlyPose.

[141] Adaptive Disentangled Representation Learning for Incomplete Multi-View Multi-Label Classification

Quanjiang Li, Zhiming Liu, Tianxiang Xu, Tingjin Luo, Chenping Hou

Main category: cs.CV

TL;DR: ADRL is a multi-view multi-label learning method that addresses feature absence and incomplete annotations through adaptive disentangled representation learning, robust view completion, and prototype-based feature selection.

DetailsMotivation: Multi-view multi-label learning suffers from simultaneous feature absence and incomplete annotations due to data acquisition challenges and expensive supervision costs. Existing methods have limitations in feature recovery, representation disentanglement, and label semantics modeling.

Method: ADRL uses adaptive disentangled representation learning with: 1) robust view completion via feature-level affinity propagation across modalities with neighborhood awareness, 2) stochastic masking for reconstruction, 3) category-level association dissemination for label prototypes, 4) mutual-information-based objective for representation consistency, 5) prototype-specific feature selection via label-view interactions, and 6) pseudo-label generation with discriminative view fusion.

Result: Extensive experiments on public datasets and real-world applications demonstrate ADRL’s superior performance compared to existing methods.

Conclusion: ADRL effectively addresses the practical challenges of multi-view multi-label learning with incomplete features and annotations through its comprehensive disentangled representation learning framework, achieving robust performance across various applications.

Abstract: Multi-view multi-label learning frequently suffers from simultaneous feature absence and incomplete annotations, due to challenges in data acquisition and cost-intensive supervision. To tackle the complex yet highly practical problem while overcoming the existing limitations of feature recovery, representation disentanglement, and label semantics modeling, we propose an Adaptive Disentangled Representation Learning method (ADRL). ADRL achieves robust view completion by propagating feature-level affinity across modalities with neighborhood awareness, and reinforces reconstruction effectiveness by leveraging a stochastic masking strategy. Through disseminating category-level association across label distributions, ADRL refines distribution parameters for capturing interdependent label prototypes. Besides, we formulate a mutual-information-based objective to promote consistency among shared representations and suppress information overlap between view-specific representation and other modalities. Theoretically, we derive the tractable bounds to train the dual-channel network. Moreover, ADRL performs prototype-specific feature selection by enabling independent interactions between label embeddings and view representations, accompanied by the generation of pseudo-labels for each category. The structural characteristics of the pseudo-label space are then exploited to guide a discriminative trade-off during view fusion. Finally, extensive experiments on public datasets and real-world applications demonstrate the superior performance of ADRL.

[142] SceneFoundry: Generating Interactive Infinite 3D Worlds

ChunTeng Chen, YiChen Hsu, YiWen Liu, WeiFang Sun, TsaiChing Ni, ChunYi Lee, Min Sun, YuanFu Yang

Main category: cs.CV

TL;DR: SceneFoundry: A language-guided diffusion framework for generating apartment-scale 3D environments with articulated furniture for robotic training.

DetailsMotivation: Existing generative approaches fail to capture functional complexity of real-world interiors, particularly articulated objects with movable parts essential for robotic manipulation and navigation.

Method: Uses LLM for floor layout generation from natural language prompts, diffusion-based posterior sampling to populate scenes with articulated assets from 3D repositories, and differentiable guidance functions for physical usability.

Result: Generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions.

Conclusion: Enables scalable embodied AI research by producing physically realistic 3D environments with articulated furniture for robotic training.

Abstract: The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research.

[143] Boosting Latent Diffusion Models via Disentangled Representation Alignment

John Page, Xuesong Niu, Kai Wu, Kun Gai

Main category: cs.CV

TL;DR: Send-VAE improves image generation by optimizing VAEs for semantic disentanglement rather than just semantic alignment, achieving better attribute-level representation and faster training.

DetailsMotivation: Current approaches use the same alignment targets for both VAEs and LDMs, ignoring their different representational needs. VAEs should focus on semantic disentanglement for attribute-level information, while LDMs need high-level semantic concepts.

Method: Proposes Semantic disentangled VAE (Send-VAE) that aligns VAE latent space with semantic hierarchy of pre-trained VFMs using a non-linear mapper network, enabling attribute-level disentanglement while bridging to high-level semantics.
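
A minimal sketch of the alignment objective: a small non-linear mapper projects VAE latents into the VFM feature space, and a cosine loss pulls them together. The mapper's width/depth and the cosine choice are assumptions; the summary specifies only a non-linear mapper network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMapper(nn.Module):
    """Non-linear mapper from VAE latents to the VFM feature space, used
    during training to compute the alignment loss."""

    def __init__(self, latent_dim, vfm_dim, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.SiLU(),
                                 nn.Linear(hidden, vfm_dim))

    def forward(self, z):
        return self.net(z)

def alignment_loss(mapper, vae_latents, vfm_features):
    # Negative cosine similarity between mapped latents and frozen VFM
    # features (a common alignment objective; the exact metric is assumed).
    mapped = F.normalize(mapper(vae_latents), dim=-1)
    target = F.normalize(vfm_features.detach(), dim=-1)
    return -(mapped * target).sum(dim=-1).mean()

mapper = LatentMapper(latent_dim=16, vfm_dim=768)
loss = alignment_loss(mapper, torch.randn(8, 16), torch.randn(8, 768))
```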

Result: Send-VAE shows strong correlation between semantic disentanglement and improved generation performance. When used to train SiTs, it speeds up training and achieves SOTA FID of 1.21/1.75 on ImageNet 256x256 with/without classifier-free guidance.

Conclusion: Optimizing VAEs specifically for semantic disentanglement rather than generic semantic alignment leads to better attribute-level representation, faster training, and improved image generation quality.

Abstract: Latent Diffusion Models (LDMs) generate high-quality images by operating in a compressed latent space, typically obtained through image tokenizers such as Variational Autoencoders (VAEs). In pursuit of a generation-friendly VAE, recent studies have explored leveraging Vision Foundation Models (VFMs) as representation alignment targets for VAEs, mirroring the approach commonly adopted for LDMs. Although this yields certain performance gains, using the same alignment target for both VAEs and LDMs overlooks their fundamentally different representational requirements. We advocate that while LDMs benefit from latents retaining high-level semantic concepts, VAEs should excel in semantic disentanglement, enabling encoding of attribute-level information in a structured way. To address this, we propose the Semantic disentangled VAE (Send-VAE), explicitly optimized for disentangled representation learning through aligning its latent space with the semantic hierarchy of pre-trained VFMs. Our approach employs a non-linear mapper network to transform VAE latents, aligning them with VFMs to bridge the gap between attribute-level disentanglement and high-level semantics, facilitating effective guidance for VAE learning. We evaluate semantic disentanglement via linear probing on attribute prediction tasks, showing strong correlation with improved generation performance. Finally, using Send-VAE, we train flow-based transformers SiTs; experiments show Send-VAE significantly speeds up training and achieves a state-of-the-art FID of 1.21 and 1.75 with and without classifier-free guidance on ImageNet 256x256.

[144] GeoSurDepth: Spatial Geometry-Consistent Self-Supervised Depth Estimation for Surround-View Cameras

Weimin Liu, Wenjun Wang, Joshua H. Meng

Main category: cs.CV

TL;DR: GeoSurDepth is a self-supervised surround-view depth estimation framework that leverages geometry consistency as the primary cue, using foundation models as pseudo geometry priors and introducing novel view synthesis with adaptive joint motion learning.

DetailsMotivation: Prior surround-view depth estimation methods focus mainly on photometric constraints but fail to explicitly exploit rich geometric structure inherent in both monocular and surround-view settings. There's a need for better geometry consistency exploitation for robust 3D scene understanding in autonomous driving.

Method: 1) Uses foundation models as pseudo geometry priors to maintain surface normal consistency in 3D space and regularize object/texture-consistent depth in 2D. 2) Introduces novel view synthesis pipeline with 2D-3D lifting via spatial warping for photometric supervision across temporal, spatial, and spatial-temporal contexts. 3) Proposes adaptive joint motion learning strategy to emphasize informative spatial geometry cues for improved motion reasoning.
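
One way to enforce surface-normal consistency is to derive normals from the predicted depth and compare them against the foundation-model prior. The finite-difference construction below is a standard derivation assumed for illustration, not code from the paper:

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor, fx: float, fy: float) -> torch.Tensor:
    """Per-pixel surface normals from a depth map via finite differences.

    depth: (B, 1, H, W) metric depth; fx, fy: focal lengths in pixels."""
    dz_dx = depth[..., :, 1:] - depth[..., :, :-1]   # horizontal depth gradient
    dz_dy = depth[..., 1:, :] - depth[..., :-1, :]   # vertical depth gradient
    dz_dx = F.pad(dz_dx, (0, 1))                     # restore width
    dz_dy = F.pad(dz_dy, (0, 0, 0, 1))               # restore height
    normals = torch.cat([-dz_dx * fx, -dz_dy * fy, depth], dim=1)
    return F.normalize(normals, dim=1)               # (B, 3, H, W) unit normals

# Consistency with a frozen prior `prior_normals` (B, 3, H, W) could then be
# penalized as: (1 - F.cosine_similarity(normals, prior_normals, dim=1)).mean()
depth = torch.rand(2, 1, 96, 320) + 1.0
normals = normals_from_depth(depth, fx=500.0, fy=500.0)
```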

Result: Extensive experiments on DDAD and nuScenes datasets demonstrate state-of-the-art performance, validating the effectiveness of the geometry consistency approach for surround-view depth estimation.

Conclusion: The framework highlights the importance of exploiting geometry coherence and consistency for robust self-supervised multi-view depth estimation, providing a competitive alternative to laser-based sensors for autonomous driving applications.

Abstract: Accurate surround-view depth estimation provides a competitive alternative to laser-based sensors and is essential for 3D scene understanding in autonomous driving. While prior studies have proposed various approaches that primarily focus on enforcing cross-view constraints at the photometric level, few explicitly exploit the rich geometric structure inherent in both monocular and surround-view settings. In this work, we propose GeoSurDepth, a framework that leverages geometry consistency as the primary cue for surround-view depth estimation. Concretely, we utilize foundation models as a pseudo geometry prior and feature representation enhancement tool to guide the network to maintain surface normal consistency in spatial 3D space and regularize object- and texture-consistent depth estimation in 2D. In addition, we introduce a novel view synthesis pipeline where 2D-3D lifting is achieved with dense depth reconstructed via spatial warping, encouraging additional photometric supervision across temporal, spatial, and spatial-temporal contexts, and compensating for the limitations of single-view image reconstruction. Finally, a newly proposed adaptive joint motion learning strategy enables the network to adaptively emphasize informative spatial geometry cues for improved motion reasoning. Extensive experiments on DDAD and nuScenes demonstrate that GeoSurDepth achieves state-of-the-art performance, validating the effectiveness of our approach. Our framework highlights the importance of exploiting geometry coherence and consistency for robust self-supervised multi-view depth estimation.

[145] Goal Force: Teaching Video Models To Accomplish Physics-Conditioned Goals

Nate Gillman, Yinghua Zhou, Zitian Tang, Evan Luo, Arjan Chakravarthy, Daksh Aggarwal, Michael Freeman, Charles Herrmann, Chen Sun

Main category: cs.CV

TL;DR: Goal Force: A framework for specifying goals in video generation models using force vectors and intermediate dynamics, enabling physics-aware planning without external simulators.

DetailsMotivation: Current video generation world models struggle with goal specification - text is too abstract for physical nuances, while target images are infeasible for dynamic tasks. There's a need for more precise, physics-aware goal specification methods.

Method: Train a video generation model on synthetic causal primitives (elastic collisions, falling dominos) to learn force propagation through time and space. Users define goals via explicit force vectors and intermediate dynamics, mimicking human physical task conceptualization.

Result: The model shows remarkable zero-shot generalization to complex real-world scenarios (tool manipulation, multi-object causal chains) despite training only on simple physics data. It emerges as an implicit neural physics simulator.

Conclusion: Grounding video generation in fundamental physical interactions enables precise, physics-aware planning without external engines. The approach bridges the gap between abstract goal specification and physical reality in world models.

Abstract: Recent advancements in video generation have enabled the development of "world models" capable of simulating potential futures for robotics and planning. However, specifying precise goals for these models remains a challenge; text instructions are often too abstract to capture physical nuances, while target images are frequently infeasible to specify for dynamic tasks. To address this, we introduce Goal Force, a novel framework that allows users to define goals via explicit force vectors and intermediate dynamics, mirroring how humans conceptualize physical tasks. We train a video generation model on a curated dataset of synthetic causal primitives, such as elastic collisions and falling dominos, teaching it to propagate forces through time and space. Despite being trained on simple physics data, our model exhibits remarkable zero-shot generalization to complex, real-world scenarios, including tool manipulation and multi-object causal chains. Our results suggest that by grounding video generation in fundamental physical interactions, models can emerge as implicit neural physics simulators, enabling precise, physics-aware planning without reliance on external engines. We release all datasets, code, model weights, and interactive video demos at our project page.

[146] Kidney Cancer Detection Using 3D-Based Latent Diffusion Models

Jen Dusseljee, Sarah de Boer, Alessa Hering

Main category: cs.CV

TL;DR: Novel 3D latent diffusion pipeline for kidney anomaly detection on abdominal CT using DDPMs, DDIMs, and VQ-GANs with weak supervision from case-level pseudo-labels.

DetailsMotivation: To develop an annotation-efficient approach for 3D kidney anomaly detection that doesn't require pixel-level annotations, addressing the limitations of prior slice-wise methods and reducing annotation burden.

Method: Combines Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), and Vector-Quantized Generative Adversarial Networks (VQ-GANs) in a 3D latent diffusion pipeline that operates directly on image volumes using only case-level pseudo-labels for weak supervision.

Result: Demonstrates feasibility of 3D latent diffusion for weakly supervised anomaly detection, though current results don’t yet match supervised baselines. Provides key insights for improving reconstruction fidelity and lesion localization.

Conclusion: The approach represents an important step toward annotation-efficient generative modeling of complex abdominal anatomy, with promising directions identified for future improvements in reconstruction quality and localization accuracy.

Abstract: In this work, we present a novel latent diffusion-based pipeline for 3D kidney anomaly detection on contrast-enhanced abdominal CT. The method combines Denoising Diffusion Probabilistic Models (DDPMs), Denoising Diffusion Implicit Models (DDIMs), and Vector-Quantized Generative Adversarial Networks (VQ-GANs). Unlike prior slice-wise approaches, our method operates directly on an image volume and leverages weak supervision with only case-level pseudo-labels. We benchmark our approach against state-of-the-art supervised segmentation and detection models. This study demonstrates the feasibility and promise of 3D latent diffusion for weakly supervised anomaly detection. While the current results do not yet match supervised baselines, they reveal key directions for improving reconstruction fidelity and lesion localization. Our findings provide an important step toward annotation-efficient generative modeling of complex abdominal anatomy.

[147] LayerGS: Decomposition and Inpainting of Layered 3D Human Avatars via 2D Gaussian Splatting

Yinghan Xu, John Dingliana

Main category: cs.CV

TL;DR: A novel framework decomposes posed humans into animatable multi-layered 3D avatars (body + garments) using 2D Gaussians and diffusion-based inpainting, enabling realistic virtual try-on.

DetailsMotivation: Existing single-layer reconstruction methods lock clothing to specific identities, while prior multi-layer approaches struggle with occluded regions, limiting realistic virtual try-on and 3D human asset creation.

Method: Uses 2D Gaussians to encode each layer for geometry/rendering, inpaints hidden regions with pretrained 2D diffusion via SDS, and employs three-stage training: coarse canonical garment reconstruction, then multi-layer training for body and garment details.

Result: Achieves better rendering quality and layer decomposition/recomposition than SOTA on 4D-Dress and Thuman2.0 datasets, enabling realistic virtual try-on under novel viewpoints and poses.

Conclusion: The approach advances practical creation of high-fidelity 3D human assets for immersive applications by overcoming limitations of previous single-layer and multi-layer reconstruction methods.

Abstract: We propose a novel framework for decomposing arbitrarily posed humans into animatable multi-layered 3D human avatars, separating the body and garments. Conventional single-layer reconstruction methods lock clothing to one identity, while prior multi-layer approaches struggle with occluded regions. We overcome both limitations by encoding each layer as a set of 2D Gaussians for accurate geometry and photorealistic rendering, and inpainting hidden regions with a pretrained 2D diffusion model via score-distillation sampling (SDS). Our three-stage training strategy first reconstructs the coarse canonical garment via single-layer reconstruction, followed by multi-layer training to jointly recover the inner-layer body and outer-layer garment details. Experiments on two 3D human benchmark datasets (4D-Dress, Thuman2.0) show that our approach achieves better rendering quality and layer decomposition and recomposition than the previous state-of-the-art, enabling realistic virtual try-on under novel viewpoints and poses, and advancing practical creation of high-fidelity 3D human assets for immersive applications. Our code is available at https://github.com/RockyXu66/LayerGS

[148] Bidirectional Channel-selective Semantic Interaction for Semi-Supervised Medical Segmentation

Kaiwen Huang, Yizhe Zhang, Yi Zhou, Tianyang Xu, Tao Zhou

Main category: cs.CV

TL;DR: BCSI framework improves semi-supervised medical image segmentation through semantic-spatial perturbation, channel-selective routing, and bidirectional channel-wise interaction to address error accumulation and limited labeled-unlabeled data interaction.

DetailsMotivation: Existing semi-supervised medical image segmentation methods (mean teacher, dual-stream consistency) suffer from error accumulation, model complexity, and neglect interaction between labeled and unlabeled data streams.

Method: Proposes BCSI framework with: 1) Semantic-Spatial Perturbation (SSP) using strong augmentations with pseudo-labels from weak augmentations and consistency regularization; 2) Channel-selective Router (CR) to dynamically select relevant channels for information exchange; 3) Bidirectional Channel-wise Interaction (BCI) to supplement semantic information and enhance important channels.
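
As a rough illustration of the routing idea, a top-k sigmoid gate over channel statistics could decide which channels carry information between the labeled and unlabeled streams. The sketch below assumes this simple design; the paper's actual CR module may differ.

```python
# Minimal sketch of a channel-selective router: score channels, keep the
# top-k most relevant ones, and exchange only those between the labeled
# and unlabeled feature streams. Assumed design, not the paper's exact one.
import torch
import torch.nn as nn

class ChannelSelectiveRouter(nn.Module):
    def __init__(self, channels, k):
        super().__init__()
        self.score = nn.Linear(channels, channels)  # relevance scorer
        self.k = k

    def forward(self, f_labeled, f_unlabeled):
        # Global average pool -> per-channel relevance scores in [0, 1]
        ctx = f_unlabeled.mean(dim=(2, 3))               # (B, C)
        scores = torch.sigmoid(self.score(ctx))
        topk = scores.topk(self.k, dim=1).indices
        mask = torch.zeros_like(scores).scatter_(1, topk, 1.0)
        gate = (mask * scores)[:, :, None, None]
        # Only the selected channels carry information across streams
        return f_labeled + gate * f_unlabeled

router = ChannelSelectiveRouter(channels=64, k=16)
out = router(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```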

Result: Experimental results on multiple benchmark 3D medical datasets demonstrate superior performance compared to existing semi-supervised approaches.

Conclusion: The BCSI framework effectively addresses limitations of existing methods by improving feature interaction between labeled and unlabeled data, reducing noise, and enhancing model stability for semi-supervised medical image segmentation.

Abstract: Semi-supervised medical image segmentation is an effective method for addressing scenarios with limited labeled data. Existing methods mainly rely on frameworks such as mean teacher and dual-stream consistency learning. These approaches often face issues like error accumulation and model structural complexity, while also neglecting the interaction between labeled and unlabeled data streams. To overcome these challenges, we propose a Bidirectional Channel-selective Semantic Interaction (BCSI) framework for semi-supervised medical image segmentation. First, we propose a Semantic-Spatial Perturbation (SSP) mechanism, which disturbs the data using two strong augmentation operations and leverages unsupervised learning with pseudo-labels from weak augmentations. Additionally, we employ consistency on the predictions from the two strong augmentations to further improve model stability and robustness. Second, to reduce noise during the interaction between labeled and unlabeled data, we propose a Channel-selective Router (CR) component, which dynamically selects the most relevant channels for information exchange. This mechanism ensures that only highly relevant features are activated, minimizing unnecessary interference. Finally, the Bidirectional Channel-wise Interaction (BCI) strategy is employed to supplement additional semantic information and enhance the representation of important channels. Experimental results on multiple benchmark 3D medical datasets demonstrate that the proposed method outperforms existing semi-supervised approaches.

[149] Phase4DFD: Multi-Domain Phase-Aware Attention for Deepfake Detection

Zhen-Xin Lin, Shang-Kuan Chen

Main category: cs.CV

TL;DR: Phase4DFD: A deepfake detection framework that explicitly models phase information in frequency domain using learnable attention, outperforming state-of-the-art methods while maintaining low computational cost.

DetailsMotivation: Most existing deepfake detection methods focus on spatial domain or frequency magnitude only, implicitly under-exploring phase information which contains crucial manipulation artifacts from synthetic generation processes.

Method: Proposes Phase4DFD framework that augments RGB input with FFT magnitude and LBP representations, and introduces a phase-aware attention module that uses phase discontinuities to guide attention toward manipulation-indicative frequency patterns before backbone feature extraction.
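
The input construction is concrete enough to sketch: stack the grayscale image with its log FFT magnitude, FFT phase, and LBP map as extra channels. The channel layout and min-max normalization below are illustrative assumptions, not the paper's exact preprocessing.

```python
# Sketch of the multi-domain input described above: grayscale image plus
# FFT log-magnitude, FFT phase, and LBP channels.
import numpy as np
from skimage.feature import local_binary_pattern

def multi_domain_input(gray):            # gray: (H, W) float in [0, 1]
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    magnitude = np.log1p(np.abs(spectrum))             # log-magnitude
    phase = np.angle(spectrum)                         # phase in [-pi, pi]
    lbp = local_binary_pattern(gray, P=8, R=1.0, method="uniform")
    def norm(x):                                       # min-max per channel
        return (x - x.min()) / (x.max() - x.min() + 1e-8)
    return np.stack([gray, norm(magnitude), norm(phase), norm(lbp)], axis=0)

x = multi_domain_input(np.random.rand(256, 256))
print(x.shape)  # (4, 256, 256)
```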

Result: Outperforms state-of-the-art spatial and frequency-based detectors on CIFAKE and DFFD datasets while maintaining low computational overhead. Ablation studies confirm phase modeling provides complementary, non-redundant information beyond magnitude-only representations.

Conclusion: Explicit phase modeling in frequency domain deepfake detection is crucial and provides significant performance improvements, demonstrating that phase information contains unique manipulation artifacts not captured by magnitude-only approaches.

Abstract: Recent deepfake detection methods have increasingly explored frequency domain representations to reveal manipulation artifacts that are difficult to detect in the spatial domain. However, most existing approaches rely primarily on spectral magnitude, implicitly under-exploring the role of phase information. In this work, we propose Phase4DFD, a phase-aware frequency-domain deepfake detection framework that explicitly models phase-magnitude interactions via a learnable attention mechanism. Our approach augments standard RGB input with Fast Fourier Transform (FFT) magnitude and local binary pattern (LBP) representations to expose subtle synthesis artifacts that remain indistinguishable under spatial analysis alone. Crucially, we introduce an input-level phase-aware attention module that uses phase discontinuities commonly introduced by synthetic generation to guide the model toward frequency patterns that are most indicative of manipulation before backbone feature extraction. The attended multi-domain representation is processed by an efficient BNext-M backbone, with optional channel-spatial attention applied for semantic feature refinement. Extensive experiments on the CIFAKE and DFFD datasets demonstrate that our proposed model Phase4DFD outperforms state-of-the-art spatial and frequency-based detectors while maintaining low computational overhead. Comprehensive ablation studies further confirm that explicit phase modeling provides complementary and non-redundant information beyond magnitude-only frequency representations.

[150] Adapting Vision Transformers to Ultra-High Resolution Semantic Segmentation with Relay Tokens

Yohann Perron, Vladyslav Sydorov, Christophe Pottier, Loic Landrieu

Main category: cs.CV

TL;DR: A novel vision transformer method using relay tokens to process images at both local (high-res) and global (low-res) scales simultaneously, preserving fine details while maintaining global context for ultra high-resolution segmentation.

DetailsMotivation: Current methods for ultra high-resolution image segmentation either use sliding windows (losing global context) or downsampling (losing fine details). There's a need for an approach that preserves both local details and global awareness.

Method: Processes images in parallel at local scale (high resolution, small crops) and global scale (low resolution, large crops). Uses a small set of learnable relay tokens to aggregate and propagate features between the two branches. Plugs directly into standard transformer backbones like ViT and Swin with minimal parameter overhead.
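
A minimal sketch of the bridging idea follows, assuming relay tokens are learnable queries that cross-attend to one branch and are appended to the other; the paper's actual propagation scheme between the two branches may differ.

```python
# Toy sketch of relay-token aggregation between a local (high-res crops)
# and a global (low-res) branch.
import torch
import torch.nn as nn

class RelayBridge(nn.Module):
    def __init__(self, dim, n_relay=8, n_heads=4):
        super().__init__()
        self.relay = nn.Parameter(torch.randn(1, n_relay, dim) * 0.02)
        self.read = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, local_tokens, global_tokens):
        b = local_tokens.shape[0]
        relay = self.relay.expand(b, -1, -1)
        # Relay tokens gather global context via cross-attention...
        relay, _ = self.read(relay, global_tokens, global_tokens)
        # ...and are concatenated to the local stream for the next block
        return torch.cat([local_tokens, relay], dim=1)

bridge = RelayBridge(dim=192)
out = bridge(torch.randn(2, 196, 192), torch.randn(2, 49, 192))
print(out.shape)  # torch.Size([2, 204, 192])
```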

Result: Achieves consistent gains across three ultra high-resolution segmentation benchmarks (Archaeoscape, URUR, Gleason) and Cityscapes, with up to 15% relative mIoU improvement. Adds fewer than 2% parameters to standard backbones.

Conclusion: The proposed relay token method effectively enables multi-scale reasoning in vision transformers, simultaneously preserving local details and global context for superior ultra high-resolution segmentation performance with minimal computational overhead.

Abstract: Current approaches for segmenting ultra-high-resolution images either slide a window, thereby discarding global context, or downsample and lose fine detail. We propose a simple yet effective method that brings explicit multi-scale reasoning to vision transformers, simultaneously preserving local details and global awareness. Concretely, we process each image in parallel at a local scale (high resolution, small crops) and a global scale (low resolution, large crops), and aggregate and propagate features between the two branches with a small set of learnable relay tokens. The design plugs directly into standard transformer backbones (e.g., ViT and Swin) and adds fewer than 2% parameters. Extensive experiments on three ultra-high-resolution segmentation benchmarks, Archaeoscape, URUR, and Gleason, and on the conventional Cityscapes dataset show consistent gains, with up to 15% relative mIoU improvement. Code and pretrained models are available at https://archaeoscape.ai/work/relay-tokens/.

[151] Performance of a Deep Learning-Based Segmentation Model for Pancreatic Tumors on Public Endoscopic Ultrasound Datasets

Pankaj Gupta, Priya Mudgil, Niharika Dutta, Kartik Bose, Nitish Kumar, Anupam Kumar, Jimil Shah, Vaneet Jearth, Jayanta Samanta, Vishal Sharma, Harshal Mandavdhare, Surinder Rana, Saroj K Sinha, Usha Dutta

Main category: cs.CV

TL;DR: Vision Transformer-based deep learning model achieves promising performance for pancreatic tumor segmentation in EUS images, with mean DSC of 0.651 in cross-validation and 0.657 in external validation, though some limitations in dataset heterogeneity and prediction errors exist.

DetailsMotivation: Pancreatic cancer has poor survival rates, and while endoscopic ultrasound (EUS) is a key diagnostic tool, its effectiveness is limited by operator subjectivity. There's a need for automated, objective segmentation methods to improve pancreatic tumor detection and characterization.

Method: Used a Vision Transformer-based deep learning segmentation model within the USFM framework. Trained and validated on 17,367 EUS images from two public datasets using 5-fold cross-validation. Tested on independent dataset of 350 EUS images from another public dataset. Preprocessing included grayscale conversion, cropping, and resizing to 512x512 pixels. Evaluated using Dice similarity coefficient (DSC), intersection over union (IoU), sensitivity, specificity, and accuracy.
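
The stated preprocessing and the headline metric are simple enough to sketch; the border-crop margin below is a placeholder, since the summary does not state the exact crop the authors applied.

```python
# Sketch of the stated preprocessing (grayscale, crop, resize to 512x512)
# and the Dice similarity coefficient used for evaluation.
import numpy as np
import cv2

def preprocess(img_bgr, crop=16):
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    gray = gray[crop:-crop, crop:-crop]            # hypothetical border crop
    return cv2.resize(gray, (512, 512), interpolation=cv2.INTER_AREA)

def dice(pred, target, eps=1e-8):
    """DSC = 2 * |P & T| / (|P| + |T|) on binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    return 2.0 * (pred & target).sum() / (pred.sum() + target.sum() + eps)

img = (np.random.rand(600, 800, 3) * 255).astype(np.uint8)
mask = np.zeros((512, 512), dtype=np.uint8); mask[100:200, 100:200] = 1
print(preprocess(img).shape, dice(mask, mask))  # (512, 512) ~1.0
```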

Result: In 5-fold cross-validation: mean DSC 0.651 ± 0.738, IoU 0.579 ± 0.658, sensitivity 69.8%, specificity 98.8%, accuracy 97.5%. External validation: DSC 0.657 (95% CI: 0.634-0.769), IoU 0.614 (95% CI: 0.590-0.689), sensitivity 71.8%, specificity 97.7%. Results were consistent but 9.7% of cases showed erroneous multiple predictions.

Conclusion: The Vision Transformer-based model demonstrated strong performance for pancreatic tumor segmentation in EUS images, showing potential for clinical application. However, dataset heterogeneity, limited external validation, and prediction errors (9.7% with multiple erroneous predictions) highlight the need for further refinement, standardization, and prospective studies.

Abstract: Background: Pancreatic cancer is one of the most aggressive cancers, with poor survival rates. Endoscopic ultrasound (EUS) is a key diagnostic modality, but its effectiveness is constrained by operator subjectivity. This study evaluates a Vision Transformer-based deep learning segmentation model for pancreatic tumors. Methods: A segmentation model using the USFM framework with a Vision Transformer backbone was trained and validated with 17,367 EUS images (from two public datasets) in 5-fold cross-validation. The model was tested on an independent dataset of 350 EUS images from another public dataset, manually segmented by radiologists. Preprocessing included grayscale conversion, cropping, and resizing to 512x512 pixels. Metrics included Dice similarity coefficient (DSC), intersection over union (IoU), sensitivity, specificity, and accuracy. Results: In 5-fold cross-validation, the model achieved a mean DSC of 0.651 ± 0.738, IoU of 0.579 ± 0.658, sensitivity of 69.8%, specificity of 98.8%, and accuracy of 97.5%. For the external validation set, the model achieved a DSC of 0.657 (95% CI: 0.634-0.769), IoU of 0.614 (95% CI: 0.590-0.689), sensitivity of 71.8%, and specificity of 97.7%. Results were consistent, but 9.7% of cases exhibited erroneous multiple predictions. Conclusions: The Vision Transformer-based model demonstrated strong performance for pancreatic tumor segmentation in EUS images. However, dataset heterogeneity and limited external validation highlight the need for further refinement, standardization, and prospective studies.

[152] Context-Aware Decoding for Faithful Vision-Language Generation

Mehrdad Fazli, Bowen Wei, Ziwei Zhu

Main category: cs.CV

TL;DR: The paper introduces Context Embedding Injection (CEI), a training-free method to reduce hallucinations in large vision-language models by using the last input token’s hidden state as a grounding signal during decoding.

DetailsMotivation: Hallucinations (responses inconsistent with visual input) remain a critical limitation of LVLMs in open-ended tasks like image captioning and visual reasoning, requiring effective mitigation strategies.

Method: 1) Analyzes layer-wise generation dynamics using Logit Lens to discover “commitment-depth gap” - truthful tokens accumulate probability earlier than hallucinatory ones. 2) Proposes Context Embedding Injection (CEI) - uses the hidden state of the last input token (context embedding) as a grounding signal to maintain visual fidelity during decoding.
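
A heavily simplified sketch of the injection idea: blend the last input token's hidden state into each decoding step's hidden state. The additive rule and the fixed `alpha` are assumptions; the paper's dynamic variant adapts the strength of this signal.

```python
# Toy illustration of context-embedding injection during decoding.
# `step_fn` stands in for a decoder layer stack; `h_last_input` is the
# grounding context embedding (hidden state of the last input token).
import torch

def decode_with_cei(step_fn, h_last_input, h0, steps=16, alpha=0.1):
    h = h0
    outputs = []
    for _ in range(steps):
        h = step_fn(h)
        h = h + alpha * h_last_input      # inject the grounding signal
        outputs.append(h)
    return torch.stack(outputs)

d = 64
proj = torch.nn.Linear(d, d)              # toy stand-in for a decoder step
out = decode_with_cei(proj, torch.randn(1, d), torch.randn(1, d))
print(out.shape)  # torch.Size([16, 1, 64])
```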

Result: CEI outperforms state-of-the-art baselines on CHAIR, AMBER, and MMHal-Bench benchmarks (max token length 512) across three LVLMs, with its dynamic variant achieving the lowest overall hallucination rates.

Conclusion: The work advances hallucination mitigation in LVLMs by integrating novel mechanistic insights about generation dynamics with a scalable, training-free intervention (CEI) that effectively reduces hallucinations.

Abstract: Hallucinations, generating responses inconsistent with the visual input, remain a critical limitation of large vision-language models (LVLMs), especially in open-ended tasks such as image captioning and visual reasoning. In this work, we probe the layer-wise generation dynamics that drive hallucinations and propose a training-free mitigation strategy. Employing the Logit Lens, we examine how LVLMs construct next-token distributions across decoder layers, uncovering a pronounced commitment-depth gap: truthful tokens accumulate probability mass on their final candidates earlier than hallucinatory ones. Drawing on this discovery, we introduce Context Embedding Injection (CEI), a lightweight method that harnesses the hidden state of the last input token (the context embedding) as a grounding signal to maintain visual fidelity throughout decoding and curb hallucinations. Evaluated on the CHAIR, AMBER, and MMHal-Bench benchmarks (with a maximum token length of 512), CEI outperforms state-of-the-art baselines across three LVLMs, with its dynamic variant yielding the lowest overall hallucination rates. By integrating novel mechanistic insights with a scalable intervention, this work advances the mitigation of hallucinations in LVLMs.

[153] WaveRNet: Wavelet-Guided Frequency Learning for Multi-Source Domain-Generalized Retinal Vessel Segmentation

Chanchan Wang, Yuanfang Wang, Qing Xu, Guanxin Chen

Main category: cs.CV

TL;DR: WaveRNet: A wavelet-guided frequency learning framework for domain-generalized retinal vessel segmentation that addresses illumination/contrast variations and preserves fine vessel details through spectral domain modulation, frequency-adaptive fusion, and hierarchical refinement.

DetailsMotivation: Domain shift in retinal vessel segmentation caused by non-uniform illumination and varying contrast degrades generalization. Existing SAM-based methods overlook frequency-domain information and lose fine vessel details through direct upsampling.

Method: 1) Spectral-guided Domain Modulator (SDM) integrates wavelet decomposition with learnable domain tokens to separate illumination-robust low-frequency structures from high-frequency vessel boundaries. 2) Frequency-Adaptive Domain Fusion (FADF) performs test-time domain selection via wavelet-based frequency similarity and soft-weighted fusion. 3) Hierarchical Mask-Prompt Refiner (HMPR) enables coarse-to-fine refinement with long-range dependency modeling to overcome SAM’s upsampling limitations.
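
The frequency separation at the core of SDM can be illustrated with an off-the-shelf wavelet decomposition; the 'haar' basis and the energy-based signature below are illustrative stand-ins, not the paper's exact design.

```python
# A single-level 2D wavelet decomposition splits an image into an
# illumination-robust low-frequency band (LL) and high-frequency detail
# bands (LH/HL/HH) that carry vessel boundaries.
import numpy as np
import pywt

img = np.random.rand(256, 256)
LL, (LH, HL, HH) = pywt.dwt2(img, "haar")
print(LL.shape, LH.shape)   # (128, 128) each: half-resolution subbands

# A wavelet-based frequency signature of the kind FADF could use for
# test-time domain selection: relative energy per subband.
energy = np.array([np.sum(b ** 2) for b in (LL, LH, HL, HH)])
signature = energy / energy.sum()
print(signature)            # LL typically dominates for natural images
```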

Result: Extensive experiments under Leave-One-Domain-Out protocol on four public retinal datasets demonstrate state-of-the-art generalization performance.

Conclusion: WaveRNet effectively addresses domain shift challenges in retinal vessel segmentation by leveraging frequency-domain information and hierarchical refinement, achieving superior generalization across diverse ophthalmic datasets.

Abstract: Domain-generalized retinal vessel segmentation is critical for automated ophthalmic diagnosis, yet faces significant challenges from domain shift induced by non-uniform illumination and varying contrast, compounded by the difficulty of preserving fine vessel structures. While the Segment Anything Model (SAM) exhibits remarkable zero-shot capabilities, existing SAM-based methods rely on simple adapter fine-tuning while overlooking frequency-domain information that encodes domain-invariant features, resulting in degraded generalization under illumination and contrast variations. Furthermore, SAM’s direct upsampling inevitably loses fine vessel details. To address these limitations, we propose WaveRNet, a wavelet-guided frequency learning framework for robust multi-source domain-generalized retinal vessel segmentation. Specifically, we devise a Spectral-guided Domain Modulator (SDM) that integrates wavelet decomposition with learnable domain tokens, enabling the separation of illumination-robust low-frequency structures from high-frequency vessel boundaries while facilitating domain-specific feature generation. Furthermore, we introduce a Frequency-Adaptive Domain Fusion (FADF) module that performs intelligent test-time domain selection through wavelet-based frequency similarity and soft-weighted fusion. Finally, we present a Hierarchical Mask-Prompt Refiner (HMPR) that overcomes SAM’s upsampling limitation through coarse-to-fine refinement with long-range dependency modeling. Extensive experiments under the Leave-One-Domain-Out protocol on four public retinal datasets demonstrate that WaveRNet achieves state-of-the-art generalization performance. The source code is available at https://github.com/Chanchan-Wang/WaveRNet.

[154] VideoAR: Autoregressive Video Generation via Next-Frame & Scale Prediction

Longbin Ji, Xiaoxiong Liu, Junyuan Shang, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang

Main category: cs.CV

TL;DR: VideoAR is the first large-scale Visual Autoregressive framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling, achieving state-of-the-art results while being 10x faster than diffusion models.

DetailsMotivation: Current video generation methods (diffusion and flow-matching models) produce high-quality results but are computationally intensive and difficult to scale. There's a need for more efficient and scalable approaches that maintain quality.

Method: VideoAR integrates intra-frame VAR modeling with causal next-frame prediction using a 3D multi-scale tokenizer. It includes Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask for temporal consistency. Uses multi-stage pretraining pipeline across increasing resolutions and durations.

Result: Achieves new SOTA among autoregressive models: improves FVD on UCF-101 from 99.5 to 88.6, reduces inference steps by over 10x, reaches VBench score of 81.74 (competitive with much larger diffusion models).

Conclusion: VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.

Abstract: Recent advances in video generation have been dominated by diffusion and flow-matching models, which produce high-quality results but remain computationally intensive and difficult to scale. In this work, we introduce VideoAR, the first large-scale Visual Autoregressive (VAR) framework for video generation that combines multi-scale next-frame prediction with autoregressive modeling. VideoAR disentangles spatial and temporal dependencies by integrating intra-frame VAR modeling with causal next-frame prediction, supported by a 3D multi-scale tokenizer that efficiently encodes spatio-temporal dynamics. To improve long-term consistency, we propose Multi-scale Temporal RoPE, Cross-Frame Error Correction, and Random Frame Mask, which collectively mitigate error propagation and stabilize temporal coherence. Our multi-stage pretraining pipeline progressively aligns spatial and temporal learning across increasing resolutions and durations. Empirically, VideoAR achieves new state-of-the-art results among autoregressive models, improving FVD on UCF-101 from 99.5 to 88.6 while reducing inference steps by over 10x, and reaching a VBench score of 81.74, competitive with diffusion-based models an order of magnitude larger. These results demonstrate that VideoAR narrows the performance gap between autoregressive and diffusion paradigms, offering a scalable, efficient, and temporally consistent foundation for future video generation research.

[155] Adaptive Conditional Contrast-Agnostic Deformable Image Registration with Uncertainty Estimation

Yinsong Wang, Xinzhe Luo, Siyi Du, Chen Qin

Main category: cs.CV

TL;DR: AC-CAR is a contrast-agnostic deformable image registration framework that generalizes to unseen imaging contrasts using random convolution-based augmentation and adaptive feature modulation.

DetailsMotivation: Deformable multi-contrast image registration is challenging due to complex intensity relationships across different contrasts. Conventional methods are slow, while learning-based approaches lack generalizability to unseen contrasts.

Method: Proposes AC-CAR with: 1) Random convolution-based contrast augmentation, 2) Adaptive conditional feature modulator (ACFM) for contrast-invariant feature learning, 3) Contrast-invariant latent regularization, and 4) Variance network for uncertainty estimation.
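
Random convolution-based augmentation is a known recipe that can be sketched directly: convolve the image with freshly randomized weights to synthesize a new "contrast" while preserving spatial structure. The kernel size and renormalization below are illustrative choices, not the paper's exact scheme.

```python
# Sketch of random convolution-based contrast augmentation.
import torch
import torch.nn.functional as F

def random_conv_augment(img, kernel_size=3):
    c = img.shape[1]
    weight = torch.randn(c, c, kernel_size, kernel_size)
    weight = weight / weight.abs().sum(dim=(1, 2, 3), keepdim=True)  # stabilize
    out = F.conv2d(img, weight, padding=kernel_size // 2)
    # Re-normalize intensities back to the original range
    out = (out - out.mean()) / (out.std() + 1e-8)
    return out * img.std() + img.mean()

x = torch.randn(2, 1, 128, 128)        # e.g. an MR slice
print(random_conv_augment(x).shape)    # torch.Size([2, 1, 128, 128])
```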

Result: AC-CAR outperforms baseline methods in registration accuracy and shows superior generalization to unseen imaging contrasts, with code publicly available.

Conclusion: AC-CAR provides a robust, contrast-agnostic registration solution with uncertainty estimation, addressing the generalization limitations of existing learning-based registration methods.

Abstract: Deformable multi-contrast image registration is a challenging yet crucial task due to the complex, non-linear intensity relationships across different imaging contrasts. Conventional registration methods typically rely on iterative optimization of the deformation field, which is time-consuming. Although recent learning-based approaches enable fast and accurate registration during inference, their generalizability remains limited to the specific contrasts observed during training. In this work, we propose an adaptive conditional contrast-agnostic deformable image registration framework (AC-CAR) based on a random convolution-based contrast augmentation scheme. AC-CAR can generalize to arbitrary imaging contrasts without observing them during training. To encourage contrast-invariant feature learning, we propose an adaptive conditional feature modulator (ACFM) that adaptively modulates the features and the contrast-invariant latent regularization to enforce the consistency of the learned feature across different imaging contrasts. Additionally, we enable our framework to provide contrast-agnostic registration uncertainty by integrating a variance network that leverages the contrast-agnostic registration encoder to improve the trustworthiness and reliability of AC-CAR. Experimental results demonstrate that AC-CAR outperforms baseline methods in registration accuracy and exhibits superior generalization to unseen imaging contrasts. Code is available at https://github.com/Yinsong0510/AC-CAR.

[156] Deepfake detectors are DUMB: A benchmark to assess adversarial training robustness under transferability constraints

Adrian Serrano, Erwan Umlil, Ronan Thomas

Main category: cs.CV

TL;DR: Deepfake detectors show varying robustness to adversarial attacks under realistic constraints; adversarial training helps in-distribution but can hurt cross-dataset performance, requiring case-aware defenses.

DetailsMotivation: Real-world deepfake detection systems face adversaries with limited knowledge and data mismatches, but current robustness evaluations don't adequately address these realistic constraints.

Method: Extends DUMB/DUMBer methodology to deepfake detection, evaluating 5 detectors (RECCE, SRM, XCeption, UCF, SPSL) against 3 attacks (PGD, FGSM, FPBA) on 2 datasets (FaceForensics++, Celeb-DF-V2) under transferability and cross-dataset constraints.
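
For orientation, the simplest of the three attacks (FGSM) is a single signed-gradient step on the detector's loss, bounded by epsilon; PGD iterates this step with projection, and FPBA operates in the frequency domain. A minimal sketch with a toy detector standing in for the real models:

```python
# Minimal FGSM attack of the kind benchmarked above.
import torch
import torch.nn as nn

def fgsm(model, x, y, eps=8 / 255):
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 2))  # toy detector
x_adv = fgsm(model, torch.rand(4, 3, 32, 32), torch.tensor([0, 1, 0, 1]))
print(x_adv.shape)  # torch.Size([4, 3, 32, 32])
```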

Result: Adversarial training improves robustness for in-distribution cases but can degrade performance in cross-dataset scenarios depending on the specific strategy used.

Conclusion: Real-world deepfake detection requires case-aware defense strategies that consider attacker knowledge and data distribution mismatches, as standard adversarial training may not generalize across different scenarios.

Abstract: Deepfake detection systems deployed in real-world environments are subject to adversaries capable of crafting imperceptible perturbations that degrade model performance. While adversarial training is a widely adopted defense, its effectiveness under realistic conditions, where attackers operate with limited knowledge and mismatched data distributions, remains underexplored. In this work, we extend the DUMB (Dataset soUrces, Model architecture and Balance) and DUMBer methodologies to deepfake detection. We evaluate detector robustness against adversarial attacks under transferability constraints and cross-dataset configurations to extract real-world insights. Our study spans five state-of-the-art detectors (RECCE, SRM, XCeption, UCF, SPSL), three attacks (PGD, FGSM, FPBA), and two datasets (FaceForensics++ and Celeb-DF-V2). We analyze both attacker and defender perspectives, mapping results to mismatch scenarios. Experiments show that adversarial training strategies reinforce robustness in the in-distribution cases but can also degrade it under cross-dataset configurations depending on the strategy adopted. These findings highlight the need for case-aware defense strategies in real-world applications exposed to adversarial attacks.

[157] Adaptive aggregation of Monte Carlo augmented decomposed filters for efficient group-equivariant convolutional neural network

Wenzhao Zhao, Barbara D. Wichtmann, Steffen Albert, Angelika Maurer, Frank G. Zöllner, Jürgen Hesser

Main category: cs.CV

TL;DR: Non-parameter-sharing approach for group-equivariant CNNs using adaptive aggregation of stochastically augmented decomposed filters, achieving equivariance without computational burden of parameter sharing.

DetailsMotivation: Parameter-sharing in G-CNNs increases computational burden for each added parameter, limiting application to deep networks. Need more efficient approach to group equivariance.

Method: Adaptively aggregate diverse filters via weighted sum of stochastically augmented decomposed filters. Uses Monte Carlo sampling for continuous groups and bootstrap resampling for discrete groups. Theoretical proof of group equivariance provided.
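
A toy sketch of the aggregation for a continuous rotation group follows, assuming Monte Carlo sampling of rotations applied to a decomposed basis filter and a learnable weighted sum; the scipy-based rotation is an illustrative stand-in for the paper's augmentation.

```python
# Monte Carlo augmented filter aggregation: sample random rotations of a
# base filter and combine them with learnable weights, instead of sharing
# parameters over a fixed group grid.
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import rotate

base = np.random.randn(5, 5).astype(np.float32)        # decomposed basis filter
angles = np.random.uniform(0, 360, size=8)             # MC-sampled group elements
bank = np.stack([rotate(base, a, reshape=False, order=1) for a in angles])

weights = torch.nn.Parameter(torch.ones(8) / 8)        # learnable aggregation
filt = (weights[:, None, None] * torch.from_numpy(bank)).sum(0)

x = torch.randn(1, 1, 32, 32)
y = F.conv2d(x, filt[None, None], padding=2)           # aggregated filter in use
print(y.shape)  # torch.Size([1, 1, 32, 32])
```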

Result: Outperforms parameter-sharing group equivariant networks and enhances standard CNNs in image classification and denoising tasks. Enables efficient lightweight networks using suitable filter bases.

Conclusion: Proposed non-parameter-sharing approach provides efficient group equivariance for both continuous and discrete groups, serving as effective extension to standard CNNs with better performance and reduced computational burden.

Abstract: Group-equivariant convolutional neural networks (G-CNN) heavily rely on parameter sharing to increase CNN’s data efficiency and performance. However, the parameter-sharing strategy greatly increases the computational burden for each added parameter, which hampers its application to deep neural network models. In this paper, we address these problems by proposing a non-parameter-sharing approach for group equivariant neural networks. The proposed methods adaptively aggregate a diverse range of filters by a weighted sum of stochastically augmented decomposed filters. We provide a theoretical proof of how group equivariance is achieved by our methods. Our method applies to both continuous and discrete groups, where the augmentation is implemented using Monte Carlo sampling and bootstrap resampling, respectively. Our methods also serve as an efficient extension of standard CNNs. The experiments show that our method outperforms parameter-sharing group equivariant networks and enhances the performance of standard CNNs in image classification and denoising tasks when suitable filter bases are used to build efficient lightweight networks. The code will be available at https://github.com/ZhaoWenzhao/MCG_CNN.

[158] Controlled Automatic Task-Specific Synthetic Data Generation for Hallucination Detection

Yong Xie, Karan Aggarwal, Aitzaz Ahmad, Stephen Lau

Main category: cs.CV

TL;DR: Two-step pipeline generates synthetic hallucination datasets using pattern guidance and style alignment, with data mixture training for robust detectors that outperform ICL by 32%.

DetailsMotivation: Need for task-specific synthetic datasets for hallucination detection, as existing methods lack proper alignment with benchmark styles and robust generalization.

Method: Two-step generation-selection pipeline: 1) Hallucination pattern guidance uses task-specific patterns, 2) Language style alignment matches benchmark text style. Plus data mixture strategy for robust training.

Result: Generated hallucination text aligns better with non-hallucinated text than baselines. Detectors trained on synthetic data outperform ICL-based detectors by 32% margin. Cross-task and cross-generator generalization confirmed.

Conclusion: The approach effectively generates high-quality synthetic datasets for hallucination detection, with data mixture training improving generalization and robustness significantly.

Abstract: We present a novel approach to automatically generate non-trivial task-specific synthetic datasets for hallucination detection. Our approach features a two-step generation-selection pipeline, using hallucination pattern guidance and a language style alignment during generation. Hallucination pattern guidance leverages the most important task-specific hallucination patterns while language style alignment aligns the style of the synthetic dataset with benchmark text. To obtain robust supervised detectors from synthetic datasets, we also adopt a data mixture strategy to improve performance robustness and generalization. Our results on three datasets show that our generated hallucination text is more closely aligned with non-hallucinated text than that of baselines, enabling the training of hallucination detectors with better generalization. Our hallucination detectors trained on synthetic datasets outperform in-context-learning (ICL)-based detectors by a large margin of 32%. Our extensive experiments confirm the benefits of our approach with cross-task and cross-generator generalization. Our data-mixture-based training further improves the generalization and robustness of hallucination detection.

[159] AtomThink: Multimodal Slow Thinking with Atomic Step Reasoning

Kun Xiang, Zhili Liu, Terry Jingchen Zhang, Yinya Huang, Yunshuang Nie, Kaixin Cai, Yiyang Yin, Runhui Huang, Hanhui Li, Yihan Zeng, Yu-Jie Yuan, Jianhua Han, Lanqing Hong, Hang Xu, Xiaodan Liang

Main category: cs.CV

TL;DR: AtomThink introduces Self-structured Chain of Thought (SCoT) with minimal semantic atomic steps for multimodal reasoning, enabling adaptive reasoning levels and improving performance on complex tasks while avoiding overthinking on simpler ones.

DetailsMotivation: Current multimodal reasoning methods lack adaptive reasoning capabilities - they either use rigid structured templates or free-form approaches that can lead to overthinking on simple tasks and insufficient reasoning on complex ones. There's a need for a method that can dynamically adjust reasoning depth based on task complexity.

Method: Proposes AtomThink framework with four modules: (1) data engine for generating high-quality multimodal reasoning paths, (2) supervised fine-tuning with serialized inference data, (3) policy-guided multi-turn inference method, and (4) atomic capability metric to evaluate single-step utilization rate. Uses Self-structured Chain of Thought (SCoT) with minimal semantic atomic steps.

Result: Achieves >10% average accuracy gains on MathVista and MathVerse benchmarks. Compared to state-of-the-art structured CoT approaches, achieves higher accuracy while improving data utilization by 5× and boosting inference efficiency by 85.3%.

Conclusion: AtomThink successfully incorporates “slow thinking” into MLLMs through adaptive reasoning with atomic steps, demonstrating significant performance improvements, better data efficiency, and faster inference compared to existing methods.

Abstract: In this paper, we address the challenging task of multimodal reasoning by incorporating the notion of "slow thinking" into multimodal large language models (MLLMs). Our core idea is that models can learn to adaptively use different levels of reasoning to tackle questions of varying complexity. We propose a novel paradigm of Self-structured Chain of Thought (SCoT), which consists of minimal semantic atomic steps. Unlike existing methods that rely on structured templates or free-form paradigms, our method not only generates flexible CoT structures for various complex tasks but also mitigates the phenomenon of overthinking for easier tasks. To introduce structured reasoning into visual cognition, we design a novel AtomThink framework with four key modules: (i) a data engine to generate high-quality multimodal reasoning paths; (ii) a supervised fine-tuning (SFT) process with serialized inference data; (iii) a policy-guided multi-turn inference method; and (iv) an atomic capability metric to evaluate the single-step utilization rate. Extensive experiments demonstrate that the proposed AtomThink significantly improves the performance of baseline MLLMs, achieving more than 10% average accuracy gains on MathVista and MathVerse. Compared to state-of-the-art structured CoT approaches, our method not only achieves higher accuracy but also improves data utilization by 5× and boosts inference efficiency by 85.3%. Our code is publicly available at https://github.com/Kun-Xiang/AtomThink.

[160] Infrared-Assisted Single-Stage Framework for Joint Restoration and Fusion of Visible and Infrared Images under Hazy Conditions

Huafeng Li, Jiaqi Fang, Yafei Zhang, Yu Liu

Main category: cs.CV

TL;DR: Single-stage framework for joint haze removal and fusion of infrared-visible images using prompt generation and infrared-assisted feature restoration.

DetailsMotivation: Existing IR-VIS fusion methods neglect infrared's complementary role in restoring visible features under hazy conditions, and two-stage approaches (dehaze then fuse) are inefficient.

Method: Proposes joint learning framework with: 1) Prompt generation mechanism to handle modality-specific feature incompatibility, 2) Infrared-assisted feature restoration based on haze density, 3) Multi-stage prompt embedding fusion module for feature supplementation.

Result: Method effectively fuses IR-VIS images while removing haze, producing clear fusion results. Outperforms existing methods and is lightweight for practical deployment.

Conclusion: Single-stage collaborative training framework successfully addresses haze removal and fusion simultaneously, offering efficient and effective solution for hazy IR-VIS image processing.

Abstract: Infrared and visible (IR-VIS) image fusion has gained significant attention for its broad application value. However, existing methods often neglect the complementary role of infrared image in restoring visible image features under hazy conditions. To address this, we propose a joint learning framework that utilizes infrared image for the restoration and fusion of hazy IR-VIS images. To mitigate the adverse effects of feature diversity between IR-VIS images, we introduce a prompt generation mechanism that regulates modality-specific feature incompatibility. This creates a prompt selection matrix from non-shared image information, followed by prompt embeddings generated from a prompt pool. These embeddings help generate candidate features for dehazing. We further design an infrared-assisted feature restoration mechanism that selects candidate features based on haze density, enabling simultaneous restoration and fusion within a single-stage framework. To enhance fusion quality, we construct a multi-stage prompt embedding fusion module that leverages feature supplementation from the prompt generation module. Our method effectively fuses IR-VIS images while removing haze, yielding clear, haze-free fusion results. In contrast to two-stage methods that dehaze and then fuse, our approach enables collaborative training in a single-stage framework, making the model relatively lightweight and suitable for practical deployment. Experimental results validate its effectiveness and demonstrate advantages over existing methods. The source code of the paper is available at https://github.com/fangjiaqi0909/IASSF

[161] RobustFormer: Noise-Robust Pre-training for images and videos

Ashish Bastola, Nishant Luitel, Hao Wang, Danda Pani Paudel, Roshani Poudel, Abolfazl Razi

Main category: cs.CV

TL;DR: RobustFormer is a novel DWT-based framework for noise-robust MAE pre-training on images and videos, eliminating expensive IDWT reconstruction while improving performance under severe noise conditions.

DetailsMotivation: Deep learning models like transformers are highly susceptible to noise and overfit on noisy patterns rather than robust features. Vision transformers are particularly vulnerable as they rely on pixel-level details that can be easily corrupted. DWT can isolate noise in high-frequency domains while preserving essential low-frequency information, but conventional DWT methods suffer from computational inefficiency due to required IDWT reconstruction steps.

Method: Introduces RobustFormer, a framework using Discrete Wavelet Transform (DWT) for efficient downsampling in MAE pre-training for images and videos. Eliminates the need for expensive Inverse DWT reconstruction and simplifies attention mechanisms to focus on noise-resilient multi-scale representations. First DWT-based method fully compatible with video inputs and MAE-style pre-training.
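
The key efficiency trick (DWT downsampling with no inverse transform) can be sketched in a few lines: keep only the low-frequency LL band as the downsampled map and discard, or separately process, the noise-dominated high-frequency bands. Applying this per channel with a 'haar' wavelet is an illustrative choice, not necessarily the paper's.

```python
# DWT-based downsampling without the inverse transform (no IDWT step).
import numpy as np
import pywt

def dwt_downsample(feat):                 # feat: (C, H, W)
    ll = [pywt.dwt2(c, "haar")[0] for c in feat]     # LL band only
    return np.stack(ll)

feat = np.random.rand(8, 64, 64)
print(dwt_downsample(feat).shape)         # (8, 32, 32): halved, noise-suppressed
```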

Result: Achieves up to 8% increase in Top-1 classification accuracy under severe noise conditions in Imagenet-C and up to 2.7% in Imagenet-P benchmarks compared to baseline. Up to 13% higher Top-1 accuracy on UCF-101 under severe custom noise perturbations while maintaining similar accuracy for clean datasets. Reduces computation complexity by up to 4.4% through IDWT removal compared to VideoMAE baseline without performance drop.

Conclusion: RobustFormer successfully addresses noise susceptibility in vision transformers by leveraging DWT for efficient noise-robust feature learning, eliminating computational overhead of IDWT while achieving significant performance improvements under noisy conditions across both image and video domains.

Abstract: While deep learning-based models like transformers have revolutionized time-series and vision tasks, they remain highly susceptible to noise and often overfit on noisy patterns rather than robust features. This issue is exacerbated in vision transformers, which rely on pixel-level details that can easily be corrupted. To address this, we leverage the discrete wavelet transform (DWT) for its ability to decompose inputs into multi-resolution layers, isolating noise primarily in the high-frequency domain while preserving essential low-frequency information for resilient feature learning. Conventional DWT-based methods, however, struggle with computational inefficiencies due to the requirement for a subsequent inverse discrete wavelet transform (IDWT) step. In this work, we introduce RobustFormer, a novel framework that enables noise-robust masked autoencoder (MAE) pre-training for both images and videos by using DWT for efficient downsampling, eliminating the need for expensive IDWT reconstruction and simplifying the attention mechanism to focus on noise-resilient multi-scale representations. To our knowledge, RobustFormer is the first DWT-based method fully compatible with video inputs and MAE-style pre-training. Extensive experiments on noisy image and video datasets demonstrate that our approach achieves up to an 8% increase in Top-1 classification accuracy under severe noise conditions in Imagenet-C and up to 2.7% in Imagenet-P standard benchmarks compared to the baseline, and up to 13% higher Top-1 accuracy on UCF-101 under severe custom noise perturbations while maintaining similar accuracy scores for clean datasets. We also observe a reduction of computation complexity by up to 4.4% through IDWT removal compared to the VideoMAE baseline without any performance drop.

[162] 3D-WAG: Hierarchical Wavelet-Guided Autoregressive Generation for High-Fidelity 3D Shapes

Tejaswini Medi, Arianna Rampini, Pradyumna Reddy, Pradeep Kumar Jayaraman, Margret Keuper

Main category: cs.CV

TL;DR: 3D-WAG introduces an autoregressive model for 3D shape generation using wavelet token maps and next-scale prediction, achieving efficient high-fidelity generation with better performance than traditional methods.

DetailsMotivation: Autoregressive models have been successful in NLP and 2D image generation but remain underexplored for 3D shapes. Traditional 3D AR models use next-token prediction at voxel/point level, which is restrictive and computationally expensive for large-scale 3D data. There's a need for more efficient and controllable 3D generation methods.

Method: 3D-WAG encodes 3D shapes as multi-scale wavelet token maps and uses a Transformer to predict the “next higher-resolution token map” autoregressively. This “next-scale” prediction approach reduces computational costs compared to traditional “next-token” prediction while preserving geometric details in a hierarchical, structured manner.
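
The "next-scale" loop can be illustrated with a toy predictor that maps the coarser token map to the next resolution in one shot, rather than emitting one token at a time; the linear stand-in and the scale schedule below are illustrative, not the paper's architecture.

```python
# Toy loop for next-scale autoregressive generation over token maps.
import torch
import torch.nn as nn

class NextScalePredictor(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.body = nn.Linear(dim, dim)   # stand-in for a Transformer

    def forward(self, coarse, out_hw):
        # Upsample the coarser map and refine it into the next scale
        up = nn.functional.interpolate(coarse, size=out_hw, mode="nearest")
        return self.body(up.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

model = NextScalePredictor()
tokens = torch.randn(1, 64, 2, 2)                  # coarsest wavelet token map
for hw in [(4, 4), (8, 8), (16, 16)]:              # scale schedule
    tokens = model(tokens, hw)
print(tokens.shape)  # torch.Size([1, 64, 16, 16])
```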

Result: 3D-WAG achieves superior performance in key metrics like Coverage and MMD compared to state-of-the-art methods on widely used benchmarks. It generates high-fidelity 3D shapes that closely match the real data distribution and supports unconditional, class-conditioned, and text-conditioned shape generation.

Conclusion: The proposed next-scale prediction approach for autoregressive 3D shape generation offers computational efficiency while maintaining geometric fidelity, making AR models more practical for data-intensive 3D domains compared to diffusion models.

Abstract: Autoregressive (AR) models have achieved remarkable success in natural language and image generation, but their application to 3D shape modeling remains largely unexplored. Unlike diffusion models, AR models enable more efficient and controllable generation with faster inference times, making them especially suitable for data-intensive domains. Traditional 3D generative models using AR approaches often rely on "next-token" predictions at the voxel or point level. While effective for certain applications, these methods can be restrictive and computationally expensive when dealing with large-scale 3D data. To tackle these challenges, we introduce 3D-WAG, an AR model for 3D implicit distance fields that can perform unconditional, class-conditioned, and text-conditioned shape generation. Our key idea is to encode shapes as multi-scale wavelet token maps and use a Transformer to predict the "next higher-resolution token map" in an autoregressive manner. By redefining the 3D AR generation task as "next-scale" prediction, we reduce the computational cost of generation compared to traditional "next-token" prediction models, while preserving essential geometric details of 3D shapes in a more structured and hierarchical manner. We evaluate 3D-WAG to showcase its benefit by quantitative and qualitative comparisons with state-of-the-art methods on widely used benchmarks. Our results show 3D-WAG achieves superior performance in key metrics like Coverage and MMD, generating high-fidelity 3D shapes that closely match the real data distribution.

[163] Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach

Linhao Huang, Xue Jiang, Zhiqiang Wang, Wentao Mo, Xi Xiao, Yong-Jie Yin, Bo Han, Feng Zheng

Main category: cs.CV

TL;DR: I2V-MLLM attack uses image-based MLLM as surrogate to craft transferable adversarial videos for black-box attacks on video MLLMs, achieving competitive performance to white-box attacks.

DetailsMotivation: Video-based MLLMs are vulnerable to adversarial examples, but transferability to unseen models (common real-world scenario) remains unexplored. Existing methods fail in black-box settings due to poor feature generalization, sparse frame focus, and lack of multimodal integration.

Method: I2V-MLLM attack uses image-based MLLM (I-MLLM) as surrogate model. Integrates multimodal interactions and spatiotemporal information to disrupt video representations in latent space. Uses perturbation propagation technique to handle different unknown frame sampling strategies.
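
The latent disruption objective reads as a transfer attack that pushes perturbed-frame features away from the clean ones under the surrogate encoder. A hypothetical PGD-style sketch follows, with a toy module standing in for the I-MLLM (the paper uses an image-based MLLM such as BLIP-2):

```python
# PGD on negative cosine similarity to disrupt video representations in
# the surrogate's latent space.
import torch
import torch.nn.functional as F

def disrupt(encoder, frames, eps=8 / 255, alpha=2 / 255, iters=10):
    clean = encoder(frames).detach()
    delta = torch.zeros_like(frames, requires_grad=True)
    for _ in range(iters):
        sim = F.cosine_similarity(encoder(frames + delta), clean, dim=-1).mean()
        sim.backward()                               # minimize similarity
        with torch.no_grad():
            delta -= alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)                  # L-inf projection
        delta.grad.zero_()
    return (frames + delta).clamp(0, 1).detach()

encoder = torch.nn.Sequential(torch.nn.Flatten(1), torch.nn.Linear(3 * 32 * 32, 128))
adv = disrupt(encoder, torch.rand(4, 3, 32, 32))     # 4 sampled frames
print(adv.shape)
```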

Result: Achieves strong transferability across different V-MLLMs on multiple video-text tasks. Black-box attacks using BLIP-2 as surrogate achieve competitive performance to white-box attacks: 57.98% AASR on MSVD-QA and 58.26% on MSRVTT-QA for Zero-Shot VideoQA.

Conclusion: The proposed I2V-MLLM attack effectively addresses limitations of existing methods and demonstrates strong adversarial transferability in black-box scenarios, revealing vulnerabilities in V-MLLMs that need attention for security.

Abstract: Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models - a common and practical real-world scenario - remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal large language model (I-MLLM) as a surrogate model to craft adversarial video samples. Multimodal interactions and spatiotemporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. Additionally, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as a surrogate model) achieve competitive performance, with average attack success rates (AASR) of 57.98% on MSVD-QA and 58.26% on MSRVTT-QA for Zero-Shot VideoQA tasks.

[164] CombatVLA: An Efficient Vision-Language-Action Model for Combat Tasks in 3D Action Role-Playing Games

Peng Chen, Pi Bu, Yingyao Wang, Xinyi Wang, Ziming Wang, Jie Guo, Yingxiu Zhao, Qi Zhu, Jun Song, Siran Yang, Jiamang Wang, Bo Zheng

Main category: cs.CV

TL;DR: CombatVLA is a 3B Vision-Language-Action model optimized for real-time combat tasks in 3D action RPGs, achieving superior performance and 50x speedup over existing methods.

DetailsMotivation: Current VLAs struggle with real-time decision-making in complex 3D environments requiring second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions, especially for combat tasks in action RPGs.

Method: Developed a 3B VLA model trained on video-action pairs collected by an action tracker, formatted as action-of-thought (AoT) sequences, and integrated into an action execution framework with truncated AoT strategy for efficient inference.

Result: Outperforms all existing models on combat understanding benchmark, achieves 50-fold acceleration in game combat, and has higher task success rate than human players.

Conclusion: CombatVLA represents an efficient VLA solution for real-time combat tasks in 3D environments, with all resources being open-sourced to advance embodied intelligence research.

Abstract: Recent advances in Vision-Language-Action models (VLAs) have expanded the capabilities of embodied intelligence. However, significant challenges remain in real-time decision-making in complex 3D environments, which demand second-level responses, high-resolution perception, and tactical reasoning under dynamic conditions. To advance the field, we introduce CombatVLA, an efficient VLA model optimized for combat tasks in 3D action role-playing games (ARPGs). Specifically, our CombatVLA is a 3B model trained on video-action pairs collected by an action tracker, where the data is formatted as action-of-thought (AoT) sequences. Thereafter, CombatVLA seamlessly integrates into an action execution framework, allowing efficient inference through our truncated AoT strategy. Experimental results demonstrate that CombatVLA not only outperforms all existing models on the combat understanding benchmark but also achieves a 50-fold acceleration in game combat. Moreover, it has a higher task success rate than human players. We will open-source all resources, including the action tracker, dataset, benchmark, model weights, training code, and the implementation of the framework at https://combatvla.github.io/.

[165] LightFormer: A lightweight and efficient decoder for remote sensing image segmentation

Sihang Chen, Lijun Yun, Ze Liu, JianFeng Zhu, Jie Chen, Hui Wang, Yueping Nie

Main category: cs.CV

TL;DR: LightFormer is a lightweight decoder for real-time semantic segmentation of remote sensing images that achieves excellent accuracy-efficiency trade-off for unstructured target detection.

DetailsMotivation: Deep learning models for remote sensing semantic segmentation have high decoder complexity that limits real-time deployment on edge platforms for time-critical applications like disaster assessment, UAV search-and-rescue, and cultural heritage monitoring.

Method: Proposes LightFormer with two key modules: 1) Feature-fusion and refinement module using channel processing and learnable gating to efficiently aggregate multi-scale, multi-range information; 2) Spatial information selection module (SISM) integrating long-range attention with detail preservation branch to capture spatial dependencies across scales.
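
A simplified reading of the gated fusion: resize multi-scale features to a common resolution and mix them with sigmoid gates computed from channel statistics. The sketch below assumes this design and omits the refinement path and the SISM branch.

```python
# Sketch of multi-scale feature fusion with a learnable gating mechanism.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    def __init__(self, channels, n_scales):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Linear(channels, channels) for _ in range(n_scales)])

    def forward(self, feats):                       # list of (B, C, Hi, Wi)
        hw = feats[0].shape[2:]
        fused = 0
        for f, gate in zip(feats, self.gates):
            f = F.interpolate(f, size=hw, mode="bilinear", align_corners=False)
            g = torch.sigmoid(gate(f.mean(dim=(2, 3))))[:, :, None, None]
            fused = fused + g * f                   # gated aggregation
        return fused

fusion = GatedFusion(channels=64, n_scales=3)
feats = [torch.randn(2, 64, s, s) for s in (64, 32, 16)]
print(fusion(feats).shape)  # torch.Size([2, 64, 64, 64])
```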

Result: On ISPRS Vaihingen benchmark: achieves 99.9% of GLFFNet’s mIoU (83.9% vs. 84.0%) with only 14.7% FLOPs and 15.9% parameters. Consistent strong performance on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet datasets.

Conclusion: LightFormer provides a practical solution for remote sensing applications requiring both computational efficiency and high-precision segmentation, demonstrating robust performance for unstructured object perception in complex scenes.

Abstract: Deep learning techniques have achieved remarkable success in the semantic segmentation of remote sensing images and in land-use change detection. Nevertheless, their real-time deployment on edge platforms remains constrained by decoder complexity. Herein, we introduce LightFormer, a lightweight decoder for time-critical tasks that involve unstructured targets, such as disaster assessment, unmanned aerial vehicle search-and-rescue, and cultural heritage monitoring. LightFormer employs a feature-fusion and refinement module built on channel processing and a learnable gating mechanism to aggregate multi-scale, multi-range information efficiently, which drastically curtails model complexity. Furthermore, we propose a spatial information selection module (SISM) that integrates long-range attention with a detail preservation branch to capture spatial dependencies across multiple scales, thereby substantially improving the recognition of unstructured targets in complex scenes. On the ISPRS Vaihingen benchmark, LightFormer attains 99.9% of GLFFNet’s mIoU (83.9% vs. 84.0%) while requiring only 14.7% of its FLOPs and 15.9% of its parameters, thus achieving an excellent accuracy-efficiency trade-off. Consistent results on LoveDA, ISPRS Potsdam, RescueNet, and FloodNet further demonstrate its robustness and superior perception of unstructured objects. These findings highlight LightFormer as a practical solution for remote sensing applications where both computational economy and high-precision segmentation are imperative.

[166] ReVision: Refining Video Diffusion with Explicit 3D Motion Modeling

Qihao Liu, Ju He, Qihang Yu, Liang-Chieh Chen, Alan Yuille

Main category: cs.CV

TL;DR: ReVision is a plug-and-play framework that enhances video diffusion models by integrating 3D motion knowledge, enabling generation of complex motions and interactions with improved fidelity and coherence.

DetailsMotivation: Current video generation models struggle with generating complex motions and interactions. There's a need for better motion fidelity and physical plausibility in generated videos, especially for scenarios involving complex actions and interactions.

Method: Three-stage framework: 1) Generate coarse video using video diffusion model, 2) Extract 2D/3D features to build object-centric representation, refine with parameterized motion prior model to get accurate 3D motion sequence, 3) Feed refined motion sequence back as additional conditioning to generate motion-consistent videos.

Result: Significantly improves motion fidelity and coherence on Stable Video Diffusion. With only 1.5B parameters, outperforms state-of-the-art 13B+ parameter models on complex video generation by substantial margin.

Conclusion: Incorporating 3D motion knowledge enables even small video diffusion models to generate complex motions and interactions with greater realism and controllability, offering promising solution for physically plausible video generation.

Abstract: In recent years, video generation has seen significant advancements. However, challenges still persist in generating complex motions and interactions. To address these challenges, we introduce ReVision, a plug-and-play framework that explicitly integrates parameterized 3D model knowledge into a pretrained conditional video generation model, significantly enhancing its ability to generate high-quality videos with complex motion and interactions. Specifically, ReVision consists of three stages. First, a video diffusion model is used to generate a coarse video. Next, we extract a set of 2D and 3D features from the coarse video to construct a 3D object-centric representation, which is then refined by our proposed parameterized motion prior model to produce an accurate 3D motion sequence. Finally, this refined motion sequence is fed back into the same video diffusion model as additional conditioning, enabling the generation of motion-consistent videos, even in scenarios involving complex actions and interactions. We validate the effectiveness of our approach on Stable Video Diffusion, where ReVision significantly improves motion fidelity and coherence. Remarkably, with only 1.5B parameters, it even outperforms a state-of-the-art video generation model with over 13B parameters on complex video generation by a substantial margin. Our results suggest that, by incorporating 3D motion knowledge, even a relatively small video diffusion model can generate complex motions and interactions with greater realism and controllability, offering a promising solution for physically plausible video generation.

[167] seq-JEPA: Autoregressive Predictive Learning of Invariant-Equivariant World Models

Hafez Ghaemi, Eilif Muller, Shahab Bakhtiari

Main category: cs.CV

TL;DR: seq-JEPA is a self-supervised learning framework that learns separate representations for equivariance- and invariance-demanding tasks by processing sequences of views with transformation embeddings, eliminating the trade-off between these task types.

DetailsMotivation: Current two-view SSL methods create performance trade-offs between high-level invariance tasks (like image classification) and fine-grained equivariance tasks, limiting representation flexibility for downstream adaptation.

Method: Processes short sequences of different views of inputs, concatenates each encoded view with transformation embeddings, passes through transformer encoder, and uses predictor head to condition on upcoming actions to predict next observation representations.

Result: Demonstrates strong performance on both equivariance- and invariance-demanding downstream tasks without sacrificing one for the other, and excels at sequence aggregation tasks like path integration and predictive learning across eye movements.

Conclusion: seq-JEPA resolves the trade-off between invariance and equivariance in SSL through architectural inductive biases, enabling simultaneous learning of separate representations for different task types and excelling at sequence-based tasks.

Abstract: Joint-embedding self-supervised learning (SSL) commonly relies on transformations such as data augmentation and masking to learn visual representations, a task achieved by enforcing invariance or equivariance with respect to these transformations applied to two views of an image. This dominant two-view paradigm in SSL often limits the flexibility of learned representations for downstream adaptation by creating performance trade-offs between high-level invariance-demanding tasks such as image classification and more fine-grained equivariance-related tasks. In this work, we propose seq-JEPA, a world modeling framework that introduces architectural inductive biases into joint-embedding predictive architectures to resolve this trade-off. Without relying on dual equivariance predictors or loss terms, seq-JEPA simultaneously learns two architecturally separate representations for equivariance- and invariance-demanding tasks. To do so, our model processes short sequences of different views (observations) of inputs. Each encoded view is concatenated with an embedding of the relative transformation (action) that produces the next observation in the sequence. These view-action pairs are passed through a transformer encoder that outputs an aggregate representation. A predictor head then conditions this aggregate representation on the upcoming action to predict the representation of the next observation. Empirically, seq-JEPA demonstrates strong performance on both equivariance- and invariance-demanding downstream tasks without sacrificing one for the other. Furthermore, it excels at tasks that inherently require aggregating a sequence of observations, such as path integration across actions and predictive learning across eye movements.
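
A compact PyTorch sketch of the rollout described above: encode each view, concatenate its action embedding, aggregate with a transformer, and predict the next-view representation conditioned on the upcoming action. Dimensions and module choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SeqJEPASketch(nn.Module):
    def __init__(self, dim=128, act_dim=16, n_layers=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, dim))
        self.act_embed = nn.Linear(act_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=2 * dim, nhead=4, batch_first=True)
        self.aggregator = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.predictor = nn.Sequential(nn.Linear(2 * dim + dim, dim), nn.ReLU(),
                                       nn.Linear(dim, dim))

    def forward(self, views, actions, next_action):
        # views: (B, S, 3, 32, 32); actions: (B, S, act_dim); next_action: (B, act_dim)
        B, S = views.shape[:2]
        z = self.encoder(views.flatten(0, 1)).view(B, S, -1)      # encode each view
        pairs = torch.cat([z, self.act_embed(actions)], dim=-1)   # view-action pairs
        agg = self.aggregator(pairs).mean(dim=1)                  # aggregate representation
        # The predictor conditions the aggregate on the upcoming action.
        return self.predictor(torch.cat([agg, self.act_embed(next_action)], dim=-1))

m = SeqJEPASketch()
pred = m(torch.randn(4, 5, 3, 32, 32), torch.randn(4, 5, 16), torch.randn(4, 16))
print(pred.shape)  # torch.Size([4, 128]) -- predicted next-observation representation
```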

[168] PsOCR: Benchmarking Large Multimodal Models for Optical Character Recognition in Low-resource Pashto Language

Ijazul Haq, Yingjie Zhang, Irfan Ali Khan

Main category: cs.CV

TL;DR: The paper introduces PsOCR, a synthetic Pashto OCR dataset of 1M images with multi-level annotations, and benchmarks LMM performance on Pashto OCR, finding Gemini performs best overall while Qwen-7B leads among open-source models.

DetailsMotivation: Pashto NLP faces challenges due to cursive script and lack of structured datasets. There's a need for OCR evaluation in low-resource languages like Pashto to advance research in similar scripts.

Method: Created synthetic Pashto OCR dataset (PsOCR) with 1M images annotated at word, line, and document levels across 1,000 font families. Evaluated 11 LMMs (7 open-source, 4 closed-source) on a 10K image benchmark subset.

Result: Gemini achieved best overall performance among all models. Among open-source models, Qwen-7B performed best. The dataset enables comprehensive evaluation of OCR capabilities for Pashto.

Conclusion: Provides first comprehensive assessment of LMMs for Pashto OCR, establishes foundation for research in Pashto and similar scripts (Arabic, Persian, Urdu), and releases PsOCR dataset publicly.

Abstract: This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) in the low-resource Pashto language. Natural Language Processing (NLP) in Pashto faces several challenges due to the cursive nature of its script and a scarcity of structured datasets. To address this, we developed a synthetic Pashto OCR dataset, PsOCR, consisting of one million images annotated with bounding boxes at word, line, and document levels, suitable for training and evaluating models based on different architectures, including Convolutional Neural Networks (CNNs) and Transformers. PsOCR covers variations across 1,000 unique font families, colors, image sizes, and layouts. A benchmark subset of 10K images was selected to evaluate the performance of several LMMs, including seven open-source models: DeepSeek’s Janus, InternVL, MiniCPM, Florence, and Qwen (3B and 7B), and four closed-source models: GPT-4o, Gemini, Claude, and Grok. Experimental results demonstrate that Gemini achieves the best performance among all models, whereas among open-source models, Qwen-7B stands out. This work provides an insightful assessment of the capabilities and limitations of current LMMs for OCR tasks in Pashto and establishes a foundation for further research not only in Pashto OCR but also for other similar scripts such as Arabic, Persian, and Urdu. PsOCR is available at https://github.com/zirak-ai/PashtoOCR.

[169] Branch, or Layer? Zeroth-Order Optimization for Continual Learning of Vision-Language Models

Ziwei Liu, Borui Kang, Wei Li, Hangjie Yuan, Yanbing Yang, Wenbin Li, Yifan Zhu, Tao Feng, Jun Luo

Main category: cs.CV

TL;DR: This paper pioneers Zeroth-Order (ZO) optimization for Parameter-Efficient Fine-Tuning (PEFT) in Vision-Language Continual Learning (VLCL), addressing First-Order optimization’s tendency to trap models in local minima. The authors develop a modality-aware ZO strategy with gradient sign normalization and vision modality perturbation constraints, achieving state-of-the-art results.

DetailsMotivation: First-Order (FO) optimization in PEFT-based VLCL tends to trap models in suboptimal local minima due to limited exploration subspace. The paper aims to overcome this limitation by exploring Zeroth-Order (ZO) optimization, which can better escape local minima during optimization.

Method: The authors systematically explore ZO optimization for PEFT-based VLCL, starting by identifying the incompatibility of naive full-ZO adoption. They investigate applying ZO at granularities ranging from modality branch-wise to fine-grained layer-wise across training units. They then propose a modality-aware ZO strategy with gradient sign normalization and vision-modality perturbation constraints, based on the theoretical insight that the vision modality exhibits higher variance than language in VLCL during ZO optimization.

Result: Extensive experiments on four benchmarks demonstrate that the proposed method achieves state-of-the-art results. The adoption of ZO optimization enables PEFT-based VLCL to better escape local minima during optimization.

Conclusion: The paper successfully pioneers ZO optimization for PEFT-based VLCL, overcoming FO optimization limitations. The modality-aware ZO strategy with gradient sign normalization and vision modality constraints effectively improves performance, establishing a new state-of-the-art approach for vision-language continual learning.

Abstract: Vision-Language Continual Learning (VLCL) has attracted significant research attention for its robust capabilities, and the adoption of Parameter-Efficient Fine-Tuning (PEFT) strategies is enabling these models to achieve competitive performance with substantially reduced resource consumption. However, the dominant First-Order (FO) optimization is prone to trapping models in suboptimal local minima, especially given the limited exploration subspace within PEFT. To overcome this challenge, this paper pioneers a systematic exploration of Zeroth-Order (ZO) optimization for PEFT-based VLCL. We first identify the incompatibility of naive full-ZO adoption in VLCL due to instability of the optimization process. We then investigate applying ZO optimization at granularities ranging from modality branch-wise to fine-grained layer-wise across various training units to identify an optimal strategy. Moreover, a key theoretical insight reveals that the vision modality exhibits higher variance than its language counterpart in VLCL during ZO optimization, and we propose a modality-aware ZO strategy that adopts gradient sign normalization and constrains vision-modality perturbation to further improve performance. Benefiting from ZO optimization, PEFT-based VLCL gains a better ability to escape local minima during optimization; extensive experiments on four benchmarks demonstrate that our method achieves state-of-the-art results.
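
For intuition, here is a minimal two-point (SPSA-style) zeroth-order step with gradient sign normalization and a loose stand-in for the vision perturbation constraint. The paper's actual update rule and constraint are more involved; all names below are hypothetical.

```python
import torch

def zo_sign_step(params, loss_fn, mu=1e-3, lr=1e-2, vision_scale=0.5):
    """Two-point zeroth-order gradient estimate with sign normalization.
    `vision_scale` shrinks perturbations on parameters tagged as vision-branch,
    a loose stand-in for the paper's vision perturbation constraint."""
    with torch.no_grad():
        # One Rademacher (+-1) perturbation per parameter tensor.
        zs = {n: (torch.randint(0, 2, p.shape) * 2 - 1).to(p.dtype)
              for n, p in params.items()}
        for n, z in zs.items():
            if "vision" in n:                       # constrain vision perturbation
                z.mul_(vision_scale)
        for n, p in params.items():                 # evaluate at theta + mu * z
            p.add_(mu * zs[n])
        loss_plus = loss_fn()
        for n, p in params.items():                 # evaluate at theta - mu * z
            p.sub_(2 * mu * zs[n])
        loss_minus = loss_fn()
        for n, p in params.items():                 # restore theta
            p.add_(mu * zs[n])
        g = (loss_plus - loss_minus) / (2 * mu)     # directional derivative estimate
        for n, p in params.items():                 # sign-normalized update
            p.sub_(lr * torch.sign(g * zs[n]))

# Toy usage: minimize ||(W_v + W_l) x - y||^2 over "vision"/"language" weights.
W_v, W_l = torch.randn(4, 4), torch.randn(4, 4)
x, y = torch.randn(4), torch.randn(4)
params = {"vision.W": W_v, "language.W": W_l}
loss_fn = lambda: ((W_v @ x + W_l @ x - y) ** 2).sum()
for _ in range(300):
    zo_sign_step(params, loss_fn)
print(loss_fn().item())  # typically much smaller than the initial loss
```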

[170] Probing Deep into Temporal Profile Makes the Infrared Small Target Detector Much Better

Ruojing Li, Wei An, Yingqian Wang, Xinyi Ying, Yimian Dai, Longguang Wang, Miao Li, Yulan Guo, Li Liu

Main category: cs.CV

TL;DR: DeepPro is an efficient infrared small target detection method that transforms the task into 1D signal anomaly detection using temporal profile information instead of spatial features, achieving state-of-the-art performance with high efficiency.

DetailsMotivation: Current learning-based IRST detection methods rely on spatial and short-term temporal information but suffer from unreliable performance under complex conditions and computational redundancy. The authors identify that global temporal saliency and correlation information in temporal profiles is more essential for distinguishing targets from interference.

Method: The authors remodel IRST detection as a 1D signal anomaly detection task and propose DeepPro (deep temporal probe network) that only performs calculations in the time dimension. They first built a prediction attribution tool to verify the importance of temporal profile information, then designed an efficient network that operates on temporal profiles rather than spatial features.

Result: DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency. It achieves significant improvement on dim targets and in complex scenarios, demonstrating superior performance.

Conclusion: The work provides a new modeling domain (temporal profile), a new insight (temporal information superiority), a new method (DeepPro), and new performance benchmarks for IRST detection, which can promote the development of the field.

Abstract: Infrared small target (IRST) detection is challenging in simultaneously achieving precise, robust, and efficient performance due to extremely dim targets and strong interference. Current learning-based methods attempt to leverage “more” information from both the spatial and the short-term temporal domains, but suffer from unreliable performance under complex conditions while incurring computational redundancy. In this paper, we explore the “more essential” information from a more crucial domain for the detection. Through theoretical analysis, we reveal that the global temporal saliency and correlation information in the temporal profile demonstrate significant superiority in distinguishing target signals from other signals. To investigate whether such superiority is preferentially leveraged by well-trained networks, we built the first prediction attribution tool in this field and verified the importance of the temporal profile information. Inspired by the above conclusions, we remodel the IRST detection task as a one-dimensional signal anomaly detection task, and propose an efficient deep temporal probe network (DeepPro) that only performs calculations in the time dimension for IRST detection. We conducted extensive experiments to fully validate the effectiveness of our method. The experimental results are exciting, as our DeepPro outperforms existing state-of-the-art IRST detection methods on widely-used benchmarks with extremely high efficiency, and achieves a significant improvement on dim targets and in complex scenarios. We provide a new modeling domain, a new insight, a new method, and a new performance, which can promote the development of IRST detection. Codes are available at https://github.com/TinaLRJ/DeepPro.
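
The remodeling into 1D signals is simple to illustrate: each pixel's temporal profile becomes an independent sequence processed by a temporal-only network. A toy PyTorch sketch (not the DeepPro release):

```python
import torch
import torch.nn as nn

class TemporalProbeSketch(nn.Module):
    """Treat each pixel's temporal profile as a 1D signal and score it with a
    small network that only computes along the time dimension."""
    def __init__(self, t_len: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 1, kernel_size=1),  # per-frame target evidence
        )

    def forward(self, clip):
        # clip: (T, H, W) infrared sequence -> profiles: (H*W, 1, T)
        T, H, W = clip.shape
        profiles = clip.permute(1, 2, 0).reshape(H * W, 1, T)
        scores = self.net(profiles)                      # anomaly score per pixel/frame
        return scores.reshape(H, W, T).permute(2, 0, 1)  # back to (T, H, W)

clip = torch.randn(32, 64, 64)
masks = TemporalProbeSketch(32)(clip)
print(masks.shape)  # torch.Size([32, 64, 64])
```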

[171] Neural-Driven Image Editing

Pengfei Zhou, Jie Xia, Xiaopeng Peng, Wangbo Zhao, Zilong Ye, Zekai Li, Suorong Yang, Jiadong Pan, Yuanxiang Chen, Ziqiao Wang, Kai Wang, Qian Zheng, Hao Jin, Xiaojun Chang, Gang Pan, Shurong Dong, Kaipeng Zhang, Yang You

Main category: cs.CV

TL;DR: LoongX is a hands-free image editing system that uses multimodal neurophysiological signals (EEG, fNIRS, PPG, head motion) instead of manual prompting, making image editing accessible to people with motor or language limitations.

DetailsMotivation: Traditional image editing requires manual prompting which is labor-intensive and inaccessible to people with limited motor control or language abilities. There's a need for more intuitive, accessible image editing methods.

Method: Uses diffusion models trained on 23,928 image editing pairs with synchronized neurophysiological signals. Features cross-scale state space (CS3) module for modality-specific encoding, dynamic gated fusion (DGF) module for feature aggregation, and fine-tuning on diffusion transformer (DiT). Pre-trains encoders with contrastive learning to align cognitive states with semantic intentions.

Result: Achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549).

Conclusion: Demonstrates the promise of neural-driven generative models for accessible, intuitive image editing and opens new directions for cognitive-driven creative technologies.

Abstract: Traditional image editing typically relies on manual prompting, making it labor-intensive and inaccessible to individuals with limited motor control or language abilities. Leveraging recent advances in brain-computer interfaces (BCIs) and generative models, we propose LoongX, a hands-free image editing approach driven by multimodal neurophysiological signals. LoongX utilizes state-of-the-art diffusion models trained on a comprehensive dataset of 23,928 image editing pairs, each paired with synchronized electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), photoplethysmography (PPG), and head motion signals that capture user intent. To effectively address the heterogeneity of these signals, LoongX integrates two key modules. The cross-scale state space (CS3) module encodes informative modality-specific features. The dynamic gated fusion (DGF) module further aggregates these features into a unified latent space, which is then aligned with edit semantics via fine-tuning on a diffusion transformer (DiT). Additionally, we pre-train the encoders using contrastive learning to align cognitive states with semantic intentions from embedded natural language. Extensive experiments demonstrate that LoongX achieves performance comparable to text-driven methods (CLIP-I: 0.6605 vs. 0.6558; DINO: 0.4812 vs. 0.4636) and outperforms them when neural signals are combined with speech (CLIP-T: 0.2588 vs. 0.2549). These results highlight the promise of neural-driven generative models in enabling accessible, intuitive image editing and open new directions for cognitive-driven creative technologies. The code and dataset are released on the project website: https://loongx1.github.io.
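
A minimal sketch of dynamic gated fusion over heterogeneous signal embeddings; the layout and dimensions are assumptions, not LoongX's released DGF module.

```python
import torch
import torch.nn as nn

class DynamicGatedFusion(nn.Module):
    """Weight each modality embedding (EEG, fNIRS, PPG, head motion) with a
    learned, input-dependent gate and sum into one latent."""
    def __init__(self, dim: int, n_modalities: int):
        super().__init__()
        self.gate = nn.Linear(dim * n_modalities, n_modalities)

    def forward(self, feats):
        # feats: list of (B, dim) modality embeddings
        stacked = torch.stack(feats, dim=1)                             # (B, M, dim)
        w = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)  # (B, M)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)                   # fused (B, dim)

eeg, fnirs, ppg, motion = (torch.randn(2, 256) for _ in range(4))
fused = DynamicGatedFusion(256, 4)([eeg, fnirs, ppg, motion])
print(fused.shape)  # torch.Size([2, 256])
```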

[172] Hallucination Score: Towards Mitigating Hallucinations in Generative Image Super-Resolution

Weiming Ren, Raghav Goyal, Zhiming Hu, Tristan Ty Aumentado-Armstrong, Iqbal Mohomed, Alex Levinshtein

Main category: cs.CV

TL;DR: This paper addresses hallucination artifacts in generative super-resolution models, proposes a Hallucination Score using MLLMs to measure them, and shows how to fine-tune diffusion models to reduce hallucinations.

DetailsMotivation: Generative super-resolution models produce perceptual artifacts where generated details don't match the low-resolution input or ground truth, limiting practical deployment. These hallucinations aren't well-characterized by existing metrics.

Method: Use multimodal large language models (MLLMs) with specialized prompts to assess hallucinatory elements and generate a Hallucination Score (HS). Create efficient HS proxies and use them as differentiable reward functions to fine-tune diffusion-based GSR models.

Result: HS aligns closely with human evaluations and provides complementary insights to prior SR metrics. The proposed method successfully reduces hallucinations in diffusion-based GSR models through fine-tuning.

Conclusion: The paper introduces a novel approach to measure and mitigate hallucination artifacts in GSR using MLLMs, enabling practical improvement of generative super-resolution models by addressing a critical but under-studied issue.

Abstract: Generative super-resolution (GSR) currently sets the state-of-the-art in terms of perceptual image quality, overcoming the “regression-to-the-mean” blur of prior non-generative models. However, from a human perspective, such models do not fully conform to the optimal balance between quality and fidelity. Instead, a different class of artifacts, in which generated details fail to perceptually match the low resolution image (LRI) or ground-truth image (GTI), is a critical but under-studied issue in GSR, limiting its practical deployment. In this work, we focus on measuring, analyzing, and mitigating these artifacts (i.e., “hallucinations”). We observe that hallucinations are not well-characterized with existing image metrics or quality models, as they are orthogonal to both exact fidelity and no-reference quality. Instead, we take advantage of multimodal large language models (MLLMs) by constructing a prompt that assesses hallucinatory visual elements and generates a “Hallucination Score” (HS). We find that HS is closely aligned with human evaluations, and also provides complementary insights to prior image metrics used for super-resolution (SR) models. Finally, we propose a few efficient HS proxies and demonstrate how diffusion-based GSR models can be fine-tuned to mitigate hallucinations, leveraging HS proxies as differentiable reward functions.
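
A hedged sketch of how an MLLM-derived Hallucination Score could be computed; the prompt wording and the query_mllm client are hypothetical placeholders, not the paper's released assets.

```python
# All names here are illustrative; plug in any multimodal LLM client.
HS_PROMPT = """You are shown a low-resolution input, a super-resolved output,
and the ground-truth image. List generated details in the output that
contradict the input or ground truth, then rate hallucination severity
from 0 (none) to 10 (severe). Answer with 'SCORE: <number>' on the last line."""

def hallucination_score(lr_img, sr_img, gt_img, query_mllm) -> float:
    """query_mllm(prompt, images) -> str is an injected, hypothetical client."""
    reply = query_mllm(HS_PROMPT, [lr_img, sr_img, gt_img])
    for line in reversed(reply.strip().splitlines()):
        if line.upper().startswith("SCORE:"):
            return float(line.split(":", 1)[1])
    raise ValueError("no score found in MLLM reply")

# Offline usage with a stubbed client:
print(hallucination_score(None, None, None, lambda p, ims: "details...\nSCORE: 3"))
```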

[173] AttriCtrl: Fine-Grained Control of Aesthetic Attribute Intensity in Diffusion Models

Die Chen, Zhongjie Duan, Zhiwen Li, Cen Chen, Daoyuan Chen, Yaliang Li, Yingda Chen

Main category: cs.CV

TL;DR: AttriCtrl is a lightweight framework that enables continuous control over aesthetic attributes in diffusion models through a plug-and-play value encoder, allowing precise adjustment of single or multiple aesthetic dimensions on a [0,1] scale.

DetailsMotivation: Current diffusion models struggle with numeric instructions for adjusting semantic attributes, especially for precise aesthetic control. Existing methods fail because text encoders are designed for discrete tokens rather than continuous values, and aesthetic alignment approaches overlook the compositional nature of aesthetics.

Method: Defines relevant aesthetic attributes, quantifies them through a hybrid strategy mapping both concrete and abstract dimensions onto a unified [0,1] scale, and uses a plug-and-play value encoder to transform user-specified values into model-interpretable embeddings for controllable generation.

Result: AttriCtrl achieves accurate and continuous control over both single and multiple aesthetic attributes, significantly enhancing personalization and diversity. It works as a lightweight adapter while keeping the diffusion model frozen, enabling seamless integration with existing frameworks like ControlNet at negligible computational cost.

Conclusion: The framework successfully addresses the gap in continuous aesthetic intensity control for diffusion models, providing precise, compositional control over aesthetic attributes while maintaining computational efficiency and compatibility with existing systems.

Abstract: Diffusion models have recently become the dominant paradigm for image generation, yet existing systems struggle to interpret and follow numeric instructions for adjusting semantic attributes. In real-world creative scenarios, especially when precise control over aesthetic attributes is required, current methods fail to provide such controllability. This limitation partly arises from the subjective and context-dependent nature of aesthetic judgments, but more fundamentally stems from the fact that current text encoders are designed for discrete tokens rather than continuous values. Meanwhile, efforts on aesthetic alignment, often leveraging reinforcement learning, direct preference optimization, or architectural modifications, primarily align models with a global notion of human preference. While these approaches improve user experience, they overlook the multifaceted and compositional nature of aesthetics, underscoring the need for explicit disentanglement and independent control of aesthetic attributes. To address this gap, we introduce AttriCtrl, a lightweight framework for continuous aesthetic intensity control in diffusion models. It first defines relevant aesthetic attributes, then quantifies them through a hybrid strategy that maps both concrete and abstract dimensions onto a unified [0,1] scale. A plug-and-play value encoder is then used to transform user-specified values into model-interpretable embeddings for controllable generation. Experiments show that AttriCtrl achieves accurate and continuous control over both single and multiple aesthetic attributes, significantly enhancing personalization and diversity. Crucially, it is implemented as a lightweight adapter while keeping the diffusion model frozen, ensuring seamless integration with existing frameworks such as ControlNet at negligible computational cost.
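
A minimal sketch of a plug-and-play value encoder that maps a scalar intensity in [0,1] to a conditioning embedding; the sinusoidal featurization and MLP are assumptions, not AttriCtrl's exact design.

```python
import torch
import torch.nn as nn

class ValueEncoderSketch(nn.Module):
    """Map an aesthetic intensity in [0, 1] to an embedding a frozen diffusion
    model can consume as extra conditioning."""
    def __init__(self, dim: int = 768, n_freqs: int = 8):
        super().__init__()
        self.register_buffer("freqs", 2.0 ** torch.arange(n_freqs) * torch.pi)
        self.mlp = nn.Sequential(nn.Linear(2 * n_freqs, dim), nn.SiLU(),
                                 nn.Linear(dim, dim))

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (B,) attribute intensities in [0, 1]
        ang = v[:, None] * self.freqs                    # (B, n_freqs)
        feats = torch.cat([ang.sin(), ang.cos()], dim=-1)
        return self.mlp(feats)                           # (B, dim) embedding

emb = ValueEncoderSketch()(torch.tensor([0.0, 0.5, 1.0]))
print(emb.shape)  # torch.Size([3, 768])
```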

[174] ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression

Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir

Main category: cs.CV

TL;DR: CNNs are not inherently texture-biased as previously thought; they primarily rely on local shape features, though this can be mitigated with modern training/architectures. Feature reliance patterns differ systematically across domains: shape in computer vision, color in medical imaging, and texture in remote sensing.

DetailsMotivation: To challenge the established hypothesis that CNNs are inherently texture-biased by examining limitations in previous cue-conflict experiments and developing a more rigorous framework to quantify feature reliance across different domains.

Method: Proposed a domain-agnostic framework that systematically suppresses shape, texture, and color cues to quantify feature reliance, avoiding forced-choice conflicts. Evaluated both humans and neural networks under controlled suppression conditions across computer vision, medical imaging, and remote sensing domains.

Result: CNNs are not inherently texture-biased but predominantly rely on local shape features. Modern training strategies or architectures (ConvNeXt, ViTs) can substantially mitigate this reliance. Feature reliance patterns differ systematically across domains: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models show stronger texture reliance.

Conclusion: The texture-bias hypothesis for CNNs is oversimplified; CNNs primarily use local shape features, and feature reliance patterns vary systematically across different application domains, suggesting domain-specific considerations for model design and evaluation.

Abstract: The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
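
The suppression idea can be illustrated with three standard operators, one per cue; the paper's exact transforms may differ.

```python
import torch
import torchvision.transforms.functional as TF

# Illustrative cue-suppression operators: grayscale removes color, heavy blur
# removes texture, and patch shuffling destroys global shape arrangement.
def suppress_color(img):                 # img: (3, H, W) in [0, 1]
    return TF.rgb_to_grayscale(img, num_output_channels=3)

def suppress_texture(img, sigma=4.0):
    return TF.gaussian_blur(img, kernel_size=21, sigma=sigma)

def suppress_shape(img, patch=16):
    c, h, w = img.shape
    tiles = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, gh, gw, p, p)
    gh, gw = tiles.shape[1], tiles.shape[2]
    flat = tiles.reshape(c, gh * gw, patch, patch)
    perm = torch.randperm(gh * gw)                               # shuffle patches
    shuffled = flat[:, perm].reshape(c, gh, gw, patch, patch)
    return shuffled.permute(0, 1, 3, 2, 4).reshape(c, gh * patch, gw * patch)

img = torch.rand(3, 224, 224)
for fn in (suppress_color, suppress_texture, suppress_shape):
    print(fn(img).shape)  # each stays (3, 224, 224)
```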

[175] Reflect3r: Single-View 3D Stereo Reconstruction Aided by Mirror Reflections

Jing Wu, Zirui Wang, Iro Laina, Victor Adrian Prisacariu

Main category: cs.CV

TL;DR: Single-image 3D reconstruction using mirror reflections as virtual stereo views, enabling multi-view stereo from a single capture with applications to both static and dynamic scenes.

DetailsMotivation: Mirror reflections are common in everyday environments and contain stereo information within a single image, but existing methods don't fully exploit this for 3D reconstruction. The authors aim to simplify the imaging process by using mirrors to create virtual views from single images.

Method: Treat mirror reflections as auxiliary views, design a transformation to construct physically valid virtual cameras for direct pixel-domain virtual view generation. Use symmetric-aware loss to refine pose estimation by exploiting geometric symmetry from mirrors. Framework extends to dynamic scenes with per-frame geometry recovery.

Result: Created a fully customizable synthetic dataset of 16 Blender scenes with ground-truth point clouds and camera poses. Extensive experiments on both real-world and synthetic data demonstrate the method’s effectiveness for generalizable and robust 3D reconstruction.

Conclusion: The method successfully enables multi-view stereo from single images using mirror reflections, simplifying the imaging process while maintaining compatibility with feed-forward reconstruction models. The approach works for both static and dynamic scenes and shows promising results through comprehensive evaluation.

Abstract: Mirror reflections are common in everyday environments and can provide stereo information within a single capture, as the real and reflected virtual views are visible simultaneously. We exploit this property by treating the reflection as an auxiliary view and designing a transformation that constructs a physically valid virtual camera, allowing direct pixel-domain generation of the virtual view while adhering to the real-world imaging process. This enables a multi-view stereo setup from a single image, simplifying the imaging process, making it compatible with powerful feed-forward reconstruction models for generalizable and robust 3D reconstruction. To further exploit the geometric symmetry introduced by mirrors, we propose a symmetric-aware loss to refine pose estimation. Our framework also naturally extends to dynamic scenes, where each frame contains a mirror reflection, enabling efficient per-frame geometry recovery. For quantitative evaluation, we provide a fully customizable synthetic dataset of 16 Blender scenes, each with ground-truth point clouds and camera poses. Extensive experiments on real-world data and synthetic data are conducted to illustrate the effectiveness of our method.
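
The virtual-camera construction reduces to reflecting the real camera pose across the mirror plane. A small NumPy sketch follows (the paper's pipeline additionally handles the handedness flip, since a reflection has determinant -1):

```python
import numpy as np

def mirror_matrix(n, d):
    """Homogeneous reflection across the plane {x : n . x = d}."""
    n = np.asarray(n, float)
    n /= np.linalg.norm(n)
    M = np.eye(4)
    M[:3, :3] = np.eye(3) - 2.0 * np.outer(n, n)   # Householder reflection
    M[:3, 3] = 2.0 * d * n                         # plane offset term
    return M

def virtual_camera(cam_to_world, n, d):
    # Reflecting the real camera-to-world pose yields the virtual camera pose.
    return mirror_matrix(n, d) @ cam_to_world

T = np.eye(4)
T[:3, 3] = [0.0, 0.0, 2.0]                         # real camera 2 m from origin
Tv = virtual_camera(T, n=[0.0, 0.0, 1.0], d=3.0)   # mirror plane z = 3
print(Tv[:3, 3])                                   # virtual center at z = 4
```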

[176] Sprint: Sparse-Dense Residual Fusion for Efficient Diffusion Transformers

Dogyun Park, Moayed Haji-Ali, Yanyu Li, Willi Menapace, Sergey Tulyakov, Hyunwoo J. Kim, Aliaksandr Siarohin, Anil Kag

Main category: cs.CV

TL;DR: SPRINT enables aggressive token dropping (up to 75%) in Diffusion Transformers while preserving quality, achieving 9.8x training savings with comparable performance.

DetailsMotivation: Diffusion Transformers (DiTs) have state-of-the-art generative performance but suffer from quadratic training costs with sequence length, making large-scale pretraining prohibitively expensive. Existing token dropping methods either degrade representations, are parameter-heavy, or fail at high drop ratios.

Method: SPRINT uses sparse-dense residual fusion: early layers process all tokens for local detail, deeper layers operate on sparse subsets to reduce computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train-inference gap.

Result: On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD. At inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality.

Conclusion: SPRINT establishes a simple, effective, and general solution for efficient DiT training, enabling aggressive token dropping while maintaining generative quality.

Abstract: Diffusion Transformers (DiTs) deliver state-of-the-art generative performance but their quadratic training cost with sequence length makes large-scale pretraining prohibitively expensive. Token dropping can reduce training cost, yet naïve strategies degrade representations, and existing methods are either parameter-heavy or fail at high drop ratios. We present SPRINT, Sparse–Dense Residual Fusion for Efficient Diffusion Transformers, a simple method that enables aggressive token dropping (up to 75%) while preserving quality. SPRINT leverages the complementary roles of shallow and deep layers: early layers process all tokens to capture local detail, deeper layers operate on a sparse subset to cut computation, and their outputs are fused through residual connections. Training follows a two-stage schedule: long masked pre-training for efficiency followed by short full-token fine-tuning to close the train–inference gap. On ImageNet-1K 256x256, SPRINT achieves 9.8x training savings with comparable FID/FDD, and at inference, its Path-Drop Guidance (PDG) nearly halves FLOPs while improving quality. These results establish SPRINT as a simple, effective, and general solution for efficient DiT training.
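
A toy PyTorch sketch of sparse-dense residual fusion: a dense shallow layer sees all tokens, a deep layer sees a random keep-subset, and the sparse output is scattered back residually. Blocks and ratios are illustrative, not SPRINT's DiT implementation.

```python
import torch
import torch.nn as nn

class SparseDenseSketch(nn.Module):
    def __init__(self, dim=64, keep_ratio=0.25):
        super().__init__()
        self.shallow = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.deep = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x):                       # x: (B, N, dim) tokens
        x = self.shallow(x)                     # dense pass captures local detail
        B, N, D = x.shape
        k = max(1, int(N * self.keep_ratio))
        idx = torch.rand(B, N).argsort(dim=1)[:, :k]          # random keep set
        kept = x.gather(1, idx.unsqueeze(-1).expand(B, k, D))
        kept = self.deep(kept)                  # sparse pass: 75% of tokens dropped
        out = x.clone()                         # residual fusion: dense + sparse
        out.scatter_add_(1, idx.unsqueeze(-1).expand(B, k, D), kept)
        return out

y = SparseDenseSketch()(torch.randn(2, 256, 64))
print(y.shape)  # torch.Size([2, 256, 64])
```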

[177] CAST-LUT: Tokenizer-Guided HSV Look-Up Tables for Purple Flare Removal

Pu Wang, Shuning Sun, Jialang Lu, Chen Wu, Zhihua Zhang, Youshan Zhang, Chenggang Shan, Dianjie Lu, Guijuan Zhang, Zhuoran Zheng

Main category: cs.CV

TL;DR: A novel HSV-LUT based network for purple flare removal that uses decoupled HSV components and a two-stage architecture with semantic tokenization, achieving state-of-the-art results on a new large-scale dataset.

DetailsMotivation: Purple flare artifacts degrade image quality, existing traditional methods lack flexibility with hand-crafted features, and deep learning is hampered by scarcity of paired training data.

Method: Two-stage network: 1) Chroma-Aware Spectral Tokenizer (CAST) converts RGB to HSV and encodes H/V channels into semantic tokens; 2) HSV-LUT module generates independent 1D-LUT correction curves for H, S, V channels based on tokens.

Result: Model significantly outperforms existing methods in visual effects and achieves state-of-the-art performance on all quantitative metrics, validated on a new large-scale purple flare dataset.

Conclusion: The proposed decoupled HSV-LUT approach effectively addresses purple flare removal by solving color coupling problems, with the new dataset and metrics enabling robust training and evaluation.

Abstract: Purple flare, a diffuse chromatic aberration artifact commonly found around highlight areas, severely degrades the tone transition and color of the image. Existing traditional methods are based on hand-crafted features, which lack flexibility and rely entirely on fixed priors, while the scarcity of paired training data critically hampers deep learning. To address this issue, we propose a novel network built upon decoupled HSV Look-Up Tables (LUTs). The method aims to simplify color correction by adjusting the Hue (H), Saturation (S), and Value (V) components independently. This approach resolves the inherent color coupling problems in traditional methods. Our model adopts a two-stage architecture: First, a Chroma-Aware Spectral Tokenizer (CAST) converts the input image from RGB space to HSV space and independently encodes the Hue (H) and Value (V) channels into a set of semantic tokens describing the Purple flare status; second, the HSV-LUT module takes these tokens as input and dynamically generates independent correction curves (1D-LUTs) for the three channels H, S, and V. To effectively train and validate our model, we built the first large-scale purple flare dataset with diverse scenes. We also proposed new metrics and a loss function specifically designed for this task. Extensive experiments demonstrate that our model not only significantly outperforms existing methods in visual effects but also achieves state-of-the-art performance on all quantitative metrics.
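
Applying a learned 1D LUT to an HSV channel is just linear interpolation over the LUT knots. A generic sketch (not the released CAST-LUT code):

```python
import torch

def apply_1d_lut(channel: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """Apply a 1D LUT to a channel in [0, 1] with linear interpolation."""
    n = lut.numel()
    pos = channel.clamp(0, 1) * (n - 1)
    lo = pos.floor().long().clamp(max=n - 2)
    frac = pos - lo.float()
    return lut[lo] * (1 - frac) + lut[lo + 1] * frac

# An identity LUT leaves a channel unchanged; a learned LUT would bend the
# curve (e.g., desaturating hues where purple flare is detected).
v = torch.rand(4, 128, 128)                     # Value channel of an HSV image
identity = torch.linspace(0, 1, 33)             # 33-knot 1D LUT
print(torch.allclose(apply_1d_lut(v, identity), v, atol=1e-6))  # True
```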

[178] Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang

Main category: cs.CV

TL;DR: RSCoVLM is a vision-language model baseline for remote sensing multi-task learning that addresses RS-specific challenges like diverse image scales and ultra-high-resolution images through unified strategies and achieves state-of-the-art performance across diverse tasks.

DetailsMotivation: Transformers show strong performance on individual RS tasks, but there's a need for unified models that excel across multiple tasks through multi-task learning. Vision-language models with text-based interfaces show promise for MTL in remote sensing, but need to address RS-specific challenges like diverse image scales and computational burdens.

Method: 1) Data curation engine for RS data acquisition, processing, and flexible vision-language conversations; 2) Unified dynamic-resolution strategy for diverse RS image scales; 3) Zoom-in Chain mechanism with LRS-VQA-Zoom dataset for ultra-high-resolution images; 4) Enhanced object detection capability with novel evaluation protocol for fair comparison.

Result: RSCoVLM achieves state-of-the-art performance across diverse RS tasks, outperforming existing RS VLMs and rivaling specialized expert models. All tools, models, and datasets are open-sourced for reproducibility.

Conclusion: RSCoVLM provides a simple yet flexible VLM baseline for RS multi-task learning that effectively addresses RS-specific challenges and demonstrates significant potential for advancing toward general-purpose RS models.

Abstract: With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning, respectively. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses the complex RS data environment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model’s object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

[179] Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang

Main category: cs.CV

TL;DR: PRFL enables efficient preference optimization for video generation by conducting reward feedback learning entirely in latent space instead of pixel space, reducing memory/time costs while improving alignment with human preferences.

DetailsMotivation: Existing video reward models use pixel-space approaches that require expensive VAE decoding, incur high memory/time costs, and only optimize late-stage visual quality rather than fundamental motion dynamics and structural coherence.

Method: Propose Process Reward Feedback Learning (PRFL) that conducts preference optimization entirely in latent space using pre-trained video generation models as reward models, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding.

Result: PRFL significantly improves alignment with human preferences while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

Conclusion: Pre-trained video generation models are naturally suited for reward modeling in noisy latent space, and PRFL provides an efficient framework for aligning video generation with human preferences through latent-space optimization.

Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning (PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.
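
The key point, optimizing a latent-space reward through the full denoising chain without VAE decoding, can be shown with toy modules; everything below is a stand-in for the pretrained video diffusion model and the latent reward model, not the PRFL release.

```python
import torch
import torch.nn as nn

denoiser = nn.Linear(32, 32)                  # toy stand-in: predicts a cleaner latent
latent_reward = nn.Linear(32, 1)              # toy stand-in: scores noisy latents directly
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-3)

z = torch.randn(8, 32)                        # initial noisy video latent
reward_sum = 0.0
for t in range(4):                            # short denoising chain
    z = z - 0.25 * denoiser(z)                # one (toy) denoising step
    reward_sum = reward_sum + latent_reward(z).mean()  # process reward at each step

loss = -reward_sum                            # maximize reward along the chain
loss.backward()                               # gradients flow through every step,
opt.step()                                    # with no VAE decode anywhere
print(float(loss))
```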

[180] A Novel Patch-Based TDA Approach for Computed Tomography

Dashti A. Ali, Aras T. Asaad, Jacob J. Peoples, Mohammad Hamghalam, Alex Robins, Mane Piliposyan, Richard K. G. Do, Natalie Gangai, Yun S. Chun, Ahmad Bashir Barekzai, Jayasree Chakraborty, Hala Khasawneh, Camila Vilela, Natally Horvat, João Miranda, Alice C. Wei, Amber L. Simpson

Main category: cs.CV

TL;DR: A novel patch-based persistent homology approach for CT imaging outperforms traditional 3D cubical complex method in both classification performance and computational efficiency.

DetailsMotivation: Traditional 3D cubical complex filtration for persistent homology in CT imaging has limitations in performance and computational complexity, especially with higher resolution images.

Method: Proposes a patch-based persistent homology construction approach specifically designed for volumetric medical imaging data (CT modality), with comprehensive experiments across multiple datasets.

Result: Patch-based TDA significantly outperforms the cubical complex method, with average improvements of 10.38% accuracy, 6.94% AUC, 2.06% sensitivity, 11.58% specificity, and 8.51% F1 score across all datasets, while being more time-efficient.

Conclusion: The patch-based TDA approach is superior for CT imaging analysis, offering both better classification performance and computational efficiency, with a provided Python package (Patch-TDA) for practical implementation.

Abstract: The development of machine learning (ML) models based on computed tomography (CT) imaging modality has been a major focus of recent research in the medical imaging domain. Incorporating robust feature engineering approach can highly improve the performance of these models. Topological data analysis (TDA), a recent development based on the mathematical field of algebraic topology, mainly focuses on the data from a topological perspective, extracting deeper insight and higher dimensional structures from the data. Persistent homology (PH), a fundamental tool in the area of TDA, can extract topological features such as connected components, cycles and voids from the data. A popular approach to construct PH from 3D CT images is to utilize the 3D cubical complex filtration, a method adapted for grid-structured data. However, this approach may not always yield the best performance and can suffer from computational complexity with higher resolution CT images. This study introduces a novel patch-based PH construction approach tailored for volumetric medical imaging data, in particular the CT modality. A wide range of experiments has been conducted on several datasets of 3D CT images to comprehensively analyze the performance of the proposed method with various parameters and benchmark it against the 3D cubical complex algorithm. Our results highlight the dominance of the patch-based TDA approach in terms of both classification performance and time-efficiency. The proposed approach outperformed the cubical complex method, achieving average improvement of 10.38%, 6.94%, 2.06%, 11.58%, and 8.51% in accuracy, AUC, sensitivity, specificity, and F1 score, respectively, across all datasets. Finally, we provide a convenient Python package, Patch-TDA, to facilitate the utilization of the proposed approach.
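
A hedged sketch of the patch-based construction: tile the volume, compute persistence per patch with a cubical complex, and summarize lifetimes. It assumes GUDHI's CubicalComplex accepts a NumPy array of top-dimensional cells (true in recent releases); the paper's feature vectorization is richer than this.

```python
import numpy as np
import gudhi  # pip install gudhi; API usage here is an assumption about recent releases

def patch_persistence_features(volume: np.ndarray, patch: int = 16):
    """Compute 0-dim persistence per 3D patch and summarize bar lifetimes."""
    feats = []
    d, h, w = volume.shape
    for z in range(0, d - patch + 1, patch):
        for y in range(0, h - patch + 1, patch):
            for x in range(0, w - patch + 1, patch):
                cube = volume[z:z+patch, y:y+patch, x:x+patch]
                cc = gudhi.CubicalComplex(top_dimensional_cells=cube)
                pairs = cc.persistence()           # [(dim, (birth, death)), ...]
                lifetimes = [dth - b for dim, (b, dth) in pairs
                             if dim == 0 and np.isfinite(dth)]
                feats.append([len(lifetimes), float(np.sum(lifetimes))])
    return np.asarray(feats)

vol = np.random.rand(32, 32, 32).astype(np.float32)
print(patch_persistence_features(vol).shape)       # (8, 2): one row per patch
```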

[181] PixelArena: A benchmark for Pixel-Precision Visual Intelligence

Feng Liang, Sizhe Cheng, Chenqi Yi, Yong Wang

Main category: cs.CV

TL;DR: PixelArena introduces a semantic segmentation-based benchmark to objectively evaluate fine-grained image generation capabilities of omni-modal models, finding Gemini 3 Pro Image shows emergent zero-shot mask generation with high fidelity.

DetailsMotivation: Current image generation benchmarks focus too much on aesthetics rather than fine-grained capabilities, failing to objectively evaluate visual intelligence of emerging omni-modal models with multimodal input/output.

Method: Proposes PixelArena benchmark using semantic segmentation tasks to examine fine-grained generative intelligence with pixel precision, evaluating models’ ability to generate semantic masks in zero-shot settings.

Result: Gemini 3 Pro Image demonstrates emergent image generation capabilities, generating semantic masks with high fidelity under zero-shot conditions, showing unprecedented visual intelligence and true generalization in new image generation tasks.

Conclusion: The findings signal exciting progress in omni-modal models and provide insights for future research in dataset development, model development, and metric design for evaluating visual intelligence.

Abstract: Omni-modal models that have multimodal input and output are emerging. However, benchmarking their multimodal generation, especially in image generation, is challenging due to the subtleties of human preferences and model biases. Many image generation benchmarks focus on aesthetics instead of the fine-grained generation capabilities of these models, failing to evaluate their visual intelligence with objective metrics. In PixelArena, we propose using semantic segmentation tasks to objectively examine their fine-grained generative intelligence with pixel precision. With our benchmark and experiments, we find the latest Gemini 3 Pro Image has emergent image generation capabilities that generate semantic masks with high fidelity under zero-shot settings, showcasing visual intelligence unseen before and true generalization in new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. The findings not only signal exciting progress in the field but also provide insights into future research related to dataset development, omni-modal model development, and the design of metrics.
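
Scoring a generated mask against ground truth comes down to the standard mean-IoU computation, for example:

```python
import numpy as np

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU between a generated semantic mask and ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                      # skip classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.zeros((64, 64), int)
gt[16:48, 16:48] = 1                       # ground-truth object
pred = np.zeros((64, 64), int)
pred[16:48, 24:56] = 1                     # generated mask, shifted right
print(round(miou(pred, gt, num_classes=2), 3))  # 0.723
```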

[182] Pyramidal Adaptive Cross-Gating for Multimodal Detection

Zidong Gu, Shoufu Tian

Main category: cs.CV

TL;DR: PACGNet introduces a pyramidal adaptive cross-gating network for aerial object detection that addresses cross-modal noise and preserves feature pyramid hierarchy through symmetrical cross-gating and pyramidal feature-aware multimodal gating modules.

DetailsMotivation: Existing multimodal fusion methods for aerial object detection suffer from two critical flaws: they are prone to cross-modal noise and disrupt the hierarchical structure of feature pyramids, which impairs fine-grained detection of small objects in UAV reconnaissance applications.

Method: Proposes PACGNet with two core components: 1) Symmetrical Cross-Gating (SCG) module that uses bidirectional symmetrical gating to selectively absorb complementary information while suppressing noise and preserving semantic integrity; 2) Pyramidal Feature-aware Multimodal Gating (PFMG) module that reconstructs feature hierarchy via progressive hierarchical gating, using detailed features from higher-resolution levels to guide fusion at lower-resolution levels.

Result: Achieves state-of-the-art performance on DroneVehicle and VEDAI datasets with mAP50 scores of 82.2% and 82.1% respectively.

Conclusion: PACGNet effectively addresses cross-modal noise and feature pyramid disruption in aerial object detection through deep fusion within the backbone, setting new benchmarks for multimodal object detection in UAV applications.

Abstract: Object detection in aerial imagery is a critical task in applications such as UAV reconnaissance. Although existing methods have extensively explored feature interaction between different modalities, they commonly rely on simple fusion strategies for feature aggregation. This introduces two critical flaws: it is prone to cross-modal noise and disrupts the hierarchical structure of the feature pyramid, thereby impairing the fine-grained detection of small objects. To address this challenge, we propose the Pyramidal Adaptive Cross-Gating Network (PACGNet), an architecture designed to perform deep fusion within the backbone. To this end, we design two core components: the Symmetrical Cross-Gating (SCG) module and the Pyramidal Feature-aware Multimodal Gating (PFMG) module. The SCG module employs a bidirectional, symmetrical “horizontal” gating mechanism to selectively absorb complementary information, suppress noise, and preserve the semantic integrity of each modality. The PFMG module reconstructs the feature hierarchy via a progressive hierarchical gating mechanism. This leverages the detailed features from a preceding, higher-resolution level to guide the fusion at the current, lower-resolution level, effectively preserving fine-grained details as features propagate. Through evaluations conducted on the DroneVehicle and VEDAI datasets, our PACGNet sets a new state-of-the-art benchmark, with mAP50 scores reaching 82.2% and 82.1% respectively.
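
A minimal PyTorch sketch of bidirectional symmetrical cross-gating; channel counts and exact operations are illustrative, not PACGNet's released SCG module.

```python
import torch
import torch.nn as nn

class SymmetricalCrossGating(nn.Module):
    def __init__(self, c: int):
        super().__init__()
        self.gate_rgb = nn.Conv2d(2 * c, c, kernel_size=1)
        self.gate_ir = nn.Conv2d(2 * c, c, kernel_size=1)

    def forward(self, rgb, ir):
        joint = torch.cat([rgb, ir], dim=1)
        # Each branch absorbs the other modality through its own sigmoid gate,
        # keeping its own semantics while suppressing cross-modal noise.
        rgb_out = rgb + torch.sigmoid(self.gate_rgb(joint)) * ir
        ir_out = ir + torch.sigmoid(self.gate_ir(joint)) * rgb
        return rgb_out, ir_out

rgb, ir = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
a, b = SymmetricalCrossGating(64)(rgb, ir)
print(a.shape, b.shape)  # both torch.Size([2, 64, 32, 32])
```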

[183] Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism

Siyu Zhang, Lianlei Shan, Runhe Qiu

Main category: cs.CV

TL;DR: Proposes a VLM framework with Dynamic Resolution Input Strategy and Multi-scale Vision-language Alignment Mechanism for multimodal remote sensing fusion, improving accuracy and efficiency in image captioning and cross-modal retrieval.

DetailsMotivation: Existing methods have limitations: fixed resolutions fail to balance efficiency and detail, and single-scale alignment lacks semantic hierarchy, hindering accurate surface information extraction from remote sensing images.

Method: Vision-language Model framework with two innovations: 1) Dynamic Resolution Input Strategy (DRIS) using coarse-to-fine approach to adaptively allocate computational resources, 2) Multi-scale Vision-language Alignment Mechanism (MS-VLAM) with three-tier alignment (object, local-region, global levels) to capture cross-modal semantic consistency.

Result: Experimental results on the RS-GPT4V dataset show significant improvements in semantic understanding accuracy and computational efficiency, with superior performance over conventional methods in BLEU-4 and CIDEr for image captioning and in R@10 for cross-modal retrieval.

Conclusion: The framework provides a novel approach for efficient and robust multimodal remote sensing systems, offering theoretical foundation and technical guidance for intelligent remote sensing interpretation engineering applications.

Abstract: Multimodal fusion of remote sensing images serves as a core technology for overcoming the limitations of single-source data and improving the accuracy of surface information extraction, which exhibits significant application value in fields such as environmental monitoring and urban planning. To address the deficiencies of existing methods, including the failure of fixed resolutions to balance efficiency and detail, as well as the lack of semantic hierarchy in single-scale alignment, this study proposes a Vision-language Model (VLM) framework integrated with two key innovations: the Dynamic Resolution Input Strategy (DRIS) and the Multi-scale Vision-language Alignment Mechanism (MS-VLAM). Specifically, the DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content, thereby preserving key fine-grained features while reducing redundant computational overhead. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels, which systematically captures cross-modal semantic consistency and alleviates issues of semantic misalignment and granularity imbalance. Experimental results on the RS-GPT4V dataset demonstrate that the proposed framework significantly improves the accuracy of semantic understanding and computational efficiency in tasks including image captioning and cross-modal retrieval. Compared with conventional methods, it achieves superior performance in evaluation metrics such as BLEU-4 and CIDEr for image captioning, as well as R@10 for cross-modal retrieval. This technical framework provides a novel approach for constructing efficient and robust multimodal remote sensing systems, laying a theoretical foundation and offering technical guidance for the engineering application of intelligent remote sensing interpretation.

[184] RxnBench: A Multimodal Benchmark for Evaluating Large Language Models on Chemical Reaction Understanding from Scientific Literature

Hanzheng Li, Xi Fang, Yixuan Li, Chaozheng Huang, Junjie Wang, Xi Wang, Hongzhe Bai, Bojun Hao, Shenyu Lin, Huiqi Liang, Linfeng Zhang, Guolin Ke

Main category: cs.CV

TL;DR: RxnBench is a new benchmark for evaluating Multimodal Large Language Models on chemical reaction understanding from scientific PDFs, revealing significant gaps in models’ ability to comprehend chemical logic and integrate cross-modal information.

DetailsMotivation: While MLLMs show promise for revolutionizing chemistry, their ability to understand the dense graphical language of chemical reactions in real scientific literature remains underexplored and needs rigorous evaluation.

Method: Created RxnBench with two tasks: Single-Figure QA (1,525 questions from 305 reaction schemes) testing visual perception and mechanistic reasoning, and Full-Document QA (108 articles) requiring cross-modal integration of text, schemes, and tables.

Result: MLLMs show critical capability gaps: they excel at text extraction but struggle with deep chemical logic and precise structural recognition. Models with inference-time reasoning perform better but none achieve 50% accuracy on Full-Document QA.

Conclusion: There’s an urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists, as current MLLMs cannot adequately handle complex chemical reaction understanding from scientific literature.

Abstract: The integration of Multimodal Large Language Models (MLLMs) into chemistry promises to revolutionize scientific discovery, yet their ability to comprehend the dense, graphical language of reactions within authentic literature remains underexplored. Here, we introduce RxnBench, a multi-tiered benchmark designed to rigorously evaluate MLLMs on chemical reaction understanding from scientific PDFs. RxnBench comprises two tasks: Single-Figure QA (SF-QA), which tests fine-grained visual perception and mechanistic reasoning using 1,525 questions derived from 305 curated reaction schemes, and Full-Document QA (FD-QA), which challenges models to synthesize information from 108 articles, requiring cross-modal integration of text, schemes, and tables. Our evaluation of MLLMs reveals a critical capability gap: while models excel at extracting explicit text, they struggle with deep chemical logic and precise structural recognition. Notably, models with inference-time reasoning significantly outperform standard architectures, yet none achieve 50% accuracy on FD-QA. These findings underscore the urgent need for domain-specific visual encoders and stronger reasoning engines to advance autonomous AI chemists.

[185] Language as Prior, Vision as Calibration: Metric Scale Recovery for Monocular Depth Estimation

Mingxia Zhan, Li Zhang, Beibei Wang, Yingjie Wang, Zenglin Shi

Main category: cs.CV

TL;DR: A method that uses language cues to predict uncertainty-aware calibration envelopes for recovering metric depth from relative-depth models, training only lightweight calibration heads while keeping backbones frozen.

DetailsMotivation: Monocular metric depth estimation remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity, even though relative-depth foundation models transfer well. Captions provide coarse but noisy scale cues that vary with phrasing and missing objects.

Method: Under frozen-backbone calibration setting, recover metric depth via image-specific affine transform in inverse depth. Use language to predict uncertainty-aware envelope bounding feasible calibration parameters rather than text-only point estimate. Use pooled multi-scale frozen visual features to select image-specific calibration within envelope. Train with closed-form least-squares oracle providing per-image supervision.

Result: Experiments on NYUv2 and KITTI show improved in-domain accuracy. Zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.

Conclusion: The proposed method effectively leverages language cues while accounting for their uncertainty, enabling robust metric depth recovery from relative-depth models with minimal training overhead.

Abstract: Relative-depth foundation models transfer well, yet monocular metric depth remains ill-posed due to unidentifiable global scale and heightened domain-shift sensitivity. Under a frozen-backbone calibration setting, we recover metric depth via an image-specific affine transform in inverse depth and train only lightweight calibration heads while keeping the relative-depth backbone and the CLIP text encoder fixed. Since captions provide coarse but noisy scale cues that vary with phrasing and missing objects, we use language to predict an uncertainty-aware envelope that bounds feasible calibration parameters in an unconstrained space, rather than committing to a text-only point estimate. We then use pooled multi-scale frozen visual features to select an image-specific calibration within this envelope. During training, a closed-form least-squares oracle in inverse depth provides per-image supervision for learning the envelope and the selected calibration. Experiments on NYUv2 and KITTI improve in-domain accuracy, while zero-shot transfer to SUN-RGBD and DDAD demonstrates improved robustness over strong language-only baselines.
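
The closed-form least-squares oracle is a two-parameter linear fit in inverse depth. A NumPy sketch of this standard calibration, matching the per-image supervision described above:

```python
import numpy as np

def affine_oracle(d_rel: np.ndarray, d_gt: np.ndarray):
    """Closed-form least-squares (scale, shift) in inverse depth:
    minimize || s * d_rel + t - 1/d_gt ||^2 over valid pixels."""
    x = d_rel.ravel()
    y = 1.0 / d_gt.ravel()                       # target inverse depth
    A = np.stack([x, np.ones_like(x)], axis=1)   # design matrix [d_rel, 1]
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s, t

# Synthetic check: relative inverse depth is an affine transform of the truth.
rng = np.random.default_rng(0)
d_gt = rng.uniform(1.0, 10.0, size=(64, 64))     # metric depth in meters
d_rel = 0.5 * (1.0 / d_gt) + 0.1                 # unknown scale/shift applied
s, t = affine_oracle(d_rel, d_gt)
print(round(s, 3), round(t, 3))                  # recovers 2.0 and -0.2
```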

[186] Higher-Order Domain Generalization in Magnetic Resonance-Based Assessment of Alzheimer’s Disease

Zobia Batool, Diala Lteif, Vijaya B. Kolachalama, Huseyin Ozkan, Erchan Aptoula

Main category: cs.CV

TL;DR: Extended MixStyle (EM) improves Alzheimer’s disease classification across different MRI datasets by blending higher-order feature moments to handle domain shifts.

DetailsMotivation: Existing deep learning models for AD diagnosis using sMRI often fail to generalize to new cohorts due to domain shifts from different scanners, protocols, and patient demographics. Single-domain generalization is crucial but underexplored for fragmented AD datasets.

Method: Extended MixStyle (EM) framework that blends higher-order feature moments (skewness and kurtosis) to mimic diverse distributional variations, enabling better domain generalization. Trained on NACC dataset (n=4,647) to differentiate normal cognition from MCI/AD.
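
As a rough sketch of the idea, assuming a MixStyle-like setup on (B, C, H, W) feature maps: per-channel moments of paired samples are mixed with a Beta-sampled coefficient. How EM re-injects the mixed skewness/kurtosis into the features follows the paper; only the moment computation and the standard mean/std re-normalization are shown here.

```python
import torch

def channel_moments(x):                        # x: (B, C, H, W) feature maps
    mu = x.mean(dim=(2, 3), keepdim=True)
    sig = x.std(dim=(2, 3), keepdim=True) + 1e-6
    z = (x - mu) / sig
    skew = z.pow(3).mean(dim=(2, 3), keepdim=True)   # 3rd standardized moment
    kurt = z.pow(4).mean(dim=(2, 3), keepdim=True)   # 4th standardized moment
    return mu, sig, skew, kurt

def extended_mixstyle(x, alpha=0.1):
    B = x.size(0)
    perm = torch.randperm(B)                   # pair each sample with another
    lam = torch.distributions.Beta(alpha, alpha).sample((B, 1, 1, 1))
    mu1, sig1, skew1, kurt1 = channel_moments(x)
    mu2, sig2, skew2, kurt2 = channel_moments(x[perm])
    mu_mix = lam * mu1 + (1 - lam) * mu2       # mixed 1st and 2nd moments,
    sig_mix = lam * sig1 + (1 - lam) * sig2    # re-injected as in MixStyle
    skew_mix = lam * skew1 + (1 - lam) * skew2 # mixed higher-order moments;
    kurt_mix = lam * kurt1 + (1 - lam) * kurt2 # EM's re-injection follows the paper
    return ((x - mu1) / sig1) * sig_mix + mu_mix, (skew_mix, kurt_mix)

mixed, _ = extended_mixstyle(torch.randn(8, 16, 14, 14))
print(mixed.shape)                             # torch.Size([8, 16, 14, 14])
```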

Result: EM improves cross-domain performance by 2.4 percentage points in macro-F1 over state-of-the-art SDG benchmarks when tested on three unseen cohorts (total n=3,126).

Conclusion: EM shows promise for invariant, reliable AD detection in heterogeneous real-world settings by effectively handling domain shifts through higher-order feature moment blending.

Abstract: Despite progress in deep learning for Alzheimer’s disease (AD) diagnostics, models trained on structural magnetic resonance imaging (sMRI) often do not perform well when applied to new cohorts due to domain shifts from varying scanners, protocols and patient demographics. AD, the primary driver of dementia, manifests through progressive cognitive and neuroanatomical changes like atrophy and ventricular expansion, making robust, generalizable classification essential for real-world use. While convolutional neural networks and transformers have advanced feature extraction via attention and fusion techniques, single-domain generalization (SDG) remains underexplored yet critical, given the fragmented nature of AD datasets. To bridge this gap, we introduce Extended MixStyle (EM), a framework for blending higher-order feature moments (skewness and kurtosis) to mimic diverse distributional variations. Trained on sMRI data from the National Alzheimer’s Coordinating Center (NACC; n=4,647) to differentiate persons with normal cognition (NC) from those with mild cognitive impairment (MCI) or AD and tested on three unseen cohorts (total n=3,126), EM yields enhanced cross-domain performance, improving macro-F1 on average by 2.4 percentage points over state-of-the-art SDG benchmarks, underscoring its promise for invariant, reliable AD detection in heterogeneous real-world settings. The source code will be made available upon acceptance at https://github.com/zobia111/Extended-Mixstyle.

[187] Causality-Aware Temporal Projection for Video Understanding in Video-LLMs

Zhengjian Kang, Qi Chen, Rui Liu, Kangtong Mo, Xingyu Zhang, Xiaoyu Deng, Ye Zhang

Main category: cs.CV

TL;DR: V-CORE is a parameter-efficient Video-LLM framework that introduces explicit temporal ordering constraints to improve video understanding, achieving strong performance on challenging temporal and causal reasoning tasks.

DetailsMotivation: Current Video-LLMs struggle with tasks requiring consistent temporal ordering and causal coherence due to unconstrained bidirectional projectors that blur temporal relationships by allowing later frames to influence earlier representations without respecting the directional nature of video reasoning.

Method: V-CORE introduces two key components: 1) Learnable Spatial Aggregation (LSA) to adaptively select salient spatial tokens and reduce redundancy, and 2) Causality-Aware Temporal Projector (CATP) that enforces structured unidirectional information flow using block-causal attention and a terminal dynamic summary token acting as a causal sink.
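
A minimal sketch of what a block-causal mask of this kind could look like, assuming T frames of K spatial tokens plus one terminal summary token; the exact CATP layout is the paper's.

```python
import torch

def block_causal_mask(T: int, K: int) -> torch.Tensor:
    """True = attention permitted; T frames of K tokens + 1 summary token."""
    n = T * K + 1
    frame_id = torch.arange(T).repeat_interleave(K)
    # A query token may attend to keys in its own or any earlier frame.
    allowed = frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[: T * K, : T * K] = allowed
    mask[-1, :] = True            # terminal summary token attends to everything
    return mask

print(block_causal_mask(T=3, K=2).int())
```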

Result: V-CORE achieves 61.2% accuracy on the challenging NExT-QA benchmark and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with significant gains in temporal (+3.5%) and causal reasoning (+5.2%) subcategories, validating the importance of explicit temporal ordering constraints.

Conclusion: The proposed V-CORE framework demonstrates that explicit architectural mechanisms for temporal ordering are crucial for video understanding, enabling parameter-efficient training while significantly improving performance on temporal and causal reasoning tasks that require directional video reasoning.

Abstract: Recent Video Large Language Models (Video-LLMs) have shown strong multimodal reasoning capabilities, yet remain challenged by video understanding tasks that require consistent temporal ordering and causal coherence. Many parameter-efficient Video-LLMs rely on unconstrained bidirectional projectors to model inter-frame interactions, which can blur temporal ordering by allowing later frames to influence earlier representations, without explicit architectural mechanisms to respect the directional nature of video reasoning. To address this limitation, we propose V-CORE, a parameter-efficient framework that introduces explicit temporal ordering constraints for video understanding. V-CORE consists of two key components: (1) Learnable Spatial Aggregation (LSA), which adaptively selects salient spatial tokens to reduce redundancy, and (2) a Causality-Aware Temporal Projector (CATP), which enforces structured unidirectional information flow via block-causal attention and a terminal dynamic summary token acting as a causal sink. This design preserves intra-frame spatial interactions while ensuring that temporal information is aggregated in a strictly ordered manner. With 4-bit QLoRA and a frozen LLM backbone, V-CORE can be trained efficiently on a single consumer GPU. Experiments show that V-CORE achieves strong performance on the challenging NExT-QA benchmark, reaching 61.2% accuracy, and remains competitive across MSVD-QA, MSRVTT-QA, and TGIF-QA, with gains concentrated in temporal and causal reasoning subcategories (+3.5% and +5.2% respectively), directly validating the importance of explicit temporal ordering constraints.

[188] 360DVO: Deep Visual Odometry for Monocular 360-Degree Camera

Xiaopeng Guo, Yinzhe Xu, Huajian Huang, Sai-Kit Yeung

Main category: cs.CV

TL;DR: 360DVO is the first deep learning-based monocular omnidirectional visual odometry framework that uses a distortion-aware spherical feature extractor and omnidirectional differentiable bundle adjustment to achieve robust pose estimation from 360-degree images.

DetailsMotivation: Existing omnidirectional visual odometry methods rely on handcrafted features or photometric objectives, which lack robustness in challenging scenarios like aggressive motion and varying illumination. There's a need for more robust learning-based approaches for 360-degree camera systems.

Method: Introduces 360DVO with two key components: 1) DAS-Feat (distortion-aware spherical feature extractor) that adaptively learns distortion-resistant features from 360-degree images, and 2) ODBA (omnidirectional differentiable bundle adjustment) module that uses sparse feature patches to establish constraints for effective pose estimation.
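
For background, omnidirectional modules like ODBA operate on bearing vectors on the unit sphere rather than pinhole rays; a small sketch of the standard equirectangular-pixel-to-sphere mapping (the sign and axis conventions here are assumptions, not quoted from the paper).

```python
import numpy as np

def pixel_to_bearing(u, v, W, H):
    lon = (u + 0.5) / W * 2 * np.pi - np.pi    # longitude from image column
    lat = np.pi / 2 - (v + 0.5) / H * np.pi    # latitude from image row
    return np.stack([np.cos(lat) * np.sin(lon),
                     -np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

u, v = np.meshgrid(np.arange(8), np.arange(4))
rays = pixel_to_bearing(u, v, W=8, H=4)
print(np.allclose(np.linalg.norm(rays, axis=-1), 1.0))   # unit-norm bearings
```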

Result: Extensive experiments on a new real-world OVO benchmark and public synthetic datasets (TartanAir V2 and 360VO) show that 360DVO surpasses state-of-the-art baselines (360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%.

Conclusion: 360DVO represents the first successful deep learning-based approach to omnidirectional visual odometry, demonstrating significant improvements in both robustness and accuracy over existing methods, with potential applications in robotics, autonomous systems, and VR/AR.

Abstract: Monocular omnidirectional visual odometry (OVO) systems leverage 360-degree cameras to overcome field-of-view limitations of perspective VO systems. However, existing methods, reliant on handcrafted features or photometric objectives, often lack robustness in challenging scenarios, such as aggressive motion and varying illumination. To address this, we present 360DVO, the first deep learning-based OVO framework. Our approach introduces a distortion-aware spherical feature extractor (DAS-Feat) that adaptively learns distortion-resistant features from 360-degree images. These sparse feature patches are then used to establish constraints for effective pose estimation within a novel omnidirectional differentiable bundle adjustment (ODBA) module. To facilitate evaluation in realistic settings, we also contribute a new real-world OVO benchmark. Extensive experiments on this benchmark and public synthetic datasets (TartanAir V2 and 360VO) demonstrate that 360DVO surpasses state-of-the-art baselines (including 360VO and OpenVSLAM), improving robustness by 50% and accuracy by 37.5%. Homepage: https://chris1004336379.github.io/360DVO-homepage

[189] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Hengjia Li, Liming Jiang, Qing Yan, Yizhi Song, Hao Kang, Zichuan Liu, Xin Lu, Boxi Wu, Deng Cai

Main category: cs.CV

TL;DR: ThinkRL-Edit: A reasoning-centric RL framework for image editing that decouples visual reasoning from synthesis, expands reasoning exploration beyond denoising, and improves reward mechanisms for better instruction-faithful edits.

DetailsMotivation: Current instruction-driven image editing models have limited visual reasoning capabilities, leading to suboptimal performance on reasoning-centric edits. Existing RL approaches face challenges with limited reasoning exploration, biased reward fusion, and unstable VLM-based instruction rewards.

Method: Proposes ThinkRL-Edit framework with three key innovations: (1) Chain-of-Thought-based reasoning sampling with planning and reflection stages before generation, (2) unbiased chain preference grouping strategy across reward dimensions instead of weighted aggregation, and (3) binary checklist rewards replacing interval-based VLM scores for more precise and interpretable feedback.
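
A hedged sketch of the binary-checklist reward: each edit is graded by yes/no questions and the reward is the fraction passed. `ask_judge` is a hypothetical placeholder for the VLM judge call, not the paper's API.

```python
def ask_judge(image, question: str) -> bool:
    return True            # placeholder; a real system queries a VLM judge here

def checklist_reward(image, checklist: list[str]) -> float:
    verdicts = [ask_judge(image, q) for q in checklist]
    return sum(verdicts) / len(verdicts)    # low-variance binary aggregate

checklist = [
    "Is the instructed object removed?",
    "Is the rest of the scene unchanged?",
    "Is the lighting physically consistent?",
]
print(checklist_reward(None, checklist))    # 1.0 with the placeholder judge
```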

Result: The method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.

Conclusion: ThinkRL-Edit successfully addresses key challenges in RL-based image editing by decoupling reasoning from synthesis, expanding exploration beyond denoising stochasticity, and improving reward mechanisms, leading to superior performance on complex reasoning tasks.

Abstract: Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To the end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.

[190] TRec: Learning Hand-Object Interactions through 2D Point Track Motion

Dennis Holzmann, Sven Wachsmuth

Main category: cs.CV

TL;DR: The paper introduces a novel hand-object action recognition method that uses 2D point tracks as motion cues, achieving improved accuracy without explicit hand/object detection.

DetailsMotivation: Most existing methods rely on RGB appearance, human pose estimation, or their combination, but there's potential for improvement by incorporating motion information through point tracking.

Method: Uses CoTracker to follow randomly initialized points through video frames, then feeds the resulting trajectories and image frames into a Transformer-based recognition model. No explicit hand/object detection is performed.
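
A minimal sketch of the track-as-token idea: trajectories of shape (points, frames, 2) become one token per point for a Transformer encoder. The CoTracker call is abstracted away (a random tensor stands in for its output), and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

N, T = 64, 16                                  # tracked points, video frames
tracks = torch.rand(1, N, T, 2)                # stand-in for CoTracker (x, y) output

to_token = nn.Linear(T * 2, 256)               # one token per point trajectory
tokens = to_token(tracks.flatten(2))           # (1, N, 256)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
motion_feats = encoder(tokens)                 # (1, N, 256)
pooled = motion_feats.mean(dim=1)              # (1, 256), fed to an action head
print(pooled.shape)
```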

Result: The method achieves notable performance gains, even with only initial frame and point tracks (without full video sequence). Integrating 2D point tracks consistently enhances performance compared to models without motion information.

Conclusion: 2D point tracks serve as a lightweight yet effective representation for hand-object action understanding, demonstrating their potential as an additional motion cue for action recognition tasks.

Abstract: We present a novel approach for hand-object action recognition that leverages 2D point tracks as an additional motion cue. While most existing methods rely on RGB appearance, human pose estimation, or their combination, our work demonstrates that tracking randomly sampled image points across video frames can substantially improve recognition accuracy. Unlike prior approaches, we do not detect hands, objects, or interaction regions. Instead, we employ CoTracker to follow a set of randomly initialized points through each video and use the resulting trajectories, together with the corresponding image frames, as input to a Transformer-based recognition model. Surprisingly, our method achieves notable gains even when only the initial frame and the point tracks are provided, without incorporating the full video sequence. Experimental results confirm that integrating 2D point tracks consistently enhances performance compared to the same model trained without motion information, highlighting their potential as a lightweight yet effective representation for hand-object action understanding.

[191] From Preoperative CT to Postmastoidectomy Mesh Construction: Mastoidectomy Shape Prediction for Cochlear Implant Surgery

Yike Zhang, Eduardo Davalos, Dingjie Su, Ange Lou, Jack Noble

Main category: cs.CV

TL;DR: A hybrid self-supervised and weakly-supervised learning framework predicts mastoidectomy shape from preoperative CT scans without human annotations, achieving 0.72 Dice score for cochlear implant surgical planning.

DetailsMotivation: Accurate mastoidectomy shape prediction from preoperative imaging improves cochlear implant surgical planning, reduces risks, and enhances outcomes, but limited deep learning studies exist due to challenges in acquiring ground-truth labels.

Method: Proposes a hybrid self-supervised and weakly-supervised learning framework that predicts mastoidectomy region directly from preoperative CT scans without human annotations, using 3D T-distribution loss in weakly-supervised medical imaging.
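
For reference, the Dice score reported below measures overlap between predicted and ground-truth binary masks; a quick NumPy version:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

a = np.zeros((4, 4)); a[1:3, 1:3] = 1
b = np.zeros((4, 4)); b[1:4, 1:4] = 1
print(round(dice(a, b), 3))   # 0.615 for this toy pair
```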

Result: Achieves mean Dice score of 0.72 for predicting complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance.

Conclusion: First work integrating self-supervised and weakly-supervised learning for mastoidectomy shape prediction, providing a robust and efficient solution for cochlear implant surgical planning and laying the groundwork for constructing 3D postmastoidectomy surfaces.

Abstract: Cochlear Implant (CI) surgery treats severe hearing loss by inserting an electrode array into the cochlea to stimulate the auditory nerve. An important step in this procedure is mastoidectomy, which removes part of the mastoid region of the temporal bone to provide surgical access. Accurate mastoidectomy shape prediction from preoperative imaging improves pre-surgical planning, reduces risks, and enhances surgical outcomes. Despite its importance, there are limited deep-learning-based studies regarding this topic due to the challenges of acquiring ground-truth labels. We address this gap by investigating self-supervised and weakly-supervised learning models to predict the mastoidectomy region without human annotations. We propose a hybrid self-supervised and weakly-supervised learning framework to predict the mastoidectomy region directly from preoperative CT scans, where the mastoid remains intact. Our hybrid method achieves a mean Dice score of 0.72 when predicting the complex and boundary-less mastoidectomy shape, surpassing state-of-the-art approaches and demonstrating strong performance. The method provides groundwork for constructing 3D postmastoidectomy surfaces directly from the corresponding preoperative CT scans. To our knowledge, this is the first work to integrate self-supervised and weakly-supervised learning for mastoidectomy shape prediction, offering a robust and efficient solution for CI surgical planning while leveraging 3D T-distribution loss in weakly-supervised medical imaging.

[192] SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models

Oriol Rabasseda, Zenjie Li, Kamal Nasrollahi, Sergio Escalera

Main category: cs.CV

TL;DR: SOVABench is a new surveillance video retrieval benchmark focusing on vehicle actions, with two evaluation protocols. The paper also introduces a training-free MLLM framework that generates interpretable embeddings and achieves strong performance on this benchmark.

DetailsMotivation: Existing video retrieval benchmarks focus on scene-level similarity but lack evaluation of action discrimination needed for surveillance applications. There's a gap in benchmarks that assess cross-action discrimination and temporal direction understanding for vehicle-related actions in surveillance footage.

Method: 1) Created SOVABench benchmark from real surveillance footage with vehicle actions; 2) Defined two evaluation protocols (inter-pair for cross-action discrimination, intra-pair for temporal direction); 3) Developed training-free framework using MLLMs to generate interpretable embeddings from MLLM-generated descriptions for images/videos.
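
A hedged sketch of the training-free retrieval pipeline: an MLLM writes a description per clip and a text embedder turns descriptions into vectors ranked by cosine similarity. `describe` and `embed` are hypothetical placeholders for the MLLM and text-encoder calls.

```python
import numpy as np

def describe(video) -> str:            # placeholder for the MLLM description call
    return f"a truck reverses into the loading bay of {video}"

def embed(text: str) -> np.ndarray:    # placeholder for a sentence embedder
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(384)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gallery = {vid: embed(describe(vid)) for vid in ["clip_a", "clip_b"]}
query = embed("a vehicle backing up")
ranked = sorted(gallery, key=lambda v: -cosine(query, gallery[v]))
print(ranked)
```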

Result: 1) SOVABench reveals that action discrimination remains challenging for state-of-the-art vision/multimodal models despite being intuitive for humans; 2) The MLLM framework achieves strong performance on SOVABench and other spatial/counting benchmarks where contrastive Vision-Language Models often fail.

Conclusion: SOVABench addresses the gap in surveillance-focused video retrieval evaluation, and the proposed MLLM framework provides an effective training-free approach for interpretable embeddings that performs well on action discrimination tasks. The benchmark and framework are publicly available.

Abstract: Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.

[193] CoV: Chain-of-View Prompting for Spatial Reasoning

Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang

Main category: cs.CV

TL;DR: Chain-of-View (CoV) prompting enables vision-language models to actively explore 3D environments for embodied question answering by selecting relevant viewpoints and adjusting views through iterative reasoning.

DetailsMotivation: Current VLMs are limited to fixed input views, hindering their ability to gather distributed context and perform complex spatial reasoning in 3D environments for embodied question answering tasks.

Method: CoV uses a two-stage process: 1) View Selection agent filters redundant frames and identifies question-aligned anchor views, 2) Fine-grained view adjustment through iterative reasoning with discrete camera actions to gather sufficient context from the 3D scene representation.
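
A skeleton of the coarse-to-fine loop, with hypothetical placeholders (`select_anchor_views`, `vlm_reason`, `render_view`) standing in for the prompted VLM and the 3D scene renderer.

```python
def select_anchor_views(question, frames):
    return frames[:2]                      # placeholder: keep two anchor frames

def vlm_reason(question, views):
    done = len(views) >= 4                 # placeholder sufficiency test
    return {"sufficient": done, "answer": "on the table",
            "camera_action": "rotate_left"}

def render_view(scene, action):
    return f"view_after_{action}"          # placeholder rendered observation

def chain_of_view(question, frames, scene, budget=8):
    views = list(select_anchor_views(question, frames))   # stage 1: anchors
    for _ in range(budget):                               # stage 2: refinement
        step = vlm_reason(question, views)
        if step["sufficient"]:                            # enough context gathered
            return step["answer"]
        views.append(render_view(scene, step["camera_action"]))
    return vlm_reason(question, views)["answer"]          # fall back at budget

print(chain_of_view("where is the mug?", ["f0", "f1", "f2"], scene=None))
```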

Result: CoV achieves +11.56% average improvement in LLM-Match on OpenEQA across four VLMs, with test-time scaling showing additional +2.51% improvement with increased action budget. Strong performance on ScanQA (116 CIDEr / 31.9 EM@1) and SQA3D (51.1 EM@1).

Conclusion: Question-aligned view selection with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D embodied question answering without requiring additional training.

Abstract: Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision–language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training. Code is available on https://github.com/ziplab/CoV .

cs.AI

[194] Naiad: Novel Agentic Intelligent Autonomous System for Inland Water Monitoring

Eirini Baltzi, Tilemachos Moumouris, Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos

Main category: cs.AI

TL;DR: NAIAD is an agentic AI assistant that uses LLMs and external tools to provide holistic inland water monitoring from Earth Observation data through natural language queries, achieving over 77% correctness and 85% relevancy.

DetailsMotivation: Existing water monitoring methods address isolated sub-problems separately (cyanobacteria, chlorophyll, etc.), lacking integrated solutions. There's a need for holistic monitoring accessible to both experts and non-experts.

Method: NAIAD combines LLMs with external tools via Retrieval-Augmented Generation (RAG), LLM reasoning, tool orchestration, computational graph execution, and agentic reflection. It integrates weather data, Sentinel-2 imagery, remote-sensing indices (NDCI), chlorophyll-a estimation, and platforms like CyFi.
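
One of the toolkit's indices is easy to state concretely: NDCI is a normalized difference of red-edge and red reflectances (Sentinel-2 bands B5 and B4 under the usual convention; the band choice is our assumption, not quoted from the paper).

```python
import numpy as np

def ndci(red_edge_b5: np.ndarray, red_b4: np.ndarray) -> np.ndarray:
    return (red_edge_b5 - red_b4) / (red_edge_b5 + red_b4 + 1e-9)

b5 = np.array([0.12, 0.30]); b4 = np.array([0.10, 0.10])
print(ndci(b5, b4))   # higher values suggest more chlorophyll-a
```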

Result: Achieved over 77% correctness and 85% relevancy on a dedicated benchmark across multiple user-expertise levels. Gemma 3 (27B) and Qwen 2.5 (14B) showed best balance between computational efficiency and reasoning performance in ablation study.

Conclusion: NAIAD provides a holistic, accessible solution for inland water monitoring that successfully integrates multiple data sources and analytical tools through natural language interface, demonstrating strong adaptability and robustness.

Abstract: Inland water monitoring is vital for safeguarding public health and ecosystems, enabling timely interventions to mitigate risks. Existing methods often address isolated sub-problems such as cyanobacteria, chlorophyll, or other quality indicators separately. NAIAD introduces an agentic AI assistant that leverages Large Language Models (LLMs) and external analytical tools to deliver a holistic solution for inland water monitoring using Earth Observation (EO) data. Designed for both experts and non-experts, NAIAD provides a single-prompt interface that translates natural-language queries into actionable insights. Through Retrieval-Augmented Generation (RAG), LLM reasoning, external tool orchestration, computational graph execution, and agentic reflection, it retrieves and synthesizes knowledge from curated sources to produce tailored reports. The system integrates diverse tools for weather data, Sentinel-2 imagery, remote-sensing index computation (e.g., NDCI), chlorophyll-a estimation, and established platforms such as CyFi. Performance is evaluated using correctness and relevancy metrics, achieving over 77% and 85% respectively on a dedicated benchmark covering multiple user-expertise levels. Preliminary results show strong adaptability and robustness across query types. An ablation study on LLM backbones further highlights Gemma 3 (27B) and Qwen 2.5 (14B) as offering the best balance between computational efficiency and reasoning performance.

[195] Mathematical Knowledge Graph-Driven Framework for Equation-Based Predictive and Reliable Additive Manufacturing

Yeongbin Cha, Namjung Kim

Main category: cs.AI

TL;DR: This paper proposes an ontology-guided, equation-centric framework that integrates LLMs with an additive manufacturing knowledge graph for reliable knowledge extraction and extrapolative modeling, addressing limitations of existing data-driven approaches in AM.

DetailsMotivation: Existing data-driven approaches in additive manufacturing are limited by fragmented knowledge representations and unreliable extrapolation under sparse data conditions. There's a need for more reliable methods to understand and extrapolate process-property relationships in AM.

Method: The authors propose an ontology-guided framework that integrates large language models with an additive manufacturing mathematical knowledge graph (AM-MKG). The approach transforms unstructured literature into machine-interpretable representations using formal ontologies, uses LLM-based equation generation conditioned on MKG-derived subgraphs, and introduces confidence-aware extrapolation assessment combining extrapolation distance, statistical stability, and physical consistency.
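
The unified confidence score combines three signals; the exact functional form is the paper's, so the weighted combination below is only a plausible sketch with made-up component definitions.

```python
import numpy as np

def confidence(x_query, x_train, pred_std, physics_ok, weights=(0.4, 0.3, 0.3)):
    d = np.linalg.norm(x_train - x_query, axis=1).min()  # extrapolation distance
    dist_score = np.exp(-d)                    # near training data -> high score
    stab_score = 1.0 / (1.0 + pred_std)        # stable predictions -> high score
    w1, w2, w3 = weights                       # physics_ok in [0, 1] from the KG check
    return w1 * dist_score + w2 * stab_score + w3 * physics_ok

x_train = np.random.default_rng(1).uniform(0, 1, size=(50, 3))
print(confidence(np.array([0.5, 0.5, 0.5]), x_train, pred_std=0.1, physics_ok=1.0))
```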

Result: The ontology-guided extraction significantly improves structural coherence and quantitative reliability of extracted knowledge. Subgraph-conditioned equation generation yields stable and physically consistent extrapolations compared to unguided LLM outputs. The framework demonstrates reliable knowledge extraction and principled extrapolative modeling.

Conclusion: This work establishes a unified pipeline for ontology-driven knowledge representation, equation-centered reasoning, and confidence-based extrapolation assessment. It highlights the potential of knowledge-graph-augmented LLMs as reliable tools for extrapolative modeling in additive manufacturing, addressing current limitations in data-driven AM approaches.

Abstract: Additive manufacturing (AM) relies critically on understanding and extrapolating process-property relationships; however, existing data-driven approaches remain limited by fragmented knowledge representations and unreliable extrapolation under sparse data conditions. In this study, we propose an ontology-guided, equation-centric framework that tightly integrates large language models (LLMs) with an additive manufacturing mathematical knowledge graph (AM-MKG) to enable reliable knowledge extraction and principled extrapolative modeling. By explicitly encoding equations, variables, assumptions, and their semantic relationships within a formal ontology, unstructured literature is transformed into machine-interpretable representations that support structured querying and reasoning. LLM-based equation generation is further conditioned on MKG-derived subgraphs, enforcing physically meaningful functional forms and mitigating non-physical or unstable extrapolation trends. To assess reliability beyond conventional predictive uncertainty, a confidence-aware extrapolation assessment is introduced, integrating extrapolation distance, statistical stability, and knowledge-graph-based physical consistency into a unified confidence score. Results demonstrate that ontology-guided extraction significantly improves the structural coherence and quantitative reliability of extracted knowledge, while subgraph-conditioned equation generation yields stable and physically consistent extrapolations compared to unguided LLM outputs. Overall, this work establishes a unified pipeline for ontology-driven knowledge representation, equation-centered reasoning, and confidence-based extrapolation assessment, highlighting the potential of knowledge-graph-augmented LLMs as reliable tools for extrapolative modeling in additive manufacturing.

[196] Effects of personality steering on cooperative behavior in Large Language Model agents

Mizuki Sakai, Mizuki Yokoyama, Wakaba Tateishi, Genki Ichinose

Main category: cs.AI

TL;DR: Personality steering in LLM agents affects cooperation in Prisoner’s Dilemma games, with agreeableness being the dominant factor promoting cooperation, but personality acts as a behavioral bias rather than deterministic control.

DetailsMotivation: As LLMs are increasingly used as autonomous agents in strategic interactions, there's a need to understand how personality steering affects cooperative behavior under controlled conditions, particularly since previous studies suggest personality traits influence LLM behavior but the specific effects on cooperation remain unclear.

Method: Used repeated Prisoner’s Dilemma games to examine cooperative behavior in LLM agents (GPT-3.5-turbo, GPT-4o, GPT-5). First measured basic personality profiles using Big Five Inventory, then compared behavior under baseline vs. personality-informed conditions, and independently manipulated each personality dimension to extreme values.
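
The underlying game is standard: a Prisoner's Dilemma payoff matrix played repeatedly, with each agent choosing to cooperate (C) or defect (D) per round. `llm_move` is a hypothetical placeholder for the prompted model; the toy policy below also illustrates how unconditional cooperation invites exploitation.

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def llm_move(history, persona):        # placeholder: agreeable agents cooperate
    return "C" if persona == "high-agreeableness" else "D"

def play(rounds=10, p1="high-agreeableness", p2="baseline"):
    history, s1, s2 = [], 0, 0
    for _ in range(rounds):
        a, b = llm_move(history, p1), llm_move(history, p2)
        r1, r2 = PAYOFF[(a, b)]
        s1, s2 = s1 + r1, s2 + r2
        history.append((a, b))
    return s1, s2

print(play())   # (0, 50): unconditional cooperation is exploitable
```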

Result: Agreeableness was the dominant factor promoting cooperation across all models, while other personality traits had limited impact. Explicit personality information increased cooperation but also raised vulnerability to exploitation, especially in earlier-generation models. Later-generation models showed more selective cooperation.

Conclusion: Personality steering acts as a behavioral bias rather than a deterministic control mechanism for LLM agents in strategic interactions, with agreeableness being the key factor influencing cooperative behavior.

Abstract: Large language models (LLMs) are increasingly used as autonomous agents in strategic and social interactions. Although recent studies suggest that assigning personality traits to LLMs can influence their behavior, how personality steering affects cooperation under controlled conditions remains unclear. In this study, we examine the effects of personality steering on cooperative behavior in LLM agents using repeated Prisoner’s Dilemma games. Based on the Big Five framework, we first measure basic personality profiles of three models, GPT-3.5-turbo, GPT-4o, and GPT-5, using the Big Five Inventory. We then compare behavior under baseline and personality-informed conditions, and further analyze the effects of independently manipulating each personality dimension to extreme values. Our results show that agreeableness is the dominant factor promoting cooperation across all models, while other personality traits have limited impact. Explicit personality information increases cooperation but can also raise vulnerability to exploitation, particularly in earlier-generation models. In contrast, later-generation models exhibit more selective cooperation. These findings indicate that personality steering acts as a behavioral bias rather than a deterministic control mechanism.

[197] Improving Enzyme Prediction with Chemical Reaction Equations by Hypergraph-Enhanced Knowledge Graph Embeddings

Tengwei Song, Long Yin, Zhen Han, Zhiqiang Xu

Main category: cs.AI

TL;DR: Hyper-Enz: A knowledge-enhanced hypergraph model that uses chemical reaction equations as knowledge graph triples to predict enzyme-substrate interactions, achieving 88% relative improvement in enzyme retrieval accuracy.

DetailsMotivation: Existing enzyme-substrate prediction methods rely on sparse, labor-intensive expert-curated databases, limiting generalization to unseen interactions. Chemical reaction databases offer denser, more abundant data but create complex relational patterns that traditional models cannot capture.

Method: Represent chemical reaction equations as (educt, enzyme, product) triples in a knowledge graph. Propose Hyper-Enz model integrating hypergraph transformer with KGE to learn representations of hyper-edges involving multiple educts/products. Use multi-expert paradigm to guide learning with both model and chemical reaction equations.
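
The reaction-as-triple encoding admits standard KGE scoring; a TransE-style sketch (the actual Hyper-Enz model adds a hypergraph transformer on top of such embeddings):

```python
import torch

n_compounds, n_enzymes, dim = 100, 20, 64
compound_emb = torch.nn.Embedding(n_compounds, dim)
enzyme_emb = torch.nn.Embedding(n_enzymes, dim)    # enzymes act as relations

def score(educt, enzyme, product):
    """Lower is better: educt + enzyme should land near product."""
    h, r, t = compound_emb(educt), enzyme_emb(enzyme), compound_emb(product)
    return torch.norm(h + r - t, dim=-1)

triple = (torch.tensor([3]), torch.tensor([7]), torch.tensor([42]))
print(score(*triple))          # training minimizes this for observed reactions
```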

Result: Significant improvement over traditional models: up to 88% relative improvement in average enzyme retrieval accuracy and 30% improvement in pair-level prediction.

Conclusion: The approach effectively leverages chemical reaction equations as knowledge graph data to overcome data sparsity in enzyme-substrate prediction, demonstrating superior performance through knowledge-enhanced hypergraph modeling.

Abstract: Predicting enzyme-substrate interactions has long been a fundamental problem in biochemistry and metabolic engineering. While existing methods could leverage databases of expert-curated enzyme-substrate pairs for models to learn from known pair interactions, the databases are often sparse, i.e., there are only limited and incomplete examples of such pairs, and also labor-intensive to maintain. This lack of sufficient training data significantly hinders the ability of traditional enzyme prediction models to generalize to unseen interactions. In this work, we try to exploit chemical reaction equations from domain-specific databases, given their easier accessibility and denser, more abundant data. However, interactions of multiple compounds, e.g., educts and products, with the same enzymes create complex relational data patterns that traditional models cannot easily capture. To tackle that, we represent chemical reaction equations as triples of (educt, enzyme, product) within a knowledge graph, such that we can take advantage of knowledge graph embedding (KGE) to infer missing enzyme-substrate pairs for graph completion. Particularly, in order to capture intricate relationships among compounds, we propose our knowledge-enhanced hypergraph model for enzyme prediction, i.e., Hyper-Enz, which integrates a hypergraph transformer with a KGE model to learn representations of the hyper-edges that involve multiple educts and products. Also, a multi-expert paradigm is introduced to guide the learning of enzyme-substrate interactions with both the proposed model and chemical reaction equations. Experimental results show a significant improvement, with up to an 88% relative improvement in average enzyme retrieval accuracy and 30% improvement in pair-level prediction compared to traditional models, demonstrating the effectiveness of our approach.

[198] The Persona Paradox: Medical Personas as Behavioral Priors in Clinical Language Models

Tassallah Abdullahi, Shrestha Ghosh, Hamish S Fraser, Daniel León Tramontini, Adeel Abbasi, Ghada Bourjeily, Carsten Eickhoff, Ritambhara Singh

Main category: cs.AI

TL;DR: Persona conditioning in LLMs has non-monotonic effects on clinical decision-making: medical personas improve critical care performance but degrade primary care performance, with interaction styles affecting risk behavior in model-dependent ways.

DetailsMotivation: To systematically evaluate how persona-based control (professional roles and interaction styles) affects LLM behavior in high-stakes clinical decision-making, challenging the assumption that personas monotonically improve expertise and safety.

Method: Systematic evaluation of persona-based control in clinical LLMs across professional roles (Emergency Department physician, nurse) and interaction styles (bold vs. cautious). Assessed performance on clinical triage and patient-safety tasks using multidimensional evaluations capturing task accuracy, calibration, and safety-relevant risk behavior.

Result: Medical personas improve performance in critical care tasks (up to ~+20% in accuracy and calibration) but degrade performance in primary-care settings by comparable margins. Interaction style modulates risk propensity, but the effect is highly model-dependent. LLM-judge rankings favor medical personas in safety-critical cases, yet human clinicians show only moderate agreement on safety compliance (Cohen's κ = 0.43) and report low confidence in 95.9% of their responses on reasoning quality.
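
For reference, the inter-rater agreement quoted above is Cohen's kappa; a direct two-rater implementation:

```python
from collections import Counter

def cohens_kappa(r1, r2):
    n = len(r1)
    p_o = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    c1, c2 = Counter(r1), Counter(r2)
    p_e = sum(c1[k] * c2[k] for k in c1) / n**2            # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(round(cohens_kappa(list("AABBA"), list("ABBBA")), 2))   # 0.62
```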

Conclusion: Personas function as behavioral priors that introduce context-dependent trade-offs rather than guarantees of safety or expertise, highlighting the need for careful persona selection in clinical LLM applications.

Abstract: Persona conditioning can be viewed as a behavioral prior for large language models (LLMs) and is often assumed to confer expertise and improve safety in a monotonic manner. However, its effects on high-stakes clinical decision-making remain poorly characterized. We systematically evaluate persona-based control in clinical LLMs, examining how professional roles (e.g., Emergency Department physician, nurse) and interaction styles (bold vs. cautious) influence behavior across models and medical tasks. We assess performance on clinical triage and patient-safety tasks using multidimensional evaluations that capture task accuracy, calibration, and safety-relevant risk behavior. We find systematic, context-dependent, and non-monotonic effects: medical personas improve performance in critical care tasks, yielding gains of up to ~+20% in accuracy and calibration, but degrade performance in primary-care settings by comparable margins. Interaction style modulates risk propensity and sensitivity, but these effects are highly model-dependent. While aggregated LLM-judge rankings favor medical over non-medical personas in safety-critical cases, we found that human clinicians show moderate agreement on safety compliance (average Cohen's κ = 0.43) but indicate low confidence in 95.9% of their responses on reasoning quality. Our work shows that personas function as behavioral priors that introduce context-dependent trade-offs rather than guarantees of safety or expertise. The code is available at https://github.com/rsinghlab/Persona_Paradox.

[199] Conformity and Social Impact on AI Agents

Alessandro Bellina, Giordano De Marzo, David Garcia

Main category: cs.AI

TL;DR: AI agents exhibit systematic conformity bias in multi-agent environments, becoming highly susceptible to social influence manipulation despite near-perfect individual performance, revealing security vulnerabilities in collective AI systems.

DetailsMotivation: As AI agents increasingly operate in multi-agent environments, understanding their collective behavior becomes critical for predicting artificial society dynamics. The study aims to examine conformity (alignment with group opinions under social pressure) in large multimodal language models functioning as AI agents.

Method: The researchers adapted classic visual experiments from social psychology to investigate how AI agents respond to group influence as social actors. They examined conformity across different model scales and tested sensitivity to various factors including group size, unanimity, task difficulty, and source characteristics.

Result: AI agents exhibit systematic conformity bias aligned with Social Impact Theory. Agents achieving near-perfect performance in isolation become highly susceptible to manipulation through social influence. While larger models show reduced conformity on simple tasks due to improved capabilities, they remain vulnerable when operating at their competence boundary. The vulnerability persists across model scales.

Conclusion: The findings reveal fundamental security vulnerabilities in AI agent decision-making that could enable malicious manipulation, misinformation campaigns, and bias propagation in multi-agent systems. This highlights the urgent need for safeguards in collective AI deployments to prevent exploitation of conformity biases.

Abstract: As AI agents increasingly operate in multi-agent environments, understanding their collective behavior becomes critical for predicting the dynamics of artificial societies. This study examines conformity, the tendency to align with group opinions under social pressure, in large multimodal language models functioning as AI agents. By adapting classic visual experiments from social psychology, we investigate how AI agents respond to group influence as social actors. Our experiments reveal that AI agents exhibit a systematic conformity bias, aligned with Social Impact Theory, showing sensitivity to group size, unanimity, task difficulty, and source characteristics. Critically, AI agents achieving near-perfect performance in isolation become highly susceptible to manipulation through social influence. This vulnerability persists across model scales: while larger models show reduced conformity on simple tasks due to improved capabilities, they remain vulnerable when operating at their competence boundary. These findings reveal fundamental security vulnerabilities in AI agent decision-making that could enable malicious manipulation, misinformation campaigns, and bias propagation in multi-agent systems, highlighting the urgent need for safeguards in collective AI deployments.

[200] On the Effect of Cheating in Chess

Daniel Keren

Main category: cs.AI

TL;DR: Researchers develop algorithms to measure performance gains from limited cheating using chess engines, focusing on quantifying cheating effectiveness rather than detection.

DetailsMotivation: Cheating in chess using software advice has become a serious problem at all levels, including elite play. While most previous work focused on detection, this study aims to quantify how much performance improvement cheaters can actually gain from limited cheating during games.

Method: Developed and tested algorithms on a commonly used chess engine to evaluate performance gains from cheating a limited number of times during a game.
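
A minimal sketch of this kind of measurement setup using python-chess with any UCI engine on PATH (e.g., Stockfish): the "cheater" consults the engine a bounded number of times and otherwise plays a weak fallback. How the paper selects which moves to cheat on is its own contribution; here the choice is random.

```python
import random
import chess
import chess.engine

def play_with_cheats(engine, cheats_allowed=3, max_plies=60):
    board, cheats = chess.Board(), cheats_allowed
    while not board.is_game_over() and board.ply() < max_plies:
        cheat_now = board.turn == chess.WHITE and cheats > 0 and random.random() < 0.1
        if cheat_now:
            move = engine.play(board, chess.engine.Limit(depth=15)).move
            cheats -= 1                                    # spend one consultation
        else:
            move = random.choice(list(board.legal_moves))  # weak fallback player
        board.push(move)
    return board.result(claim_draw=True)

with chess.engine.SimpleEngine.popen_uci("stockfish") as eng:
    print(play_with_cheats(eng))
```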

Result: The paper presents algorithms and test results showing measurable performance improvements from limited cheating, providing concrete data on cheating effectiveness.

Conclusion: Quantifying cheating effectiveness is crucial for containing and detecting cheating, as understanding the potential gains helps develop better countermeasures against software-assisted cheating in chess.

Abstract: Cheating in chess, by using advice from powerful software, has become a major problem, reaching the highest levels. As opposed to the large majority of previous work, which concerned detection of cheating, here we try to evaluate the possible gain in performance obtained by cheating a limited number of times during a game. Algorithms are developed and tested on a commonly used chess engine (i.e., software). Needless to say, the goal of this work is not to assist cheaters, but to measure the effectiveness of cheating, which is crucial as part of the effort to contain and detect it.

[201] ART: Adaptive Reasoning Trees for Explainable Claim Verification

Sahil Wadhwa, Himanshu Kumar, Guanqun Yang, Abbaas Alif Mohamed Nishar, Pranab Mohanty, Swapnil Shinde, Yue Wu

Main category: cs.AI

TL;DR: ART (Adaptive Reasoning Trees) is a hierarchical method for claim verification that uses LLMs to create branching arguments, adjudicate them via pairwise tournaments, and produce transparent, contestable verdicts.

DetailsMotivation: LLMs have strong decision-making capabilities but lack transparency in high-stakes environments - their outputs lack faithful explanations and cannot be effectively contested to correct errors, undermining trustworthiness.

Method: ART uses hierarchical reasoning trees starting with a root claim that branches into supporting/attacking child arguments. Argument strength is determined bottom-up via pairwise tournaments of children adjudicated by a judge LLM, systematically deriving transparent and contestable verdicts.
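
A skeleton of the bottom-up adjudication: each node's strength comes from a round-robin pairwise tournament among its children, decided by a judge. `judge_prefers` and the aggregation rule below are illustrative placeholders, not the paper's definitions.

```python
def judge_prefers(arg_a, arg_b) -> bool:       # placeholder: longer argument wins
    return len(arg_a["text"]) >= len(arg_b["text"])

def strength(node) -> float:
    children = node.get("children", [])
    if not children:
        return 1.0                              # leaf arguments get full weight
    wins = [0] * len(children)
    for i in range(len(children)):              # round-robin pairwise tournament
        for j in range(i + 1, len(children)):
            if judge_prefers(children[i], children[j]):
                wins[i] += 1
            else:
                wins[j] += 1
    best = max(range(len(children)), key=wins.__getitem__)
    # Illustrative aggregation: winner's win rate scaled by its subtree strength.
    return strength(children[best]) * wins[best] / max(1, len(children) - 1)

claim = {"text": "root claim", "children": [
    {"text": "strong supporting argument with cited evidence"},
    {"text": "weak attack"},
]}
print(strength(claim))                          # 1.0 for this tiny tree
```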

Result: Empirical validation on multiple datasets shows ART’s structured reasoning outperforms strong baselines, establishing a new benchmark for explainable claim verification that is more reliable and ensures clarity in decision-making.

Conclusion: ART addresses the opacity problem of LLMs in high-stakes decision-making by providing a systematic, transparent, and contestable verification framework that outperforms existing methods like Chain-of-Thought.

Abstract: Large Language Models (LLMs) are powerful candidates for complex decision-making, leveraging vast encoded knowledge and remarkable zero-shot abilities. However, their adoption in high-stakes environments is hindered by their opacity; their outputs lack faithful explanations and cannot be effectively contested to correct errors, undermining trustworthiness. In this paper, we propose ART (Adaptive Reasoning Trees), a hierarchical method for claim verification. The process begins with a root claim, which branches into supporting and attacking child arguments. An argument’s strength is determined bottom-up via a pairwise tournament of its children, adjudicated by a judge LLM, allowing a final, transparent and contestable verdict to be systematically derived which is missing in methods like Chain-of-Thought (CoT). We empirically validate ART on multiple datasets, analyzing different argument generators and comparison strategies. Our findings show that ART’s structured reasoning outperforms strong baselines, establishing a new benchmark for explainable claim verification that is more reliable and brings clarity to the overall decision-making step.

[202] Crisis-Bench: Benchmarking Strategic Ambiguity and Reputation Management in Large Language Models

Cooper Lin, Maohao Ran, Yanting Zhang, Zhenglin Wan, Hongwei Fan, Yibo Xu, Yike Guo, Wei Xue, Jun Song

Main category: cs.AI

TL;DR: Crisis-Bench is a multi-agent POMDP benchmark that evaluates LLMs’ ability to strategically withhold information in corporate crisis scenarios, revealing a tension between universal safety alignment and professional needs for strategic ambiguity.

DetailsMotivation: Standard safety alignment creates a "transparency tax" that harms professional domains like PR, negotiation, and crisis management which require strategic information withholding. There's a gap between general safety principles and professional utility in high-stakes scenarios.

Method: Created Crisis-Bench: a multi-agent POMDP with 80 storylines across 8 industries. Features a PR Agent navigating 7-day corporate crisis simulations with separate Private/Public narrative states to enforce information asymmetry. Introduced Adjudicator-Market Loop evaluation where public sentiment affects simulated stock price.
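
A sketch of the Adjudicator-Market loop's economic signal, under the assumption of a simple multiplicative update: daily adjudicated sentiment in [-1, 1] moves a simulated stock price. The coefficient and update rule are illustrative, not the benchmark's exact mechanism.

```python
def run_market(daily_sentiment, price=100.0, beta=0.05):
    trajectory = [price]
    for s in daily_sentiment:               # one adjudicated score per crisis day
        price *= (1.0 + beta * s)           # good press lifts the price
        trajectory.append(round(price, 2))
    return trajectory

print(run_market([-0.8, -0.3, 0.1, 0.4, 0.6, 0.2, 0.5]))  # 7-day crisis
```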

Result: Models show a dichotomy: some capitulate to ethical concerns, while others demonstrate Machiavellian strategic withholding to stabilize stock prices. Crisis-Bench provides first quantitative framework for assessing “Reputation Management” capabilities.

Conclusion: Need to shift from rigid moral absolutism to context-aware professional alignment. Different domains require different ethical frameworks, and universal safety alignment may be inappropriate for professional contexts requiring strategic ambiguity.

Abstract: Standard safety alignment optimizes Large Language Models (LLMs) for universal helpfulness and honesty, effectively instilling a rigid “Boy Scout” morality. While robust for general-purpose assistants, this one-size-fits-all ethical framework imposes a “transparency tax” on professional domains requiring strategic ambiguity and information withholding, such as public relations, negotiation, and crisis management. To measure this gap between general safety and professional utility, we introduce Crisis-Bench, a multi-agent Partially Observable Markov Decision Process (POMDP) that evaluates LLMs in high-stakes corporate crises. Spanning 80 diverse storylines across 8 industries, Crisis-Bench tasks an LLM-based Public Relations (PR) Agent with navigating a dynamic 7-day corporate crisis simulation while managing strictly separated Private and Public narrative states to enforce rigorous information asymmetry. Unlike traditional benchmarks that rely on static ground truths, we introduce the Adjudicator-Market Loop: a novel evaluation metric where public sentiment is adjudicated and translated into a simulated stock price, creating a realistic economic incentive structure. Our results expose a critical dichotomy: while some models capitulate to ethical concerns, others demonstrate the capacity for Machiavellian, legitimate strategic withholding in order to stabilize the simulated stock price. Crisis-Bench provides the first quantitative framework for assessing “Reputation Management” capabilities, arguing for a shift from rigid moral absolutism to context-aware professional alignment.

[203] PRISMA: Reinforcement Learning Guided Two-Stage Policy Optimization in Multi-Agent Architecture for Open-Domain Multi-Hop Question Answering

Yu Liu, Wenxiao Zhang, Cong Cao, Wenxuan Lu, Fangfang Yuan, Diandian Guo, Kun Peng, Qiang Sun, Kaiyan Zhang, Yanbing Liu, Jin B. Hong, Bowen Zhou, Zhiyuan Ma

Main category: cs.AI

TL;DR: PRISMA is a decoupled RL framework with Plan-Retrieve-Inspect-Solve-Memoize architecture that addresses retrieval collapse and learning instability in multi-hop RAG systems through reasoning-guided collaboration and two-stage policy optimization.

DetailsMotivation: Current RL-optimized RAG systems for multi-hop question answering suffer from two main problems: 1) Retrieval Collapse - iterative retrieval fails to find intermediate evidence without reasoning-guided planning, and 2) Learning Instability - end-to-end training has weak credit assignment and poor error localization, leading to overfitting and limited transferability.

Method: PRISMA uses a decoupled Plan-Retrieve-Inspect-Solve-Memoize architecture where agents collaborate: Inspector provides reasoning feedback to refine Planner’s decomposition and retrieval, while enforcing evidence-grounded reasoning in Solver. Optimization uses Two-Stage Group Relative Policy Optimization (GRPO): Stage I calibrates Planner and Solver as specialized experts; Stage II uses Observation-Aware Residual Policy Optimization (OARPO) to enhance Inspector’s verification and recovery capabilities.
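
At the core of GRPO-style optimization is the group-relative advantage: rewards for a group of rollouts of the same prompt are normalized against the group's own mean and standard deviation, removing the need for a learned value network.

```python
import numpy as np

def group_relative_advantage(rewards: np.ndarray, eps: float = 1e-8):
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rollout_rewards = np.array([0.0, 1.0, 1.0, 0.5])   # one group, same question
print(group_relative_advantage(rollout_rewards))
```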

Result: PRISMA achieves state-of-the-art performance on ten benchmarks and can be deployed efficiently in real-world scenarios.

Conclusion: The decoupled RL-guided framework with reasoning-guided collaboration effectively addresses retrieval collapse and learning instability in multi-hop RAG systems, enabling reliable deployment with improved performance and transferability.

Abstract: Answering real-world open-domain multi-hop questions over massive corpora is a critical challenge in Retrieval-Augmented Generation (RAG) systems. Recent research employs reinforcement learning (RL) to end-to-end optimize the retrieval-augmented reasoning process, directly enhancing its capacity to resolve complex queries. However, reliable deployment is hindered by two obstacles. 1) Retrieval Collapse: iterative retrieval over large corpora fails to locate intermediate evidence containing bridge answers without reasoning-guided planning, causing downstream reasoning to collapse. 2) Learning Instability: end-to-end trajectory training suffers from weak credit assignment across reasoning chains and poor error localization across modules, causing overfitting to benchmark-specific heuristics that limit transferability and stability. To address these problems, we propose PRISMA, a decoupled RL-guided framework featuring a Plan-Retrieve-Inspect-Solve-Memoize architecture. PRISMA’s strength lies in reasoning-guided collaboration: the Inspector provides reasoning-based feedback to refine the Planner’s decomposition and fine-grained retrieval, while enforcing evidence-grounded reasoning in the Solver. We optimize individual agent capabilities via Two-Stage Group Relative Policy Optimization (GRPO). Stage I calibrates the Planner and Solver as specialized experts in planning and reasoning, while Stage II utilizes Observation-Aware Residual Policy Optimization (OARPO) to enhance the Inspector’s ability to verify context and trigger targeted recovery. Experiments show that PRISMA achieves state-of-the-art performance on ten benchmarks and can be deployed efficiently in real-world scenarios.

[204] MMUEChange: A Generalized LLM Agent Framework for Intelligent Multi-Modal Urban Environment Change Analysis

Zixuan Xiao, Jun Ma, Siwei Zhang

Main category: cs.AI

TL;DR: MMUEChange is a multi-modal agent framework that integrates heterogeneous urban data for robust analysis of complex urban change scenarios, achieving 46.7% improvement in task success rate over baselines.

DetailsMotivation: Current urban change detection approaches rely on rigid, single-modal analysis, limiting their ability to handle complex urban scenarios. There's a need for flexible integration of heterogeneous urban data to better understand urban environment changes for sustainable development.

Method: Proposes MMUEChange, a multi-modal agent framework with a modular toolkit and a core Modality Controller for cross- and intra-modal alignment, enabling flexible integration of heterogeneous urban data.

Result: Achieves 46.7% improvement in task success rate compared to best-performing baseline, effectively mitigates hallucination, and demonstrates capacity through case studies: small park shift in NYC, water pollution spread in Hong Kong, and dumpsite decline in Shenzhen with contrasting waste-activity relationships.

Conclusion: MMUEChange effectively supports complex urban change analysis with real-world policy implications, overcoming limitations of traditional single-modal approaches through flexible multi-modal integration.

Abstract: Understanding urban environment change is essential for sustainable development. However, current approaches, particularly remote sensing change detection, often rely on rigid, single-modal analysis. To overcome these limitations, we propose MMUEChange, a multi-modal agent framework that flexibly integrates heterogeneous urban data via a modular toolkit and a core module, Modality Controller for cross- and intra-modal alignment, enabling robust analysis of complex urban change scenarios. Case studies include: a shift toward small, community-focused parks in New York, reflecting local green space efforts; the spread of concentrated water pollution across districts in Hong Kong, pointing to coordinated water management; and a notable decline in open dumpsites in Shenzhen, with contrasting links between nighttime economic activity and waste types, indicating differing urban pressures behind domestic and construction waste. Compared to the best-performing baseline, the MMUEChange agent achieves a 46.7% improvement in task success rate and effectively mitigates hallucination, demonstrating its capacity to support complex urban change analysis tasks with real-world policy implications.

[205] The Evaluation Gap in Medicine, AI and LLMs: Navigating Elusive Ground Truth & Uncertainty via a Probabilistic Paradigm

Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal, Enes Gurun, Brian Hur, Saab Mansour, Ravid Shwartz Ziv, Karin Verspoor, Dan Roth

Main category: cs.AI

TL;DR: Proposes a probabilistic paradigm for AI benchmarking that accounts for uncertainty in ground truth answers, introducing expected accuracy and F1 metrics to properly evaluate systems when expert agreement varies.

Motivation: Current AI benchmarking ignores uncertainty in ground truth answers, which is particularly problematic in medicine where uncertainty is pervasive. This can lead to misleading conclusions where non-experts appear to perform similarly to experts when ground truth answers have high variability.

Method: Introduces a probabilistic paradigm that theoretically explains how ground truth certainty affects expert performance. Develops concepts of expected accuracy and expected F1 to estimate scores given ground truth answer variability. Recommends stratifying results by probability of ground truth answers (measured by expert agreement rates).

Result: Shows that high certainty in ground truth answers is necessary for experts to achieve high scores, while in datasets with high variation, there may be little difference between random labelers and experts. Stratification becomes critical when overall performance drops below 80%.

Conclusion: AI benchmarking should account for ground truth uncertainty through stratified evaluation by probability of ground truth answers. This makes performance comparisons more reliable in high certainty bins and mitigates the confounding effect of uncertainty.

Abstract: Benchmarking the relative capabilities of AI systems, including Large Language Models (LLMs) and Vision Models, typically ignores the impact of uncertainty in the underlying ground truth answers from experts. This ambiguity is particularly consequential in medicine where uncertainty is pervasive. In this paper, we introduce a probabilistic paradigm to theoretically explain how high certainty in ground truth answers is almost always necessary for even an expert to achieve high scores, whereas in datasets with high variation in ground truth answers there may be little difference between a random labeller and an expert. Therefore, ignoring uncertainty in ground truth evaluation data can result in the misleading conclusion that a non-expert has similar performance to that of an expert. Using the probabilistic paradigm, we thus bring forth the concepts of expected accuracy and expected F1 to estimate the score an expert human or system can achieve given ground truth answer variability. Our work leads to the recommendation that when establishing the capability of a system, results should be stratified by probability of the ground truth answer, typically measured by the agreement rate of ground truth experts. Stratification becomes critical when the overall performance drops below a threshold of 80%. Under stratified evaluation, performance comparison becomes more reliable in high certainty bins, mitigating the effect of the key confounding factor – uncertainty.
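
One natural reading of expected accuracy, sketched below: if item i's reference answer matches the consensus only with probability p_i (the expert agreement rate), then even an oracle that always returns the consensus answer scores mean(p_i) in expectation. The bin edges are illustrative, chosen around the paper's 80% threshold:

```python
import numpy as np

def expected_accuracy(agreement_rates):
    """Expected score of a labeller that always returns the consensus
    answer, when item i's sampled ground truth matches the consensus
    only with probability agreement_rates[i]."""
    return float(np.mean(agreement_rates))

def stratify(agreement_rates, correct, edges=(0.0, 0.5, 0.8, 1.01)):
    """Accuracy stratified into ground-truth certainty bins."""
    p = np.asarray(agreement_rates)
    c = np.asarray(correct, dtype=float)
    return {f"[{lo},{hi})": float(c[(p >= lo) & (p < hi)].mean())
            for lo, hi in zip(edges, edges[1:])
            if ((p >= lo) & (p < hi)).any()}

# With agreement rates this low, even a perfect expert tops out near 0.83.
print(expected_accuracy([0.6, 0.9, 1.0]))
```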

[206] Explainable AI: Learning from the Learners

Ricardo Vinuesa, Steven L. Brunton, Gianmarco Mengaldo

Main category: cs.AI

TL;DR: XAI combined with causal reasoning enables learning from AI systems to extract causal mechanisms, guide design/control, and support trust in high-stakes applications.

Motivation: While AI outperforms humans in many tasks, its internal representations remain opaque. There's a need to make AI more interpretable and to learn from what AI systems have discovered.

Method: Combine explainable AI (XAI) with causal reasoning, using foundation models and explainability methods to extract causal mechanisms and guide robust design and control.

Result: XAI enables extraction of causal mechanisms, guides robust design and control, and supports trust and accountability in high-stakes applications.

Conclusion: XAI serves as a unifying framework for human-AI collaboration in science and engineering, though challenges remain in faithfulness, generalization, and usability of explanations.

Abstract: Artificial intelligence now outperforms humans in several scientific and engineering tasks, yet its internal representations often remain opaque. In this Perspective, we argue that explainable artificial intelligence (XAI), combined with causal reasoning, enables “learning from the learners”. Focusing on discovery, optimization and certification, we show how the combination of foundation models and explainability methods allows the extraction of causal mechanisms, guides robust design and control, and supports trust and accountability in high-stakes applications. We discuss challenges in faithfulness, generalization and usability of explanations, and propose XAI as a unifying framework for human-AI collaboration in science and engineering.

[207] Safety Not Found (404): Hidden Risks of LLM-Based Robotics Decision Making

Jua Han, Jaeyoon Seo, Jungbin Min, Jean Oh, Jihie Kim

Main category: cs.AI

TL;DR: LLMs show critical safety failures in robotics scenarios where even 1% error rates can cause catastrophic harm, making them unsuitable for direct deployment in safety-critical systems.

Motivation: As LLMs become integral to robotics decision-making, the physical dimension of risk grows significantly - a single wrong instruction can directly endanger human safety, creating an urgent need to systematically evaluate LLM performance in scenarios where minor errors are catastrophic.

Method: Qualitative evaluation of fire evacuation scenario identified critical failure cases, then designed seven quantitative tasks categorized into: Complete Information (ASCII maps to isolate spatial reasoning), Incomplete Information (inferring missing context), and Safety-Oriented Spatial Reasoning (natural language evaluation in life-threatening contexts). Benchmarked various LLMs and VLMs across these tasks.

Result: Serious vulnerabilities revealed: several models achieved 0% success rate in ASCII navigation, and in simulated fire drills, models instructed robots to move toward hazardous areas instead of emergency exits. Analysis shows how “rare” 1% failure rates escalate into catastrophic outcomes in robotics contexts.

Conclusion: Current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics as it implies one out of every hundred executions could result in catastrophic harm. Even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.

Abstract: One mistake by an AI system in a safety-critical setting can cost lives. As Large Language Models (LLMs) become integral to robotics decision-making, the physical dimension of risk grows; a single wrong instruction can directly endanger human safety. This paper addresses the urgent need to systematically evaluate LLM performance in scenarios where even minor errors are catastrophic. Through a qualitative evaluation of a fire evacuation scenario, we identified critical failure cases in LLM-based decision-making. Based on these, we designed seven tasks for quantitative assessment, categorized into: Complete Information, Incomplete Information, and Safety-Oriented Spatial Reasoning (SOSR). Complete information tasks utilize ASCII maps to minimize interpretation ambiguity and isolate spatial reasoning from visual processing. Incomplete information tasks require models to infer missing context, testing for spatial continuity versus hallucinations. SOSR tasks use natural language to evaluate safe decision-making in life-threatening contexts. We benchmark various LLMs and Vision-Language Models (VLMs) across these tasks. Beyond aggregate performance, we analyze the implications of a 1% failure rate, highlighting how “rare” errors escalate into catastrophic outcomes. Results reveal serious vulnerabilities: several models achieved a 0% success rate in ASCII navigation, while in a simulated fire drill, models instructed robots to move toward hazardous areas instead of emergency exits. Our findings lead to a sobering conclusion: current LLMs are not ready for direct deployment in safety-critical systems. A 99% accuracy rate is dangerously misleading in robotics, as it implies one out of every hundred executions could result in catastrophic harm. We demonstrate that even state-of-the-art models cannot guarantee safety, and absolute reliance on them creates unacceptable risks.
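
The 1%-failure arithmetic is easy to verify. Assuming independent executions (a simplification), a 99%-reliable policy fails at least once in 100 runs more often than not:

```python
def p_any_failure(acc: float, n: int) -> float:
    """Probability of at least one failure across n independent runs."""
    return 1.0 - acc ** n

print(p_any_failure(0.99, 100))   # ~0.634
print(p_any_failure(0.99, 1000))  # ~0.99996
```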

[208] WildSci: Advancing Scientific Reasoning from In-the-Wild Literature

Tengxiao Liu, Deepak Nathani, Zekun Li, Kevin Yang, William Yang Wang

Main category: cs.AI

TL;DR: WildSci introduces a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 disciplines and 26 subdomains, enabling scalable training for scientific reasoning in LLMs through multiple-choice format and reinforcement learning.

Motivation: Progress in LLM reasoning has been limited in scientific domains like medicine and materials science due to limited dataset coverage and complexity of open-ended scientific questions, unlike mathematics and coding where abundant data and objective metrics exist.

Method: Created WildSci dataset by automatically synthesizing domain-specific science questions from peer-reviewed literature, framed in multiple-choice format for scalable training. Applied reinforcement learning to finetune models and analyzed training dynamics including domain-specific performance changes, response behaviors, and generalization trends.

Result: Experiments on scientific benchmarks demonstrate the effectiveness of the WildSci dataset and approach. The dataset enables scalable and sustainable research in scientific reasoning.

Conclusion: WildSci addresses challenges in scientific reasoning for LLMs by providing a comprehensive dataset and training approach, released publicly to advance research in scientific reasoning across multiple disciplines.

Abstract: Recent progress in large language model (LLM) reasoning has focused on domains like mathematics and coding, where abundant high-quality data and objective evaluation metrics are readily available. In contrast, progress in LLM reasoning models remains limited in scientific domains such as medicine and materials science due to limited dataset coverage and the inherent complexity of open-ended scientific questions. To address these challenges, we introduce WildSci, a new dataset of domain-specific science questions automatically synthesized from peer-reviewed literature, covering 9 scientific disciplines and 26 subdomains. By framing complex scientific reasoning tasks in a multiple-choice format, we enable scalable training with well-defined reward signals. We further apply reinforcement learning to finetune models on these data and analyze the resulting training dynamics, including domain-specific performance changes, response behaviors, and generalization trends. Experiments on a suite of scientific benchmarks demonstrate the effectiveness of our dataset and approach. We release WildSci to enable scalable and sustainable research in scientific reasoning, available at https://huggingface.co/datasets/JustinTX/WildSci.
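
The released dataset should be loadable directly from the Hugging Face Hub at the path given in the abstract; the exact splits and fields are not described in the summary, so inspect the dataset card before relying on them:

```python
from datasets import load_dataset

ds = load_dataset("JustinTX/WildSci")  # hub path taken from the abstract
print(ds)                            # available splits and their sizes
print(next(iter(ds.values()))[0])    # one multiple-choice science question
```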

[209] Symbolic Planning and Multi-Agent Path Finding in Extremely Dense Environments with Unassigned Agents

Bo Fu, Zhe Chen, Rahul Chandan, Alex Barbosa, Michael Caldara, Joey Durham, Federico Pecora

Main category: cs.AI

TL;DR: The paper introduces the Block Rearrangement Problem (BRaP) for warehouse management, proposes five search-based algorithms, and shows they can efficiently solve rearrangement in up to 80x80 grids despite exponential search space growth.

Motivation: Large warehouse management requires efficient rearrangement of storage blocks within dense grids to achieve desired configurations, which is computationally challenging due to the exponential search space.

Method: Five search-based algorithms leveraging joint configuration space search, classical planning, multi-agent pathfinding, and expert heuristics, building on intuitions from sliding puzzle problems.

Result: The methods demonstrate efficiency in creating rearrangement plans for deeply buried blocks in up to 80x80 grids, despite the exponential relationship between search space size and block number.

Conclusion: The proposed approaches effectively solve the Block Rearrangement Problem, showing practical applicability for large-scale warehouse management scenarios.

Abstract: We introduce the Block Rearrangement Problem (BRaP), a challenging component of large warehouse management which involves rearranging storage blocks within dense grids to achieve a goal state. We formally define the BRaP as a graph search problem. Building on intuitions from sliding puzzle problems, we propose five search-based solution algorithms, leveraging joint configuration space search, classical planning, multi-agent pathfinding, and expert heuristics. We evaluate the five approaches empirically for plan quality and scalability. Despite the exponential relation between search space size and block number, our methods demonstrate efficiency in creating rearrangement plans for deeply buried blocks in up to 80x80 grids.
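
The sliding-puzzle intuition behind joint-configuration-space search fits in a few lines: breadth-first search over grid states with a single empty cell until a target block reaches its goal. This toy sketch is far from the paper's solvers, which layer classical planning, MAPF, and expert heuristics on top to scale to 80x80 grids:

```python
from collections import deque

def bfs_rearrange(grid, target, goal):
    """Toy block rearrangement: slide blocks into the single empty
    cell (0) until `target` occupies `goal`; returns the cells moved."""
    rows, cols = len(grid), len(grid[0])
    start = tuple(tuple(row) for row in grid)
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        state, plan = frontier.popleft()
        if state[goal[0]][goal[1]] == target:
            return plan
        er, ec = next((r, c) for r in range(rows) for c in range(cols)
                      if state[r][c] == 0)
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = er + dr, ec + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nxt = [list(row) for row in state]
                nxt[er][ec], nxt[nr][nc] = nxt[nr][nc], 0
                nxt = tuple(tuple(row) for row in nxt)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [(nr, nc)]))
    return None  # goal unreachable from this configuration

# Move block 3 into the top-left corner of a 2x3 grid.
print(bfs_rearrange([[1, 2, 3], [4, 5, 0]], target=3, goal=(0, 0)))
```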

[210] Reinforcement Learning of Large Language Models for Interpretable Credit Card Fraud Detection

Cooper Lin, Yanting Zhang, Maohao Ran, Wei Xue, Hongwei Fan, Yibo Xu, Zhenglin Wan, Sirui Han, Yike Guo, Jun Song

Main category: cs.AI

TL;DR: This paper proposes a reinforcement learning approach using GSPO algorithm to post-train lightweight language models for fraud detection on raw e-commerce transaction data, achieving significant F1-score improvements by discovering novel fraud indicators.

Motivation: Despite theoretical promise, LLMs remain largely unexploited for real-world fraud detection in e-commerce. There's a gap between conventional ML limitations and LLMs' untapped potential for handling domain-specific transaction data, which needs empirical validation.

Method: Uses Reinforcement Learning (RL) with Group Sequence Policy Optimization (GSPO) algorithm and rule-based reward system to post-train lightweight language models on raw transaction data from a Chinese payment company. Models explore diverse trust/risk signals in textual transaction data including customer info, shipping details, product descriptions, and order history.

Result: Post-trained language models achieve substantial F1-score improvements on held-out test data. Performance improvements are primarily due to RL’s exploration mechanism that discovers novel fraud indicators beyond traditional engineered features.

Conclusion: The RL-based approach effectively bridges the gap between conventional ML and LLMs for fraud detection, demonstrating practical effectiveness in handling real-world e-commerce transaction data through exploration of diverse fraud signals.

Abstract: E-commerce platforms and payment solution providers face increasingly sophisticated fraud schemes, ranging from identity theft and account takeovers to complex money laundering operations that exploit the speed and anonymity of digital transactions. However, despite their theoretical promise, the application of Large Language Models (LLMs) to fraud detection in real-world financial contexts remains largely unexploited, and their practical effectiveness in handling domain-specific e-commerce transaction data has yet to be empirically validated. To bridge this gap between conventional machine learning limitations and the untapped potential of LLMs in fraud detection, this paper proposes a novel approach that employs Reinforcement Learning (RL) to post-train lightweight language models specifically for fraud detection tasks using only raw transaction data. We utilize the Group Sequence Policy Optimization (GSPO) algorithm combined with a rule-based reward system to fine-tune language models of various sizes on a real-life transaction dataset provided by a Chinese global payment solution company. Through this reinforcement learning framework, the language models are encouraged to explore diverse trust and risk signals embedded within the textual transaction data, including patterns in customer information, shipping details, product descriptions, and order history. Our experimental results demonstrate the effectiveness of this approach, with post-trained language models achieving substantial F1-score improvements on held-out test data. Our findings demonstrate that the observed performance improvements are primarily attributable to the exploration mechanism inherent in reinforcement learning, which allows models to discover novel fraud indicators beyond those captured by traditional engineered features.
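
The rule-based reward is what removes the need for a learned reward model during GSPO post-training. The paper's actual rules are not spelled out in the summary, so the verdict tag and reward values below are invented for illustration:

```python
import re

def rule_based_reward(completion: str, label: str) -> float:
    """Toy rule-based reward: +1 for a correct, well-formatted verdict,
    -1 for a wrong one, and a small penalty when no verdict parses."""
    m = re.search(r"<verdict>(fraud|legit)</verdict>", completion.lower())
    if m is None:
        return -0.5  # format violation: no parseable verdict
    return 1.0 if m.group(1) == label else -1.0

print(rule_based_reward("Reasoning... <verdict>fraud</verdict>", "fraud"))  # 1.0
```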

[211] Adversarial Yet Cooperative: Multi-Perspective Reasoning in Retrieved-Augmented Language Models

Can Xu, Lingyong Yan, Jiayi Wu, Haosen Wang, Shuaiqiang Wang, Yuchen Li, Jizhou Huang, Dawei Yin, Xiang Li

Main category: cs.AI

TL;DR: ARR framework introduces adversarial reasoning between Reasoner and Verifier with process-aware rewards for better RAG performance.

Motivation: Current LRM+RAG approaches have two limitations: 1) single-perspective reasoning without self-correction, and 2) outcome-oriented rewards that don't guide complex multi-step reasoning processes.

Method: Proposes Adversarial Reasoning RAG (ARR) with Reasoner-Verifier framework where both engage in reasoning on retrieved evidence and critique each other’s logic, guided by process-aware advantage rewards combining observational signals with model uncertainty.

Result: Experiments on multiple benchmarks demonstrate the effectiveness of the method.

Conclusion: The ARR framework addresses critical limitations in current LRM+RAG approaches by enabling adversarial reasoning with process-aware rewards for improved reasoning fidelity and verification rigor.

Abstract: Recent advances in synergizing large reasoning models (LRMs) with retrieval-augmented generation (RAG) have shown promising results, yet two critical challenges remain: (1) reasoning models typically operate from a single, unchallenged perspective, limiting their ability to conduct deep, self-correcting reasoning over external documents, and (2) existing training paradigms rely excessively on outcome-oriented rewards, which provide insufficient signal for shaping the complex, multi-step reasoning process. To address these issues, we propose a Reasoner-Verifier framework named Adversarial Reasoning RAG (ARR). The Reasoner and Verifier engage in reasoning on retrieved evidence and critiquing each other’s logic while being guided by a process-aware advantage that requires no external scoring model. This reward combines explicit observational signals with internal model uncertainty to jointly optimize reasoning fidelity and verification rigor. Experiments on multiple benchmarks demonstrate the effectiveness of our method.

[212] A Causal Information-Flow Framework for Unbiased Learning-to-Rank

Haoming Gong, Qingyao Ai, Zhihao Tao, Yongfeng Zhang

Main category: cs.AI

TL;DR: A novel causal learning framework for unbiased learning-to-rank that uses structural causal models and information theory to measure and reduce multiple biases (position, selection, trust) in click data, outperforming existing methods.

Motivation: Click data in web search and recommendation systems suffers from multiple biases (position, selection, trust) that prevent learning true relevance. Existing ULTR methods mainly correct position bias but cannot measure remaining bias, provide risk guarantees, or handle multiple bias sources jointly.

Method: Combines Structural Causal Models (SCMs) with information-theoretic tools. SCMs specify click generation and identify true relevance, while conditional mutual information measures bias leakage into learned relevance estimates. Uses this leakage measure as a regularizer during training and incorporates doubly robust estimator for reliable risk estimation.

Result: Experiments on standard Learning-to-Rank benchmarks show the method consistently reduces measured bias leakage and improves ranking performance, especially in realistic scenarios where multiple biases (position and trust) interact strongly.

Conclusion: The proposed causal learning framework effectively addresses multiple biases in click data through SCMs and information theory, providing better bias measurement and reduction than existing ULTR methods while improving ranking performance.

Abstract: In web search and recommendation systems, user clicks are widely used to train ranking models. However, click data is heavily biased, i.e., users tend to click higher-ranked items (position bias), choose only what was shown to them (selection bias), and trust top results more (trust bias). Without explicitly modeling these biases, the true relevance of ranked items cannot be correctly learned from clicks. Existing Unbiased Learning-to-Rank (ULTR) methods mainly correct position bias and rely on propensity estimation, but they cannot measure remaining bias, provide risk guarantees, or jointly handle multiple bias sources. To overcome these challenges, this paper introduces a novel causal learning-based ranking framework that extends ULTR by combining Structural Causal Models (SCMs) with information-theoretic tools. SCMs specify how clicks are generated and help identify the true relevance signal from click data, while conditional mutual information measures how much bias leaks into the learned relevance estimates. We use this leakage measure to define a rigorous notion of disentanglement and include it as a regularizer during model training to reduce bias. In addition, we incorporate a causal inference estimator, i.e., the doubly robust estimator, to ensure more reliable risk estimation. Experiments on standard Learning-to-Rank benchmarks show that our method consistently reduces measured bias leakage and improves ranking performance, especially in realistic scenarios where multiple biases, such as position and trust bias, interact strongly.
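
The doubly robust piece has a standard closed form: take an imputation model's predicted relevance and add a propensity-weighted correction on observed clicks, so the estimate stays unbiased if either the propensity model or the imputation model is correct. A generic sketch (the CMI-based disentanglement regularizer is the paper's contribution and is not shown):

```python
import numpy as np

def doubly_robust_relevance(clicks, propensities, imputed, observed):
    """DR estimate per item: imputed relevance plus an inverse-propensity
    correction applied only where the item was actually shown."""
    c = np.asarray(clicks, dtype=float)
    p = np.clip(np.asarray(propensities, dtype=float), 1e-6, 1.0)
    r = np.asarray(imputed, dtype=float)
    o = np.asarray(observed, dtype=float)
    return r + o * (c - r) / p

# A click at a low-attention position is upweighted; per-item estimates
# can leave [0, 1], but the estimator is unbiased in expectation.
print(doubly_robust_relevance([1.0], [0.5], [0.4], [1.0]))  # [1.6]
```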

[213] Cumulative Path-Level Semantic Reasoning for Inductive Knowledge Graph Completion

Jiapu Wang, Xinghe Cheng, Zezheng Wu, Ruiqi Ma, Rui Wang, Zhichao Yan, Haoran Luo, Yuhao Jiang, Kai Sun

Main category: cs.AI

TL;DR: CPSR framework improves inductive KGC by adaptively masking noisy structural information and evaluating path-level semantic contributions to handle emerging entities better.

Motivation: Existing inductive KGC methods struggle with noisy structural information and difficulty capturing long-range dependencies in reasoning paths, limiting their effectiveness with emerging entities in dynamic KGs.

Method: Proposes CPSR framework with: 1) query-dependent masking module to adaptively filter noisy structural information while retaining target-relevant information, and 2) global semantic scoring module that evaluates both individual node contributions and collective impact along reasoning paths.

Result: Experimental results demonstrate that CPSR achieves state-of-the-art performance in inductive knowledge graph completion tasks.

Conclusion: CPSR effectively addresses challenges in inductive KGC by simultaneously capturing structural and semantic information, enabling better handling of emerging entities through adaptive noise filtering and comprehensive path-level semantic reasoning.

Abstract: Conventional Knowledge Graph Completion (KGC) methods aim to infer missing information in incomplete Knowledge Graphs (KGs) by leveraging existing information, which struggle to perform effectively in scenarios involving emerging entities. Inductive KGC methods can handle the emerging entities and relations in KGs, offering greater dynamic adaptability. While existing inductive KGC methods have achieved some success, they also face challenges, such as susceptibility to noisy structural information during reasoning and difficulty in capturing long-range dependencies in reasoning paths. To address these challenges, this paper proposes the Cumulative Path-Level Semantic Reasoning for inductive knowledge graph completion (CPSR) framework, which simultaneously captures both the structural and semantic information of KGs to enhance the inductive KGC task. Specifically, the proposed CPSR employs a query-dependent masking module to adaptively mask noisy structural information while retaining important information closely related to the targets. Additionally, CPSR introduces a global semantic scoring module that evaluates both the individual contributions and the collective impact of nodes along the reasoning path within KGs. The experimental results demonstrate that CPSR achieves state-of-the-art performance.

[214] GenCtrl – A Formal Controllability Toolkit for Generative Models

Emily Cheng, Carmen Amo Alonso, Federico Danieli, Arno Blaas, Luca Zappella, Pau Rodriguez, Xavier Suau

Main category: cs.AI

TL;DR: The paper provides a theoretical framework to analyze whether generative models are truly controllable, proposing an algorithm to estimate controllable sets with formal guarantees and demonstrating that model controllability is surprisingly fragile.

Motivation: As generative models become ubiquitous, there's a critical need for fine-grained control over generation processes. While many controlled generation methods exist, it remains unclear whether these models are fundamentally controllable in the first place.

Method: The paper frames human-model interaction as a control process and proposes a novel algorithm to estimate the controllable sets of models in a dialogue setting. It provides formal probably-approximately correct (PAC) bounds for controllable set estimates that are distribution-free, require no assumptions except output boundedness, and work for any black-box nonlinear control system (any generative model).

Result: Empirical demonstrations on different tasks in controlling dialogue processes (for both language models and text-to-image generation) show that model controllability is surprisingly fragile and highly dependent on the experimental setting.

Conclusion: The work highlights the need for rigorous controllability analysis, shifting the focus from simply attempting control to first understanding its fundamental limits. The theoretical framework provides formal guarantees for assessing model controllability.

Abstract: As generative models become ubiquitous, there is a critical need for fine-grained control over the generation process. Yet, while controlled generation methods from prompting to fine-tuning proliferate, a fundamental question remains unanswered: are these models truly controllable in the first place? In this work, we provide a theoretical framework to formally answer this question. Framing human-model interaction as a control process, we propose a novel algorithm to estimate the controllable sets of models in a dialogue setting. Notably, we provide formal guarantees on the estimation error as a function of sample complexity: we derive probably-approximately correct bounds for controllable set estimates that are distribution-free, employ no assumptions except for output boundedness, and work for any black-box nonlinear control system (i.e., any generative model). We empirically demonstrate the theoretical framework on different tasks in controlling dialogue processes, for both language models and text-to-image generation. Our results show that model controllability is surprisingly fragile and highly dependent on the experimental setting. This highlights the need for rigorous controllability analysis, shifting the focus from simply attempting control to first understanding its fundamental limits.
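
For a feel of what a distribution-free guarantee looks like: a Hoeffding-style bound gives the number of black-box rollouts needed to pin down any bounded empirical frequency (for example, how often a target output is reached) to within eps with confidence 1 - delta. This is the generic bound, not the paper's controllable-set result:

```python
import math

def hoeffding_sample_size(eps: float, delta: float) -> int:
    """Samples needed so an empirical mean of a [0,1] quantity is within
    eps of its true value with probability at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

# +/-5% error at 99% confidence needs ~1060 rollouts of the black box.
print(hoeffding_sample_size(0.05, 0.01))  # 1060
```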

[215] HAG: Hierarchical Demographic Tree-based Agent Generation for Topic-Adaptive Simulation

Rongxin Chen, Tianyu Wu, Bingbing Xu, Xiucheng Xu, Huawei Shen

Main category: cs.AI

TL;DR: HAG: Hierarchical Agent Generation framework for high-fidelity agent initialization in Agent-Based Modeling, addressing both macro-level distribution alignment and micro-level individual rationality through a two-stage process.

Motivation: Existing approaches for agent initialization have limitations - static data-based methods fail to adapt to unseen topics, while LLM-based methods lack macro-level distribution awareness, leading to inconsistencies between persona attributes and reality.

Method: Two-stage hierarchical framework: 1) Use World Knowledge Model to infer hierarchical conditional probabilities and construct Topic-Adaptive Tree for macro-level distribution alignment; 2) Grounded real-world data instantiation and agentic augmentation for micro-level consistency.

Result: HAG significantly outperforms baselines, reducing population alignment errors by average 37.7% and enhancing sociological consistency by 18.8% on multi-domain benchmark with PACE evaluation framework.

Conclusion: HAG provides a robust framework for high-fidelity agent initialization that achieves both macro-level distribution alignment and micro-level individual rationality, addressing limitations of existing approaches.

Abstract: High-fidelity agent initialization is crucial for credible Agent-Based Modeling across diverse domains. A robust framework should be Topic-Adaptive, capturing macro-level joint distributions while ensuring micro-level individual rationality. Existing approaches fall into two categories: static data-based retrieval methods that fail to adapt to unseen topics absent from the data, and LLM-based generation methods that lack macro-level distribution awareness, resulting in inconsistencies between micro-level persona attributes and reality. To address these problems, we propose HAG, a Hierarchical Agent Generation framework that formalizes population generation as a two-stage decision process. First, a World Knowledge Model is used to infer hierarchical conditional probabilities and construct the Topic-Adaptive Tree, achieving macro-level distribution alignment. Then, grounded in real-world data, instantiation and agentic augmentation are carried out to ensure micro-level consistency. Given the lack of specialized evaluation, we establish a multi-domain benchmark and a comprehensive PACE evaluation framework. Extensive experiments show that HAG significantly outperforms representative baselines, reducing population alignment errors by an average of 37.7% and enhancing sociological consistency by 18.8%.
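
The two-stage idea can be miniaturised as root-to-leaf sampling from a tree of conditional distributions, which matches the macro-level joint distribution by construction. Every number below is invented; in HAG the World Knowledge Model infers these conditionals per topic:

```python
import random

# Hypothetical topic-adaptive tree: each level conditions on the path so far.
TREE = {
    (): {"18-34": 0.40, "35-64": 0.45, "65+": 0.15},
    ("18-34",): {"urban": 0.7, "rural": 0.3},
    ("35-64",): {"urban": 0.55, "rural": 0.45},
    ("65+",): {"urban": 0.4, "rural": 0.6},
}

def sample_persona(tree, depth=2):
    """Walk the hierarchical conditionals root-to-leaf to draw one
    persona whose attributes respect the macro-level distribution."""
    path = ()
    for _ in range(depth):
        dist = tree[path]
        path += (random.choices(list(dist), weights=dist.values())[0],)
    return path

print(sample_persona(TREE))  # e.g. ('35-64', 'urban')
```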

[216] CHDP: Cooperative Hybrid Diffusion Policies for Reinforcement Learning in Parameterized Action Space

Bingyi Liu, Jinbo He, Haiyong Shi, Enshu Wang, Weizhen Han, Jingxiang Hao, Peixi Wang, Zhuangzhuang Zhang

Main category: cs.AI

TL;DR: CHDP is a novel framework that treats hybrid action spaces as cooperative games, using discrete and continuous diffusion policies that work together with sequential updates and codebook embeddings to improve expressiveness and scalability.

Motivation: Hybrid action spaces combining discrete choices and continuous parameters are common in robotics and game AI, but current methods struggle with limited policy expressiveness and poor scalability in high-dimensional settings.

Method: Proposes Cooperative Hybrid Diffusion Policies (CHDP) with two cooperative agents: one using discrete diffusion policy and one using continuous diffusion policy conditioned on discrete action representations. Uses sequential updates to avoid conflicts, codebook embeddings for high-dimensional discrete spaces, and Q-function guidance for alignment.

Result: CHDP outperforms state-of-the-art methods by up to 19.3% in success rate on challenging hybrid action benchmarks.

Conclusion: The cooperative game perspective combined with diffusion policies, sequential updates, and codebook embeddings effectively addresses hybrid action space challenges, achieving significant performance improvements.

Abstract: Hybrid action space, which combines discrete choices and continuous parameters, is prevalent in domains such as robot control and game AI. However, efficiently modeling and optimizing hybrid discrete-continuous action space remains a fundamental challenge, mainly due to limited policy expressiveness and poor scalability in high-dimensional settings. To address this challenge, we view the hybrid action space problem as a fully cooperative game and propose a Cooperative Hybrid Diffusion Policies (CHDP) framework to solve it. CHDP employs two cooperative agents that leverage a discrete and a continuous diffusion policy, respectively. The continuous policy is conditioned on the discrete action’s representation, explicitly modeling the dependency between them. This cooperative design allows the diffusion policies to leverage their expressiveness to capture complex distributions in their respective action spaces. To mitigate the update conflicts arising from simultaneous policy updates in this cooperative setting, we employ a sequential update scheme that fosters co-adaptation. Moreover, to improve scalability when learning in high-dimensional discrete action space, we construct a codebook that embeds the action space into a low-dimensional latent space. This mapping enables the discrete policy to learn in a compact, structured space. Finally, we design a Q-function-based guidance mechanism to align the codebook’s embeddings with the discrete policy’s representation during training. On challenging hybrid action benchmarks, CHDP outperforms the state-of-the-art method by up to 19.3% in success rate.

[217] Circular Reasoning: Understanding Self-Reinforcing Loops in Large Reasoning Models

Zenghao Duan, Liang Pang, Zihao Wei, Wenbin Duan, Yuxin Tian, Shicheng Xu, Jingcheng Deng, Zhiyi Yin, Xueqi Cheng

Main category: cs.AI

TL;DR: The paper introduces Circular Reasoning as a failure mode in Large Reasoning Models where they get stuck in self-reinforcing repetitive loops, proposes LoopBench dataset to study this, analyzes the mechanism as state collapse with V-shaped attention patterns, and uses CUSUM algorithm for early loop prediction.

Motivation: Despite the success of test-time scaling, Large Reasoning Models frequently encounter repetitive loops that lead to computational waste and inference failure. The authors identify a distinct failure mode called Circular Reasoning where models get trapped in self-reinforcing cycles.

Method: 1) Introduce LoopBench dataset with two loop typologies: numerical loops and statement loops. 2) Mechanistically characterize circular reasoning as state collapse with distinct boundaries. 3) Analyze the self-reinforcing V-shaped attention mechanism driving loops. 4) Employ Cumulative Sum (CUSUM) algorithm to capture precursors for early loop prediction.

Result: Experiments across diverse LRMs validate the accuracy of the CUSUM algorithm for early loop prediction and elucidate the stability of long-chain reasoning. The analysis reveals that reasoning impasses trigger loop onset, which then persists as inescapable cycles.

Conclusion: The paper systematically analyzes Circular Reasoning in LRMs, provides a dataset and mechanistic understanding of the phenomenon, and demonstrates an effective method for early loop prediction using CUSUM algorithm, contributing to improving the stability of long-chain reasoning in large models.

Abstract: Despite the success of test-time scaling, Large Reasoning Models (LRMs) frequently encounter repetitive loops that lead to computational waste and inference failure. In this paper, we identify a distinct failure mode termed Circular Reasoning. Unlike traditional model degeneration, this phenomenon manifests as a self-reinforcing trap where generated content acts as a logical premise for its own recurrence, compelling the reiteration of preceding text. To systematically analyze this phenomenon, we introduce LoopBench, a dataset designed to capture two distinct loop typologies: numerical loops and statement loops. Mechanistically, we characterize circular reasoning as a state collapse exhibiting distinct boundaries, where semantic repetition precedes textual repetition. We reveal that reasoning impasses trigger the loop onset, which subsequently persists as an inescapable cycle driven by a self-reinforcing V-shaped attention mechanism. Guided by these findings, we employ the Cumulative Sum (CUSUM) algorithm to capture these precursors for early loop prediction. Experiments across diverse LRMs validate its accuracy and elucidate the stability of long-chain reasoning.
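
CUSUM itself is only a few lines: accumulate the positive drift of a repetition signal above its baseline and raise an alarm once the sum crosses a threshold. The signal and constants here are illustrative; the paper applies CUSUM to the loop precursors it identifies, such as rising semantic repetition:

```python
def cusum_alarm(signal, baseline, k=0.5, h=2.0):
    """One-sided CUSUM: return the first step at which cumulative drift
    above (baseline + k) exceeds h, or None if it never does."""
    s = 0.0
    for t, x in enumerate(signal):
        s = max(0.0, s + (x - baseline) - k)
        if s > h:
            return t
    return None

# A repetition score that drifts upward triggers an early alarm at step 8.
print(cusum_alarm([0.1, 0.2, 0.1, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4], baseline=0.2))
```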

[218] Logic-Parametric Neuro-Symbolic NLI: Controlling Logical Formalisms for Verifiable LLM Reasoning

Ali Farjami, Luca Redondi, Marco Valentino

Main category: cs.AI

TL;DR: A logic-parametric neuro-symbolic framework for NLI that treats logic as a controllable component, enabling systematic comparison of different formalisms and showing logic-internal approaches outperform logic-external ones, with effectiveness being domain-dependent.

Motivation: Existing neuro-symbolic approaches for verifiable natural language inference rely on fixed logical formalisms, limiting robustness and adaptability. There's a need for a more flexible framework that can systematically compare different logics and adapt to domain requirements.

Method: Propose a logic-parametric framework using LogiKEy methodology to embed various classical and non-classical formalisms into higher-order logic (HOL). Compare logic-external approaches (normative requirements via axioms) with logic-internal approaches (normative patterns from logic’s built-in structure), focusing on normative reasoning.

Result: Logic-internal strategies consistently improve performance and produce more efficient hybrid proofs for NLI. Effectiveness is domain-dependent: first-order logic favors commonsense reasoning, while deontic and modal logics excel in ethical domains.

Conclusion: Making logic a first-class, parametric element in neuro-symbolic architectures enables more robust, modular, and adaptable reasoning, with logic-internal approaches showing superior performance and domain-specific logic choices being crucial.

Abstract: Large language models (LLMs) and theorem provers (TPs) can be effectively combined for verifiable natural language inference (NLI). However, existing approaches rely on a fixed logical formalism, a feature that limits robustness and adaptability. We propose a logic-parametric framework for neuro-symbolic NLI that treats the underlying logic not as a static background, but as a controllable component. Using the LogiKEy methodology, we embed a range of classical and non-classical formalisms into higher-order logic (HOL), enabling a systematic comparison of inference quality, explanation refinement, and proof behavior. We focus on normative reasoning, where the choice of logic has significant implications. In particular, we compare logic-external approaches, where normative requirements are encoded via axioms, with logic-internal approaches, where normative patterns emerge from the logic’s built-in structure. Extensive experiments demonstrate that logic-internal strategies can consistently improve performance and produce more efficient hybrid proofs for NLI. In addition, we show that the effectiveness of a logic is domain-dependent, with first-order logic favouring commonsense reasoning, while deontic and modal logics excel in ethical domains. Our results highlight the value of making logic a first-class, parametric element in neuro-symbolic architectures for more robust, modular, and adaptable reasoning.

[219] Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding

Yuxuan Zhou, Fei Huang, Heng Li, Fengyi Wu, Tianyu Wang, Jianwei Zhang, Junyang Lin, Zhi-Qi Cheng

Main category: cs.AI

TL;DR: HSD is a lossless verification method for speculative decoding that boosts accepted tokens by balancing probability mass across branches, achieving 12% performance gain in EAGLE-3.

Motivation: Verification is a key bottleneck in speculative decoding. Existing sequence-level verification methods rely on approximations or partial information, struggling with joint intractability.

Method: Hierarchical Speculative Decoding (HSD) - a provably lossless verification method that overcomes joint intractability by balancing excess and deficient probability mass across accessible branches.

Result: HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Integrating HSD into EAGLE-3 yields over 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity.

Conclusion: HSD is a strong, explainable, and general verification method that can be integrated into various speculative decoding frameworks to significantly boost decoding efficiency while maintaining distribution fidelity.

Abstract: Verification is a key bottleneck in improving inference speed while maintaining distribution fidelity in Speculative Decoding. Recent work has shown that sequence-level verification leads to a higher number of accepted tokens compared to token-wise verification. However, existing solutions often rely on surrogate approximations or are constrained by partial information, struggling with joint intractability. In this work, we propose Hierarchical Speculative Decoding (HSD), a provably lossless verification method that significantly boosts the expected number of accepted tokens and overcomes joint intractability by balancing excess and deficient probability mass across accessible branches. Our extensive large-scale experiments demonstrate that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks. Moreover, its strong explainability and generality make it readily integrable into a wide range of speculative decoding frameworks. Notably, integrating HSD into EAGLE-3 yields over a 12% performance gain, establishing state-of-the-art decoding efficiency without compromising distribution fidelity. Code is available at https://github.com/ZhouYuxuanYX/Hierarchical-Speculative-Decoding.
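
For context, the token-wise lossless rule that sequence-level schemes such as HSD generalise: accept a draft token x ~ q with probability min(1, p(x)/q(x)), otherwise resample from the normalised residual max(p - q, 0), which provably leaves the output distributed exactly as the target p. HSD's hierarchical balancing across branches is not reproduced here:

```python
import numpy as np

def verify_token(p, q, x, rng):
    """Token-wise lossless verification: accept the draft token x with
    probability min(1, p[x]/q[x]); on rejection, resample from the
    normalised residual so the output is exactly distributed as p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if rng.random() < min(1.0, p[x] / q[x]):  # q[x] > 0 since x ~ q
        return x, True
    residual = np.clip(p - q, 0.0, None)
    return rng.choice(len(p), p=residual / residual.sum()), False

rng = np.random.default_rng(0)
print(verify_token([0.7, 0.2, 0.1], [0.4, 0.4, 0.2], x=1, rng=rng))
```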

[220] PII-VisBench: Evaluating Personally Identifiable Information Safety in Vision Language Models Along a Continuum of Visibility

G M Shahariar, Zabir Al Nazi, Md Olid Hasan Bhuiyan, Zhouxing Shi

Main category: cs.AI

TL;DR: PII-VisBench benchmark evaluates how Vision Language Models’ PII leakage varies with subjects’ online presence visibility, showing models disclose more PII for high-visibility subjects.

Motivation: Existing VLM privacy evaluations treat PII leakage as static extraction and ignore how subjects' online presence (volume of available data) influences privacy alignment, creating a gap in understanding real-world privacy risks.

Method: Created PII-VisBench with 4000 unique probes across 200 subjects stratified into four visibility categories (high, medium, low, zero) based on online information availability. Evaluated 18 open-source VLMs (0.3B-32B) using Refusal Rate and Conditional PII Disclosure Rate metrics, plus paraphrasing and jailbreak-style attack prompts.

Result: Models show consistent pattern: refusals increase and PII disclosures decrease (9.10% high to 5.34% low) as subject visibility drops. High-visibility subjects trigger more PII disclosures, with substantial model-family heterogeneity and PII-type disparities. Attack prompts expose visibility-dependent failures.

Conclusion: Online presence visibility significantly impacts VLM privacy behavior, necessitating visibility-aware safety evaluation and training interventions to address real-world privacy risks.

Abstract: Vision Language Models (VLMs) are increasingly integrated into privacy-critical domains, yet existing evaluations of personally identifiable information (PII) leakage largely treat privacy as a static extraction task and ignore how a subject’s online presence (the volume of their data available online) influences privacy alignment. We introduce PII-VisBench, a novel benchmark containing 4000 unique probes designed to evaluate VLM safety through the continuum of online presence. The benchmark stratifies 200 subjects into four visibility categories (high, medium, low, and zero) based on the extent and nature of their information available online. We evaluate 18 open-source VLMs (0.3B-32B) based on two key metrics: percentage of PII probing queries refused (Refusal Rate) and the fraction of non-refusal responses flagged for containing PII (Conditional PII Disclosure Rate). Across models, we observe a consistent pattern: refusals increase and PII disclosures decrease (9.10% high to 5.34% low) as subject visibility drops. We identify that models are more likely to disclose PII for high-visibility subjects, alongside substantial model-family heterogeneity and PII-type disparities. Finally, paraphrasing and jailbreak-style prompts expose attack and model-dependent failures, motivating visibility-aware safety evaluation and training interventions.
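
Both reported metrics reduce to simple counting over per-probe (refused, leaked) flags; a minimal sketch:

```python
def pii_metrics(responses):
    """responses: list of (refused, leaked_pii) booleans per probe.
    Returns (Refusal Rate over all probes, Conditional PII Disclosure
    Rate over the non-refused probes only)."""
    refusal_rate = sum(r for r, _ in responses) / len(responses)
    answered = [leaked for refused, leaked in responses if not refused]
    disclosure_rate = sum(answered) / len(answered) if answered else 0.0
    return refusal_rate, disclosure_rate

# 4 probes: 1 refusal; of the 3 answered, 1 leaked PII -> (0.25, 0.333...)
print(pii_metrics([(True, False), (False, True), (False, False), (False, False)]))
```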

[221] DynaDebate: Breaking Homogeneity in Multi-Agent Debate with Dynamic Path Generation

Zhenghao Li, Zhi Zheng, Wei Chen, Jielun Zhao, Yong Chen, Tong Xu, Enhong Chen

Main category: cs.AI

TL;DR: DynaDebate introduces a dynamic multi-agent debate framework with three key mechanisms to overcome limitations of existing MAD approaches, achieving superior performance across benchmarks.

Motivation: Existing Multi-Agent Debate frameworks suffer from unguided initialization causing agents to adopt identical reasoning paths and errors, leading to ineffective debates that degenerate into simple majority voting rather than true collaborative problem-solving.

Method: DynaDebate employs three key mechanisms: 1) Dynamic Path Generation and Allocation using a dedicated Path Generation Agent to create diverse solution paths with adaptive redundancy; 2) Process-Centric Debate focusing on step-by-step logic critique rather than outcome voting; 3) Trigger-Based Verification Agent activated upon disagreement to use external tools for objective deadlock resolution.

Result: Extensive experiments demonstrate that DynaDebate achieves superior performance across various benchmarks, surpassing existing state-of-the-art Multi-Agent Debate methods.

Conclusion: DynaDebate effectively addresses the limitations of existing MAD frameworks by introducing dynamic path generation, process-centric debate, and verification mechanisms, resulting in more effective collaborative problem-solving and reasoning in multi-agent systems.

Abstract: Recent years have witnessed the rapid development of Large Language Model-based Multi-Agent Systems (MAS), which excel at collaborative decision-making and complex problem-solving. Recently, researchers have further investigated Multi-Agent Debate (MAD) frameworks, which enhance the reasoning and collaboration capabilities of MAS through information exchange and debate among multiple agents. However, existing approaches often rely on unguided initialization, causing agents to adopt identical reasoning paths that lead to the same errors. As a result, effective debate among agents is hindered, and the final outcome frequently degenerates into simple majority voting. To solve the above problem, in this paper, we introduce Dynamic Multi-Agent Debate (DynaDebate), which enhances the effectiveness of multi-agent debate through three key mechanisms: (1) Dynamic Path Generation and Allocation, which employs a dedicated Path Generation Agent to generate diverse and logical solution paths with adaptive redundancy; (2) Process-Centric Debate, which shifts the focus from surface-level outcome voting to rigorous step-by-step logic critique to ensure process correctness; (3) A Trigger-Based Verification Agent, which is activated upon disagreement and uses external tools to objectively resolve deadlocks. Extensive experiments demonstrate that DynaDebate achieves superior performance across various benchmarks, surpassing existing state-of-the-art MAD methods.

[222] From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu

Main category: cs.AI

TL;DR: BEPA (Bi-Level Expert-to-Policy Assimilation) improves end-to-end vision-language GUI agents by effectively leveraging limited expert trajectories through a two-level approach that aligns expert data with policy learning.

Motivation: Current GUI agents face two bottlenecks: limited verifiable tasks in datasets like OSWorld (only a few hundred), and difficulty scaling expert trajectory collection. End-to-end screenshot-to-action policies lag behind framework-based systems, and naive mixing of expert trajectories with RLVR is brittle due to structural mismatch and distribution shift.

Method: BEPA uses a bi-level approach: LEVEL-1 generates policy-aligned guidance by rolling out reachable trajectories under the base policy from expert states; LEVEL-2 maintains a per-task, dynamically updated cache of these aligned trajectories for use in RLVR training.

Result: On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%. Consistent gains also observed on MMBench-GUI and Online-Mind2Web benchmarks.

Conclusion: BEPA effectively bridges the gap between limited expert demonstrations and end-to-end policy learning for GUI agents, enabling better utilization of scarce expert data through policy-aligned trajectory assimilation.

Abstract: Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: https://github.com/LEON-gittech/Verl_GUI.git
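
The LEVEL-2 cache can be pictured as a per-task top-k store of policy-aligned trajectories that RLVR batches draw from. Everything below (ranking by return, the field layout) is an illustrative guess rather than the paper's implementation:

```python
from collections import defaultdict

class TaskTrajectoryCache:
    """Hypothetical per-task cache: keep the k best-scoring aligned
    trajectories per task and refresh them as new rollouts arrive."""
    def __init__(self, k=4):
        self.k = k
        self.store = defaultdict(list)  # task_id -> [(return, trajectory)]

    def update(self, task_id, trajectory, ret):
        entries = self.store[task_id]
        entries.append((ret, trajectory))
        entries.sort(key=lambda e: e[0], reverse=True)
        del entries[self.k:]  # keep only the top-k

    def best(self, task_id):
        entries = self.store[task_id]
        return entries[0][1] if entries else None

cache = TaskTrajectoryCache(k=2)
cache.update("task-7", ["click", "type"], ret=1.0)
print(cache.best("task-7"))  # ['click', 'type']
```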

[223] StackPlanner: A Centralized Hierarchical Multi-Agent System with Task-Experience Memory Management

Ruizhe Zhang, Xinke Jiang, Zhibang Yang, Zhixin Zhang, Jiaran Gao, Yuzhen Xiao, Hongbin Lai, Xu Chu, Junfeng Zhao, Yasha Wang

Main category: cs.AI

TL;DR: StackPlanner is a hierarchical multi-agent framework with explicit memory control that addresses context bloat, error accumulation, and poor generalization in LLM-based multi-agent systems through task-level memory management and reusable coordination experience learning.

Motivation: Centralized LLM-based multi-agent systems suffer from unstable long-horizon collaboration due to lack of memory management, leading to context bloat, error accumulation, and poor cross-task generalization. There's a need to address both task-level memory inefficiency and inability to reuse coordination experience.

Method: StackPlanner uses hierarchical architecture that decouples high-level coordination from subtask execution with active task-level memory control. It employs structured experience memory and reinforcement learning to retrieve and exploit reusable coordination experience.

Result: Experiments on multiple deep-search and agent system benchmarks demonstrate StackPlanner’s effectiveness in enabling reliable long-horizon multi-agent collaboration.

Conclusion: StackPlanner successfully addresses memory inefficiency and coordination experience reuse challenges in LLM-based multi-agent systems, enabling more stable and effective long-horizon collaboration.

Abstract: Multi-agent systems based on large language models, particularly centralized architectures, have recently shown strong potential for complex and knowledge-intensive tasks. However, central agents often suffer from unstable long-horizon collaboration due to the lack of memory management, leading to context bloat, error accumulation, and poor cross-task generalization. To address both task-level memory inefficiency and the inability to reuse coordination experience, we propose StackPlanner, a hierarchical multi-agent framework with explicit memory control. StackPlanner addresses these challenges by decoupling high-level coordination from subtask execution with active task-level memory control, and by learning to retrieve and exploit reusable coordination experience via structured experience memory and reinforcement learning. Experiments on multiple deep-search and agent system benchmarks demonstrate the effectiveness of our approach in enabling reliable long-horizon multi-agent collaboration.

[224] TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison

Main category: cs.AI

TL;DR: TowerMind is a lightweight, multimodal tower defense environment for evaluating LLM agents’ planning, decision-making, and hallucination, revealing performance gaps between LLMs and humans.

Motivation: Existing RTS game environments for LLM evaluation have high computational demands or lack textual observations, limiting their use for assessing LLM agents' long-term planning and decision-making capabilities.

Method: Developed TowerMind, a tower defense RTS environment with low computational demands and multimodal observations (pixel-based, textual, structured game-state). Designed 5 benchmark levels to evaluate LLMs under different input settings.

Result: Revealed clear performance gap between LLMs and human experts in both capability and hallucination dimensions. Identified key LLM limitations: inadequate planning validation, lack of multifinality in decision-making, and inefficient action use.

Conclusion: TowerMind provides a lightweight, multimodal benchmark complementing existing RTS environments, offering new evaluation capabilities for AI agents including hallucination assessment and high customizability.

Abstract: Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(https://github.com/tb6147877/TowerMind).

[225] Open-Vocabulary 3D Instruction Ambiguity Detection

Jiayu Ding, Haoran Tang, Ge Li

Main category: cs.AI

TL;DR: The paper introduces Open-Vocabulary 3D Instruction Ambiguity Detection, a new safety-critical task for embodied AI, and presents Ambi3D benchmark with 700+ scenes and 22k instructions. It shows current 3D LLMs struggle with ambiguity detection and proposes AmbiVer, a two-stage visual evidence collection framework that improves performance.

DetailsMotivation: In safety-critical domains like surgery, linguistic ambiguity can lead to catastrophic errors, but most embodied AI research overlooks this issue by assuming instructions are clear and focusing only on execution rather than confirmation.

Method: The paper proposes AmbiVer, a two-stage framework that first collects explicit visual evidence from multiple views in the 3D scene, then uses this evidence to guide a vision-language model in judging whether an instruction is ambiguous or unambiguous.
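
A toy distillation of the two-stage idea, with a rule-based stand-in for the VLM judge; the names and heuristics are illustrative assumptions, not the paper's implementation.

```python
# Stage 1 gathers per-view evidence; stage 2 judges. A counting rule stands
# in for AmbiVer's VLM judgment here.
def collect_evidence(views: list[dict], instruction: str) -> list[str]:
    """Stage 1: from each rendered view, keep objects the instruction could refer to."""
    target = instruction.lower()
    hits = []
    for view in views:
        hits += [obj_id for obj_id, label in view.items() if label in target]
    return sorted(set(hits))

def judge(candidates: list[str]) -> str:
    """Stage 2: a VLM would weigh the evidence; here, counting referents suffices."""
    if not candidates:
        return "unresolvable"
    return "unambiguous" if len(candidates) == 1 else "ambiguous"

views = [{"vial_1": "vial", "tray_1": "tray"}, {"vial_2": "vial"}]
print(judge(collect_evidence(views, "Pass me the vial")))  # -> "ambiguous"
```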

Result: State-of-the-art 3D LLMs struggle with ambiguity detection. AmbiVer demonstrates effectiveness in addressing this challenge, showing improved performance on the Ambi3D benchmark with 700+ diverse 3D scenes and 22k instructions.

Conclusion: The work defines a fundamental new task for safer embodied AI, provides a large-scale benchmark, and proposes an effective solution that paves the way for more trustworthy AI systems in safety-critical applications.

Abstract: In safety-critical domains, linguistic ambiguity can have severe consequences; a vague command like “Pass me the vial” in a surgical setting could lead to catastrophic errors. Yet, most embodied AI research overlooks this, assuming instructions are clear and focusing on execution rather than confirmation. To address this critical safety gap, we are the first to define Open-Vocabulary 3D Instruction Ambiguity Detection, a fundamental new task where a model must determine if a command has a single, unambiguous meaning within a given 3D scene. To support this research, we build Ambi3D, a large-scale benchmark for this task, featuring over 700 diverse 3D scenes and around 22k instructions. Our analysis reveals a surprising limitation: state-of-the-art 3D Large Language Models (LLMs) struggle to reliably determine if an instruction is ambiguous. To address this challenge, we propose AmbiVer, a two-stage framework that collects explicit visual evidence from multiple views and uses it to guide a vision-language model (VLM) in judging instruction ambiguity. Extensive experiments demonstrate the challenge of our task and the effectiveness of AmbiVer, paving the way for safer and more trustworthy embodied AI. Code and dataset available at https://jiayuding031020.github.io/ambi3d/.

[226] MIPO: Mutual Integration of Patient Journey and Medical Ontology for Healthcare Representation Learning

Xueping Peng, Guodong Long, Tao Shen, Sen Wang, Chengqi Zhang, Allison Clarke, Clement Schlegel

Main category: cs.AI

TL;DR: MIPO is a Transformer-based framework that mutually integrates patient journey data and medical ontologies for EHR representation learning, improving performance under both sufficient and limited data conditions.

DetailsMotivation: Existing EHR representation learning methods struggle with limited data and fail to fully leverage both comprehensive medical ontologies and patient journey contexts. Current approaches use small/uniform ontologies and insufficiently capture patient journey dependencies.

Method: MIPO uses a Transformer-based architecture with two main components: (1) sequential diagnosis prediction task for task-specific representation learning, and (2) ontology-based disease-typing task. A graph-embedding module integrates patient visit records, creating a mutually reinforcing loop between patient-journey and ontology embeddings.
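
The two-task setup amounts to a joint objective. A minimal sketch follows, assuming cross-entropy losses for both tasks and a weighting factor lam; these choices are our assumptions, not MIPO's exact formulation.

```python
# Two-task objective: sequential diagnosis prediction plus ontology-based
# disease typing. Loss choices and weighting are illustrative assumptions.
import torch
import torch.nn.functional as F

def mipo_loss(diag_logits, diag_targets, type_logits, type_targets, lam=0.5):
    diagnosis_loss = F.cross_entropy(diag_logits, diag_targets)  # patient-journey task
    typing_loss = F.cross_entropy(type_logits, type_targets)     # ontology task
    return diagnosis_loss + lam * typing_loss

diag_logits = torch.randn(8, 100, requires_grad=True)  # 8 visits, 100 diagnosis codes
type_logits = torch.randn(8, 12, requires_grad=True)   # 12 ontology disease types
loss = mipo_loss(diag_logits, torch.randint(0, 100, (8,)),
                 type_logits, torch.randint(0, 12, (8,)))
loss.backward()
print(float(loss))
```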

Result: MIPO consistently outperforms baseline methods on two real-world benchmark datasets under both sufficient and limited data conditions. The resulting diagnosis embeddings also offer improved interpretability.

Conclusion: MIPO demonstrates promise for real-world healthcare applications by effectively integrating patient journey data with medical ontologies, addressing data insufficiency issues while producing interpretable medical representations.

Abstract: Representation learning on electronic health records (EHRs) plays a vital role in downstream medical prediction tasks. Although natural language processing techniques, such as recurrent neural networks and self-attention, have been adapted for learning medical representations from hierarchical, time-stamped EHR data, they often struggle when either general or task-specific data are limited. Recent efforts have attempted to mitigate this challenge by incorporating medical ontologies (i.e., knowledge graphs) into self-supervised tasks like diagnosis prediction. However, two main issues remain: (1) small and uniform ontologies that lack diversity for robust learning, and (2) insufficient attention to the critical contexts or dependencies underlying patient journeys, which could further enhance ontology-based learning. To address these gaps, we propose MIPO (Mutual Integration of Patient Journey and Medical Ontology), a robust end-to-end framework that employs a Transformer-based architecture for representation learning. MIPO emphasizes task-specific representation learning through a sequential diagnosis prediction task, while also incorporating an ontology-based disease-typing task. A graph-embedding module is introduced to integrate information from patient visit records, thus alleviating data insufficiency. This setup creates a mutually reinforcing loop, where both patient-journey embedding and ontology embedding benefit from each other. We validate MIPO on two real-world benchmark datasets, showing that it consistently outperforms baseline methods under both sufficient and limited data conditions. Furthermore, the resulting diagnosis embeddings offer improved interpretability, underscoring the promise of MIPO for real-world healthcare applications.

[227] KALE-LM-Chem: Vision and Practice Toward an AI Brain for Chemistry

Weichen Dai, Yezeng Chen, Zijie Dai, Yubo Liu, Zhijie Huang, Yixuan Pan, Baiyang Song, Chengli Zhong, Xinhe Li, Zeyu Wang, Zhuoying Feng, Yi Zhou

Main category: cs.AI

TL;DR: The paper introduces a vision for an AI-powered chemical brain using LLMs with four core capabilities, and presents KALE-LM-Chem models that achieve strong chemistry performance.

DetailsMotivation: To leverage recent LLM advancements to build domain-specific intelligence for chemistry, accelerating scientific discovery through AI assistance.

Method: Proposes a framework with four core capabilities (information extraction, semantic parsing, knowledge-based QA, reasoning & planning) and introduces KALE-LM-Chem and KALE-LM-Chem-1.5 models specifically designed for chemistry.

Result: The KALE-LM-Chem models have achieved outstanding performance in chemistry-related tasks, serving as a strong starting point for chemical intelligence systems.

Conclusion: Domain knowledge and logic are essential for AI systems to assist scientific discovery, and this work provides a foundation for realizing more intelligent AI to advance science, technology, and society.

Abstract: Recent advancements in large language models (LLMs) have demonstrated strong potential for enabling domain-specific intelligence. In this work, we present our vision for building an AI-powered chemical brain, which frames chemical intelligence around four core capabilities: information extraction, semantic parsing, knowledge-based QA, and reasoning & planning. We argue that domain knowledge and logic are essential pillars for enabling such a system to assist and accelerate scientific discovery. To initiate this effort, we introduce our first generation of large language models for chemistry: KALE-LM-Chem and KALE-LM-Chem-1.5, which have achieved outstanding performance in tasks related to the field of chemistry. We hope that our work serves as a strong starting point, helping to realize more intelligent AI and promoting the advancement of human science and technology, as well as societal development.

[228] Detection of LLM-Paraphrased Code and Identification of the Responsible LLM Using Coding Style Features

Shinwoo Park, Hyundong Jin, Jeong-won Cha, Yo-Sub Han

Main category: cs.AI

TL;DR: Proposes LPcodedec, a method to detect LLM-paraphrased code and identify which LLM was used, based on coding style differences like naming consistency and structure.

DetailsMotivation: Growing concerns about intellectual property protection as LLMs can produce paraphrased versions of proprietary code, creating urgent need for detection systems.

Method: Constructs LPcode dataset of human-written and LLM-paraphrased code pairs, analyzes coding style differences, and develops LPcodedec detection method.
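
The style dimensions the paper analyzes (naming consistency, code structure, readability) can be approximated with simple heuristics; the concrete features below are illustrative assumptions, not LPcodedec's exact feature set.

```python
# Illustrative extraction of coding-style features; the heuristics are our
# stand-ins for the feature families the paper studies.
import re

def style_features(code: str) -> dict[str, float]:
    lines = code.splitlines() or [""]
    names = re.findall(r"\b(?:def|class)\s+(\w+)", code)
    snake = sum(1 for n in names if "_" in n or n.islower())
    return {
        "naming_consistency": snake / len(names) if names else 1.0,
        "avg_line_length": sum(map(len, lines)) / len(lines),
        "comment_density": sum(1 for l in lines if l.lstrip().startswith("#")) / len(lines),
        "indent_depth": max((len(l) - len(l.lstrip())) // 4 for l in lines),
    }

print(style_features("def add_two(a, b):\n    # sum\n    return a + b\n"))
```

A standard classifier trained on such feature vectors could then separate human-written from LLM-paraphrased code and attribute the paraphrasing model.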

Result: LPcodedec outperforms baselines with 2.64% and 15.17% F1 score improvements and achieves 1,343x and 213x speedups for the two detection tasks.

Conclusion: The approach effectively detects LLM-paraphrased code and identifies the specific LLM used, addressing intellectual property protection concerns in code generation.

Abstract: Recent progress in large language models (LLMs) for code generation has raised serious concerns about intellectual property protection. Malicious users can exploit LLMs to produce paraphrased versions of proprietary code that closely resemble the original. While the potential for LLM-assisted code paraphrasing continues to grow, research on detecting it remains limited, underscoring an urgent need for detection systems. We respond to this need by proposing two tasks. The first task is to detect whether code generated by an LLM is a paraphrased version of original human-written code. The second task is to identify which LLM is used to paraphrase the original code. For these tasks, we construct a dataset LPcode consisting of pairs of human-written code and LLM-paraphrased code using various LLMs. We statistically confirm significant differences in the coding styles of human-written and LLM-paraphrased code, particularly in terms of naming consistency, code structure, and readability. Based on these findings, we develop LPcodedec, a detection method that identifies paraphrase relationships between human-written and LLM-generated code, and discovers which LLM was used for the paraphrasing. LPcodedec outperforms the best baselines in both tasks, improving F1 scores by 2.64% and 15.17% while achieving speedups of 1,343x and 213x, respectively. Our code and data are available at https://github.com/Shinwoo-Park/detecting_llm_paraphrased_code_via_coding_style_features.

[229] Climbing the Ladder of Reasoning: What LLMs Can-and Still Can’t-Solve after SFT?

Yiyou Sun, Georgia Zhou, Haoyue Bai, Hao Wang, Dacheng Li, Nouha Dziri, Dawn Song

Main category: cs.AI

TL;DR: Analysis of how SFT enhances math reasoning capabilities using AIME24 dataset, revealing ladder-like difficulty structure and specific requirements for advancing between tiers.

DetailsMotivation: While supervised fine-tuning (SFT) improves language models' mathematical reasoning performance, the specific capabilities enhanced through such fine-tuning remain poorly understood. The paper aims to analyze how reasoning capabilities evolve with SFT.

Method: Conducted detailed analysis of model performance on AIME24 dataset, categorized questions into four difficulty tiers (Easy, Medium, Hard, Extremely Hard), identified specific requirements for advancing between tiers, and examined scaling effects of dataset size.

Result: Discovered ladder-like structure in problem difficulty. Progression from Easy to Medium requires R1 reasoning style with minimal SFT (500-1K instances). Hard-level questions suffer from frequent step errors with accuracy plateauing at ~65%. Exh-level questions require unconventional problem-solving skills that current models uniformly struggle with. Curated small datasets offer limited advantage; scaling dataset size proves more effective.

Conclusion: The analysis provides a clearer roadmap for advancing language model capabilities in mathematical reasoning by identifying specific bottlenecks and requirements for different difficulty tiers, with implications for dataset construction and training strategies.

Abstract: Recent supervised fine-tuning (SFT) approaches have significantly improved language models’ performance on mathematical reasoning tasks, even when models are trained at a small scale. However, the specific capabilities enhanced through such fine-tuning remain poorly understood. In this paper, we conduct a detailed analysis of model performance on the AIME24 dataset to understand how reasoning capabilities evolve. We discover a ladder-like structure in problem difficulty, categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard (Exh)), and identify the specific requirements for advancing between tiers. We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT (500-1K instances), while Hard-level questions suffer from frequent model errors at each step of the reasoning chain, with accuracy plateauing at around 65% despite logarithmic scaling. Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills that current models uniformly struggle with. Additional findings reveal that carefully curated small-scale datasets offer limited advantage; scaling dataset size proves far more effective. Our analysis provides a clearer roadmap for advancing language model capabilities in mathematical reasoning.

[230] SelfBudgeter: Adaptive Token Allocation for Efficient LLM Reasoning

Zheng Li, Qingxiu Dong, Jingyuan Ma, Di Zhang, Kai Jia, Zhifang Sui

Main category: cs.AI

TL;DR: SelfBudgeter is a self-adaptive reasoning strategy that dynamically allocates token budgets based on query complexity, achieving 61% response compression while maintaining accuracy, with user control over reasoning length.

DetailsMotivation: Large reasoning models consume excessive tokens even for simple queries, leading to resource waste and prolonged user latency. There's a need for more efficient and controllable reasoning strategies.

Method: Two-stage approach: 1) Train model to self-estimate required reasoning budget based on query, 2) Introduce budget-guided GPRO for reinforcement learning to maintain accuracy while reducing output length.
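
One way to picture the budget-guided signal is a reward that pays for correctness and charges for overshooting the self-estimated budget. The shaping below is a sketch under our own assumptions, not the paper's exact reward.

```python
# Sketch of a budget-guided reward: reward correctness, penalize output that
# overshoots the model's own budget estimate. Constants are illustrative.
def budget_reward(correct: bool, tokens_used: int, budget: int,
                  alpha: float = 1.0, beta: float = 0.001) -> float:
    accuracy_term = alpha if correct else 0.0
    overshoot = max(0, tokens_used - budget)  # only overshooting is penalized
    return accuracy_term - beta * overshoot

print(budget_reward(correct=True, tokens_used=350, budget=300))  # 1.0 - 0.05 = 0.95
print(budget_reward(correct=True, tokens_used=250, budget=300))  # 1.0
```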

Result: Achieves average response length compression of 61% on math reasoning tasks while maintaining accuracy. Dynamically allocates budgets according to problem complexity. Provides user transparency and control over reasoning length.

Conclusion: SelfBudgeter offers an effective solution for efficient and controllable reasoning, balancing resource efficiency with accuracy while giving users visibility and control over reasoning processes.

Abstract: Recently, large reasoning models have demonstrated exceptional performance on various tasks. However, reasoning models always consume excessive tokens even for simple queries, leading to resource waste and prolonged user latency. To address this challenge, we propose SelfBudgeter, a self-adaptive reasoning strategy for efficient and controllable reasoning. Specifically, we first train the model to self-estimate the required reasoning budget based on the query. We then introduce budget-guided GPRO for reinforcement learning, which effectively maintains accuracy while reducing output length. Experimental results demonstrate that SelfBudgeter dynamically allocates budgets according to problem complexity, achieving an average response length compression of 61% on math reasoning tasks while maintaining accuracy. Furthermore, SelfBudgeter allows users to see how long generation will take and decide whether to continue or stop. Additionally, users can directly control the reasoning length by setting token budgets upfront.

[231] Rethinking Supply Chain Planning: A Generative Paradigm

Jiaheng Yin, Yongzhi Qi, Jianshen Zhang, Dongyang Geng, Zhengyu Chen, Hao Hu, Wei Qi, Zuo-Jun Max Shen

Main category: cs.AI

TL;DR: This paper proposes a Generative AI-powered agentic framework to transform supply chain planning from fragmented static processes into an interactive, integrated cognitive system, achieving 22% planning accuracy improvement and 2% in-stock rate increase at JD.com.

DetailsMotivation: Traditional supply chain planning paradigms are inadequate for modern e-commerce due to fragmented processes, static optimization, dynamic demand, organizational silos, and multi-stage coordination complexity.

Method: Introduces a Generative AI-powered agentic framework that serves as an intelligent cognitive interface, bridging unstructured business contexts with structured analytical workflows to comprehend complex semantics and coordinate decisions across organizational boundaries.

Result: Empirical validation at JD.com shows approximately 22% improvement in planning accuracy and 2% increase in in-stock rates, demonstrating the efficacy of the cognitive paradigm.

Conclusion: The study successfully transforms supply chain planning into an adaptive, knowledge-driven capability through a cognitive paradigm that unifies human strategic intent with adaptive execution, shifting from rigid control to intelligent orchestration.

Abstract: Supply chain planning is the critical process of anticipating future demand and coordinating operational activities across the logistics network. However, within the context of contemporary e-commerce, traditional planning paradigms, typically characterized by fragmented processes and static optimization, prove inadequate in addressing dynamic demand, organizational silos, and the complexity of multi-stage coordination. To address these challenges, this study proposes a fundamental rethinking of supply chain planning, redefining it not merely as a computational task, but as an interactive, integrated, and automated cognitive process. This new paradigm emphasizes the organic unification of human strategic intent with adaptive execution, shifting the focus from rigid control to continuous, intelligent orchestration. To operationalize this conceptual shift, we introduce a Generative AI-powered agentic framework. Functioning as an intelligent cognitive interface, this framework bridges the gap between unstructured business contexts and structured analytical workflows, enabling the system to comprehend complex semantics and coordinate decisions across organizational boundaries. We demonstrate the empirical validity of this approach within JD.com’s large-scale operations. The deployment confirms the efficacy of this cognitive paradigm, yielding an approximate 22% improvement in planning accuracy and a 2% increase in in-stock rates, thereby validating the transformation of planning into an adaptive, knowledge-driven capability.

[232] Learning How to Use Tools, Not Just When: Pattern-Aware Tool-Integrated Reasoning

Ningning Xu, Yuxuan Jiang, Shubhashis Roy Dipta, Hengyuan Zhang

Main category: cs.AI

TL;DR: A pattern-aware approach for tool-integrated reasoning that improves code usage and accuracy by aligning calculator vs algorithmic pattern selection with teacher preferences.

DetailsMotivation: Prior work on tool-integrated reasoning mainly focused on when to invoke tools, overlooking how tools are applied. Misaligned pattern choices (calculator vs algorithmic) often cause failures even when reasoning is sound.

Method: Two-stage framework: 1) builds code competence from both calculator pattern (direct computation) and algorithmic pattern (encoding problems as programs), 2) aligns pattern selection with teacher preferences.
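
The two patterns are easiest to see side by side. Both snippets below answer the same toy question ("sum of the first 10 squares"), one as a one-shot calculation and one as an explicit algorithm.

```python
# The same toy question written in the two tool-use patterns the paper
# distinguishes.

# Calculator pattern: code as a one-shot arithmetic evaluator.
print(sum(i * i for i in range(1, 11)))  # 385

# Algorithmic pattern: the problem is encoded as an explicit procedure,
# which generalizes beyond the single instance.
def sum_of_squares(n: int) -> int:
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total

print(sum_of_squares(10))  # 385
```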

Result: Substantial improvements on challenging math datasets: Code@1 on MATH500 from 64.0% to 70.5%, and on AIME24 from 26.7% to 50.0%.

Conclusion: Pattern-aware approach is effective for tool-integrated reasoning, highlighting the importance of considering how tools are applied, not just when to invoke them.

Abstract: Tool-integrated reasoning (TIR) has become a key approach for improving large reasoning models (LRMs) on complex problems. Prior work has mainly studied when to invoke tools, while overlooking how tools are applied. We identify two common patterns: a calculator pattern that uses code for direct computation, and an algorithmic pattern that encodes problems as programs. Misaligned choices often cause failures even when reasoning is sound. We propose a two-stage framework that first builds code competence from both patterns and then aligns pattern selection with teacher preferences. Across challenging math datasets, our pattern-aware method substantially improves both code usage and accuracy, for instance raising Code@1 on MATH500 from 64.0% to 70.5% and on AIME24 from 26.7% to 50.0%. These gains highlight the effectiveness of a pattern-aware approach for tool-integrated reasoning.

[233] TDHook: A Lightweight Framework for Interpretability

Yoann Poupart

Main category: cs.AI

TL;DR: TDHook is a lightweight, generic interpretability framework for PyTorch models that handles complex composed architectures across domains like CV, NLP, and DRL, offering attribution, probing, and intervention methods with better performance than existing tools.

DetailsMotivation: Existing interpretability frameworks struggle with complex models that have multiple inputs/outputs or use composable networks (like image captioning or DRL), as they don't fit well into standard APIs.

Method: Built on tensordict, TDHook provides a flexible get-set API for interventions, ready-to-use attribution and probing methods, and is designed to handle composed models across different domains.
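
The get-set intervention pattern can be illustrated with vanilla PyTorch forward hooks; this is the generic mechanism such frameworks streamline, not TDHook's own API.

```python
# "Get" reads an activation into a cache; "set" overwrites it (an ablation).
# Plain torch hooks, shown to illustrate the pattern TDHook wraps.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
cache = {}

def get_hook(module, inputs, output):
    cache["hidden"] = output.detach()  # "get": read an activation

def set_hook(module, inputs, output):
    return torch.zeros_like(output)    # "set": returning a value replaces the output

h1 = model[1].register_forward_hook(get_hook)
out_clean = model(torch.randn(1, 4))
h1.remove()

h2 = model[1].register_forward_hook(set_hook)
out_ablated = model(torch.randn(1, 4))
h2.remove()

print(cache["hidden"].shape, out_clean.shape, out_ablated.shape)
```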

Result: TDHook requires half the disk space of transformer_lens and achieves up to 2x speed-up over captum for integrated gradients on multi-target pipelines on both CPU and GPU.

Conclusion: TDHook bridges the gap between different interpretability method classes, making modern interpretability pipelines more accessible for complex models across various domains.

Abstract: Interpretability of Deep Neural Networks (DNNs) is a growing field driven by the study of vision and language models. Yet, some use cases, like image captioning, or domains like Deep Reinforcement Learning (DRL), require complex modelling, with multiple inputs and outputs or use composable and separated networks. As a consequence, they rarely fit natively into the API of popular interpretability frameworks. We thus present TDHook, an open-source, lightweight, generic interpretability framework based on $\texttt{tensordict}$ and applicable to any $\texttt{torch}$ model. It focuses on handling complex composed models which can be trained for Computer Vision, Natural Language Processing, Reinforcement Learning or any other domain. This library features ready-to-use methods for attribution, probing and a flexible get-set API for interventions, and aims to bridge the gap between these method classes to make modern interpretability pipelines more accessible. TDHook is designed with minimal dependencies, requiring roughly half as much disk space as $\texttt{transformer_lens}$, and, in our controlled benchmark, achieves up to a $\times$2 speed-up over $\texttt{captum}$ when running integrated gradients for multi-target pipelines on both CPU and GPU. In addition, to demonstrate the value of our work, we showcase concrete use cases of our library with composed interpretability pipelines in Computer Vision (CV) and Natural Language Processing (NLP), as well as with complex models in DRL.

[234] ContractEval: A Benchmark for Evaluating Contract-Satisfying Assertions in Code Generation

Soohan Lim, Joonghyuk Hahn, Hyunwoo Park, Sang-Ki Ko, Yo-Sub Han

Main category: cs.AI

TL;DR: ContractEval is a new benchmark for evaluating whether generated code properly rejects contract-violating inputs by triggering assertions, addressing a gap in current code generation benchmarks that only test functional correctness on well-formed inputs.

DetailsMotivation: Current code generation benchmarks only measure functional correctness on well-formed inputs, leaving a gap where generated programs may appear correct but fail to satisfy contracts (assertion-level validity constraints for rejecting ill-formed inputs).

Method: ContractEval builds on HumanEval+ and MBPP+ by augmenting each task with contract-violation tests derived from reference assertions using a neuro-symbolic pipeline: an LLM converts assertion clauses into constraints, and an SMT solver enumerates satisfiable violation combinations to generate inputs that violate selected clauses while satisfying the rest.
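
The SMT step can be sketched with z3 (assuming the z3-solver package): pick a subset of clauses to violate, negate them, keep the rest, and ask for a model. The contract and variables below are illustrative, not from the benchmark.

```python
# Enumerate inputs that violate chosen contract clauses while satisfying the
# rest. Toy contract: 0 <= x <= 100 and x even.
from itertools import combinations
from z3 import Int, Solver, Not, sat

x = Int("x")
clauses = [x >= 0, x <= 100, x % 2 == 0]

for k in range(1, len(clauses) + 1):
    for violated in combinations(range(len(clauses)), k):
        s = Solver()
        for i, c in enumerate(clauses):
            s.add(Not(c) if i in violated else c)
        if s.check() == sat:  # satisfiable violation combination
            print(f"violate {violated}: x = {s.model()[x]}")
```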

Result: Across five code LLMs, standard prompting yields 0% contract satisfaction, while adding a few contract-violation examples boosts contract satisfaction to 49-53% while maintaining pass@1 at 92% of the original.

Conclusion: ContractEval addresses a critical gap in code generation evaluation by measuring contract satisfaction, and shows that including contract-violation examples significantly improves model performance on rejecting ill-formed inputs while preserving functional correctness.

Abstract: Current code generation benchmarks measure functional correctness on well-formed inputs, as test cases are curated to satisfy input preconditions. This leaves a gap: generated programs may appear correct but fail to satisfy contracts – assertion-level validity constraints for rejecting ill-formed inputs. We introduce ContractEval, a benchmark for evaluating contract-satisfying assertions in code generation, i.e., whether code rejects contract-violating inputs by triggering intended assertions. Built on HumanEval+ and MBPP+, ContractEval augments each task with contract-violation tests derived from reference assertions. We synthesize these via a neuro-symbolic pipeline: an LLM converts assertion clauses into constraints, and an SMT solver enumerates satisfiable violation combinations to generate inputs that violate selected clauses while satisfying the rest. Across five code LLMs, standard prompting yields 0% contract satisfaction, while adding a few contract-violation examples boosts contract satisfaction to 49–53% while maintaining pass@1 at 92% of the original. Our code is available at https://github.com/suhanmen/ContractEval.

[235] See or Say Graphs: Agent-Driven Scalable Graph Structure Understanding with Vision-Language Models

Shuo Han, Yukun Cao, Zezhong Ding, Zengyi Gao, S Kevin Zhou, Xike Xie

Main category: cs.AI

TL;DR: GraphVista is a unified framework that enhances graph structure understanding by improving scalability through hierarchical GraphRAG organization and modality coordination via a planning agent that routes tasks to text or visual modalities.

DetailsMotivation: Current vision-language models for graph understanding face scalability bottlenecks due to input-token constraints and lack effective mechanisms to coordinate textual and visual modalities, limiting their ability to handle large graphs and fully exploit both modalities.

Method: GraphVista uses a hierarchical GraphRAG base to organize graph information, retrieving only task-relevant textual descriptions and high-resolution visual subgraphs. It introduces a planning agent that decomposes tasks and routes them to the most suitable modality: text for explicit graph properties and visual for local graph structure reasoning.
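
A toy version of the routing decision, with a keyword rule standing in for the planning agent's learned policy; the keyword lists are illustrative assumptions.

```python
# Explicit property queries go to the text modality; local-structure queries
# go to the visual one. A rule-based stand-in for the planning agent.
TEXT_KEYWORDS = ("degree", "edge", "neighbor count", "node count")
VISUAL_KEYWORDS = ("cycle", "path between", "cluster", "triangle")

def route(query: str) -> str:
    q = query.lower()
    if any(k in q for k in TEXT_KEYWORDS):
        return "text"    # read directly from the textual description
    if any(k in q for k in VISUAL_KEYWORDS):
        return "visual"  # reason over a rendered subgraph
    return "text"        # default to the cheaper modality

print(route("What is the degree of node 7?"))        # -> text
print(route("Is there a cycle through nodes 2-5?"))  # -> visual
```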

Result: GraphVista scales to graphs up to 200× larger than existing benchmarks and consistently outperforms textual, visual, and fusion-based methods, achieving up to 4.4× quality improvement over state-of-the-art baselines by fully exploiting complementary strengths of both modalities.

Conclusion: GraphVista successfully addresses scalability and modality coordination challenges in graph structure understanding, demonstrating that hierarchical organization and intelligent modality routing can significantly enhance performance on large-scale graph reasoning tasks.

Abstract: Vision-language models (VLMs) have shown promise in graph structure understanding, but remain limited by input-token constraints, facing scalability bottlenecks and lacking effective mechanisms to coordinate textual and visual modalities. To address these challenges, we propose GraphVista, a unified framework that enhances both scalability and modality coordination in graph structure understanding. For scalability, GraphVista organizes graph information hierarchically into a lightweight GraphRAG base, which retrieves only task-relevant textual descriptions and high-resolution visual subgraphs, compressing redundant context while preserving key reasoning elements. For modality coordination, GraphVista introduces a planning agent that decomposes and routes tasks to the most suitable modality: the text modality for direct access to explicit graph properties, and the visual modality for local graph structure reasoning grounded in explicit topology. Extensive experiments demonstrate that GraphVista scales to large graphs, up to 200$\times$ larger than those used in existing benchmarks, and consistently outperforms existing textual, visual, and fusion-based methods, achieving up to 4.4$\times$ quality improvement over the state-of-the-art baselines by fully exploiting the complementary strengths of both modalities.

[236] Visual Attention Reasoning via Hierarchical Search and Self-Verification

Wei Cai, Jian Zhao, Yuchen Yuan, Tianle Zhang, Ming Zhu, Haichuan Tang, Xuelong Li

Main category: cs.AI

TL;DR: VAR is a reinforcement learning framework that reduces hallucinations in MLLMs by using hierarchical search with self-verification and explicit visual grounding through bounding boxes.

DetailsMotivation: Multimodal Large Language Models (MLLMs) suffer from frequent hallucinations due to fragile linear reasoning and weak visual grounding, which undermines their reliability and safety.

Method: VAR reformulates reasoning as hierarchical search with self-verification, generates explicit bounding boxes for traceable evidence grounding using a novel reward function (geometric precision + semantic sufficiency), and replaces linear Chain-of-Thought with tree-search policy that allows backtracking to correct errors.
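
The two reward components can be sketched directly: IoU for geometric precision plus a semantic-sufficiency score. The combination weights below are our assumption, not VAR's exact formulation.

```python
# Grounding reward = geometric precision (IoU) + semantic sufficiency.
def iou(a: tuple, b: tuple) -> float:
    """Boxes as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def grounding_reward(pred_box, gt_box, semantic_score: float,
                     w_geo: float = 0.5, w_sem: float = 0.5) -> float:
    # semantic_score would come from a model judging whether the cited
    # evidence suffices; here it is just an input.
    return w_geo * iou(pred_box, gt_box) + w_sem * semantic_score

print(grounding_reward((0, 0, 10, 10), (5, 5, 15, 15), semantic_score=0.8))
```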

Result: Theoretical analysis validates the framework’s reliability, and extensive experiments show VAR significantly outperforms state-of-the-art methods on complex hallucination and safety benchmarks.

Conclusion: VAR provides an effective solution to reduce hallucinations in MLLMs through structured reasoning with explicit visual grounding and error-correction capabilities.

Abstract: Multimodal Large Language Models (MLLMs) frequently hallucinate due to their reliance on fragile, linear reasoning and weak visual grounding. We propose Visual Attention Reasoning (VAR), a reinforcement learning framework that reformulates reasoning as a hierarchical search with self-verification. VAR enforces traceable evidence grounding by generating explicit bounding boxes, guided by a novel reward function combining geometric precision and semantic sufficiency. Furthermore, it replaces linear Chain-of-Thought with a tree-search policy capable of backtracking to correct logical errors. Theoretical analysis validates the framework’s reliability, and extensive experiments demonstrate that VAR significantly outperforms state-of-the-art methods on complex hallucination and safety benchmarks.

[237] On the Emergence of Induction Heads for In-Context Learning

Tiberiu Musat, Tiago Pimentel, Lorenzo Noci, Alessandro Stolfo, Mrinmaya Sachan, Thomas Hofmann

Main category: cs.AI

TL;DR: The paper studies induction heads in transformers, revealing their simple weight structure, proving training dynamics are constrained to a 19D subspace with only 3 dimensions responsible for emergence, and finding emergence time follows quadratic asymptotic bound in context length.

DetailsMotivation: To understand the emergence of induction heads - a key mechanism for in-context learning in transformers - by analyzing their weight structure and training dynamics.

Method: Theoretical analysis using minimal ICL task formulation and modified transformer architecture, formal proof of training dynamics constraints, and empirical validation of subspace dimensionality.
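
The behavior an induction head implements can be written as a one-line next-token rule: find the previous occurrence of the current token and copy its successor ([A, B, ..., A] -> B). The toy function below mimics that mechanism, not the paper's trained model.

```python
# Toy induction rule: scan backwards for the current token's last occurrence
# and predict the token that followed it.
def induction_predict(tokens: list[str]) -> str | None:
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan backwards for a match
        if tokens[i] == current:
            return tokens[i + 1]              # copy the token that followed
    return None

print(induction_predict(["the", "cat", "sat", "the"]))  # -> "cat"
```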

Result: Discovered simple interpretable structure of induction head weight matrices, proved training dynamics constrained to 19D subspace (with only 3D responsible for emergence), and found emergence time follows quadratic asymptotic bound in context length.

Conclusion: Induction heads emerge through constrained training dynamics in a low-dimensional subspace, with predictable emergence time scaling quadratically with context length, providing theoretical understanding of this important in-context learning mechanism.

Abstract: Transformers have become the dominant architecture for natural language processing. Part of their success is owed to a remarkable capability known as in-context learning (ICL): they can acquire and apply novel associations solely from their input context, without any updates to their weights. In this work, we study the emergence of induction heads, a previously identified mechanism in two-layer transformers that is particularly important for in-context learning. We uncover a relatively simple and interpretable structure of the weight matrices implementing the induction head. We theoretically explain the origin of this structure using a minimal ICL task formulation and a modified transformer architecture. We give a formal proof that the training dynamics remain constrained to a 19-dimensional subspace of the parameter space. Empirically, we validate this constraint while observing that only 3 dimensions account for the emergence of an induction head. By further studying the training dynamics inside this 3-dimensional subspace, we find that the time until the emergence of an induction head follows a tight asymptotic bound that is quadratic in the input context length.

[238] SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models

Zhongjian Miao, Hao Fu, Chen Wei

Main category: cs.AI

TL;DR: SPAN is a cross-calendar temporal reasoning benchmark requiring LLMs to perform intra-calendar reasoning and inter-calendar conversion across six calendars, with LLMs achieving only 34.5% accuracy, but a tool-augmented Time Agent reaches 95.31%.

DetailsMotivation: Current LLMs struggle with cross-calendar temporal reasoning, which is essential for temporally and culturally adaptive AI systems. The authors aim to create a comprehensive benchmark to evaluate and improve LLMs' ability to reason across different calendar systems.

Method: The authors introduce SPAN benchmark with ten cross-calendar reasoning directions, two reasoning types, and two question formats across six calendars. They use a template-driven protocol for dynamic instance generation to enable time-variant, contamination-free evaluation. They also develop Time Agent, an LLM-powered system using tool-augmented code generation to solve these tasks.
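
Inter-calendar conversion is exactly the kind of routine a tool-augmented Time Agent can generate. Below is a self-contained example using the standard floor-division Julian Day Number formulas; it is a generic conversion, not SPAN's or the Time Agent's actual code.

```python
# Julian-calendar date -> Julian Day Number -> Gregorian date.
from datetime import date

def julian_to_jdn(year: int, month: int, day: int) -> int:
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

def jdn_to_gregorian(jdn: int) -> date:
    # Proleptic Gregorian 0001-01-01 has JDN 1721426 and ordinal 1.
    return date.fromordinal(jdn - 1721425)

# Julian-calendar 2000-01-01 falls on Gregorian 2000-01-14 (13-day offset).
print(jdn_to_gregorian(julian_to_jdn(2000, 1, 1)))  # -> 2000-01-14
```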

Result: State-of-the-art LLMs achieve only 34.5% average accuracy on SPAN, with none exceeding 80%. The authors identify Future-Date Degradation and Calendar Asymmetry Bias as key obstacles. Their Time Agent achieves 95.31% accuracy, significantly outperforming baselines.

Conclusion: Cross-calendar temporal reasoning remains challenging for current LLMs, but tool-augmented code generation shows strong potential to address this limitation. The work highlights the need for more temporally and culturally adaptive LLMs.

Abstract: We introduce SPAN, a cross-calendar temporal reasoning benchmark, which requires LLMs to perform intra-calendar temporal reasoning and inter-calendar temporal conversion. SPAN features ten cross-calendar temporal reasoning directions, two reasoning types, and two question formats across six calendars. To enable time-variant and contamination-free evaluation, we propose a template-driven protocol for dynamic instance generation that enables assessment on a user-specified Gregorian date. We conduct extensive experiments on both open- and closed-source state-of-the-art (SOTA) LLMs over a range of dates spanning 100 years from 1960 to 2060. Our evaluations show that these LLMs achieve an average accuracy of only 34.5%, with none exceeding 80%, indicating that this task remains challenging. Through in-depth analysis of reasoning types, question formats, and temporal reasoning directions, we identify two key obstacles for LLMs: Future-Date Degradation and Calendar Asymmetry Bias. To strengthen LLMs’ cross-calendar temporal reasoning capability, we further develop an LLM-powered Time Agent that leverages tool-augmented code generation. Empirical results show that Time Agent achieves an average accuracy of 95.31%, outperforming several competitive baselines, highlighting the potential of tool-augmented code generation to advance cross-calendar temporal reasoning. We hope this work will inspire further efforts toward more temporally and culturally adaptive LLMs.

[239] Darth Vecdor: An Open-Source System for Generating Knowledge Graphs Through Large Language Model Queries

Jonathan A. Handler

Main category: cs.AI

TL;DR: Darth Vecdor (DV) is an open-source tool that extracts structured knowledge from LLMs into SQL databases/knowledge graphs, addressing LLM limitations like errors and inconsistency, with a focus on healthcare applications.

DetailsMotivation: While LLMs contain vast knowledge, direct querying has issues: cost, speed, safety, and confidence concerns in high-volume operations. Structured knowledge extraction into databases could mitigate these problems, especially for healthcare applications.

Method: DV extracts knowledge from LLMs into structured, terminology-mapped SQL databases/knowledge graphs. It addresses LLM response issues (erroneous, off-topic, free-text, inconsistent) and allows multi-element responses. Features a browser-based GUI for non-technical domain experts to do prompt engineering.

Result: DV has been released as free, open-source, extensible software with a simple GUI. It’s provided “as is” without warranties, acknowledging potential bugs and risks, but aims to help improve healthcare through appropriate use.

Conclusion: DV offers a structured approach to LLM knowledge extraction that addresses key limitations, making it potentially valuable for healthcare applications despite acknowledged risks and the need for responsible use.

Abstract: Many large language models (LLMs) are trained on a massive body of knowledge present on the Internet. Darth Vecdor (DV) was designed to extract this knowledge into a structured, terminology-mapped, SQL database (“knowledge base” or “knowledge graph”). Knowledge graphs may be useful in many domains, including healthcare. Although one might query an LLM directly rather than a SQL-based knowledge graph, concerns such as cost, speed, safety, and confidence may arise, especially in high-volume operations. These may be mitigated when the information is pre-extracted from the LLM and becomes query-able through a standard database. However, the author found the need to address several issues. These included erroneous, off-topic, free-text, overly general, and inconsistent LLM responses, as well as allowing for multi-element responses. DV was built with features intended to mitigate these issues. To facilitate ease of use, and to allow for prompt engineering by those with domain expertise but little technical background, DV provides a simple, browser-based graphical user interface. DV has been released as free, open-source, extensible software, on an “as is” basis, without warranties or conditions of any kind, either express or implied. Users need to be cognizant of the potential risks and benefits of using DV and its outputs, and users are responsible for ensuring any use is safe and effective. DV should be assumed to have bugs, potentially very serious ones. However, the author hopes that appropriate use of current and future versions of DV and its outputs can help improve healthcare.

[240] Monadic Context Engineering

Yifan Zhang, Yang Yuan, Mengdi Wang, Andrew Chi-Chih Yao

Main category: cs.AI

TL;DR: MCE introduces a formal monadic framework for building robust AI agents by leveraging algebraic structures (Functors, Applicatives, Monads) to manage state, errors, and concurrency systematically.

DetailsMotivation: Current LLM-based agent architectures use brittle, ad hoc patterns that struggle with state management, error handling, and concurrency, leading to unreliable systems.

Method: Monadic Context Engineering (MCE) treats agent workflows as computational contexts using algebraic structures: Monads for sequential composition, Applicatives for parallel execution, and Monad Transformers for composing capabilities.
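
The short-circuiting composition at the heart of the proposal can be sketched as a small Result monad in Python; this is a toy illustration of the algebra, not the paper's full construction.

```python
# A Result monad: bind() chains steps and short-circuits on the first error,
# so later steps never run once a step has failed.
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class Result:
    value: Any = None
    error: Optional[str] = None

    def bind(self, f: Callable[[Any], "Result"]) -> "Result":
        if self.error is not None:  # short-circuit: skip later steps
            return self
        return f(self.value)

def plan(task: str) -> Result:
    return Result(value=f"plan({task})")

def call_tool(plan_text: str) -> Result:
    return Result(error="tool timeout")  # simulate a failing step

def summarize(output: str) -> Result:
    return Result(value=f"summary({output})")

pipeline = plan("book flight").bind(call_tool).bind(summarize)
print(pipeline.error)  # -> "tool timeout"; summarize was never run
```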

Result: MCE enables construction of complex, resilient AI agents from simple, verifiable components, and extends to Meta-Agents for generative orchestration of sub-agent workflows.

Conclusion: MCE provides a formal, algebraic foundation for agent design that addresses brittleness in current architectures through systematic management of cross-cutting concerns.

Abstract: The proliferation of Large Language Models (LLMs) has catalyzed a shift towards autonomous agents capable of complex reasoning and tool use. However, current agent architectures are frequently constructed using imperative, ad hoc patterns. This results in brittle systems plagued by difficulties in state management, error handling, and concurrency. This paper introduces Monadic Context Engineering (MCE), a novel architectural paradigm leveraging the algebraic structures of Functors, Applicative Functors, and Monads to provide a formal foundation for agent design. MCE treats agent workflows as computational contexts where cross-cutting concerns, such as state propagation, short-circuiting error handling, and asynchronous execution, are managed intrinsically by the algebraic properties of the abstraction. We demonstrate how Monads enable robust sequential composition, how Applicatives provide a principled structure for parallel execution, and crucially, how Monad Transformers allow for the systematic composition of these capabilities. This layered approach enables developers to construct complex, resilient, and efficient AI agents from simple, independently verifiable components. We further extend this framework to describe Meta-Agents, which leverage MCE for generative orchestration, dynamically creating and managing sub-agent workflows through metaprogramming.

[241] What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

Main category: cs.AI

TL;DR: The paper proposes JEPA-WMs, a family of world models that plan in learned representation space rather than input space, and conducts a comprehensive study to identify optimal architectural choices for improved planning efficiency in physical tasks.

DetailsMotivation: To develop AI agents capable of solving diverse physical tasks and generalizing to unseen tasks/environments by improving world model planning efficiency through representation space optimization rather than input space planning.

Method: Characterizes JEPA-WMs (Joint Embedding Predictive Architecture World Models) and systematically studies key components: model architecture, training objectives, and planning algorithms. Experiments conducted in simulated environments and with real-world robotic data.
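
Planning in representation space reduces to searching for action sequences whose predicted latent endpoint lands near the goal embedding. Below is a random-shooting sketch with toy linear dynamics standing in for a trained predictor; both the planner choice and the dynamics are our assumptions.

```python
# Latent-space planning sketch: encode start and goal, roll candidate action
# sequences through a latent predictor, keep the best one.
import numpy as np

rng = np.random.default_rng(0)
D, A, H, N = 8, 2, 5, 256            # latent dim, action dim, horizon, candidates
W_z = rng.normal(size=(D, D)) * 0.1  # stand-ins for a trained predictor
W_a = rng.normal(size=(A, D)) * 0.1

def predict(z: np.ndarray, a: np.ndarray) -> np.ndarray:
    """One latent step z' = f(z, a); a real JEPA-WM learns this map."""
    return z + z @ W_z + a @ W_a

def plan(z0: np.ndarray, z_goal: np.ndarray) -> np.ndarray:
    actions = rng.uniform(-1, 1, size=(N, H, A))  # random shooting
    z = np.repeat(z0[None], N, axis=0)
    for t in range(H):
        z = predict(z, actions[:, t])
    best = np.argmin(np.linalg.norm(z - z_goal, axis=1))
    return actions[best]                          # best open-loop sequence

z0, z_goal = rng.normal(size=D), rng.normal(size=D)
print(plan(z0, z_goal).shape)  # (5, 2): H actions of dim A
```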

Result: Proposed model outperforms established baselines DINO-WM and V-JEPA-2-AC in both navigation and manipulation tasks. Code, data, and checkpoints are publicly available.

Conclusion: Planning in learned representation space (JEPA-WMs) offers more efficient planning by abstracting irrelevant details, and systematic optimization of architectural components leads to superior performance compared to existing methods.

Abstract: A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.

[242] Accelerating Monte-Carlo Tree Search with Optimized Posterior Policies

Keith Frankston, Benjamin Howard

Main category: cs.AI

TL;DR: RMCTS is a faster AlphaZero-style MCTS algorithm using breadth-first search and recursive posterior policy computation, achieving 40x speedup for single root states and matching MCTS-UCB quality in 1/3 training time.

DetailsMotivation: To overcome the GPU latency bottleneck in AlphaZero's MCTS-UCB by developing a faster Monte Carlo tree search algorithm that maintains similar quality while significantly reducing computational time.

Method: Recursive MCTS (RMCTS) uses breadth-first tree exploration with batched network inferences, computing optimized posterior policies recursively from leaves to root based on Grill et al.’s regularized policy optimization framework, with trees defined by prior network policies rather than adaptive selection.
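
The posterior policy reported in Grill et al. has the closed form pi_bar(a) proportional to lam * prior(a) / (alpha - q(a)), with alpha a normalizing constant. Since the normalization sum is strictly decreasing in alpha, it can be found by bisection; a sketch, assuming strictly positive priors:

```python
# Bisection for the normalizing constant alpha of the optimized posterior,
# using the bracketing bounds reported in Grill et al.
import numpy as np

def posterior_policy(prior: np.ndarray, q: np.ndarray, lam: float,
                     iters: int = 64) -> np.ndarray:
    lo = np.max(q + lam * prior)  # normalization sum >= 1 here
    hi = np.max(q) + lam          # normalization sum <= 1 here
    for _ in range(iters):
        alpha = 0.5 * (lo + hi)
        total = np.sum(lam * prior / (alpha - q))
        lo, hi = (alpha, hi) if total > 1.0 else (lo, alpha)
    pi = lam * prior / (alpha - q)
    return pi / pi.sum()

prior = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.2])
print(posterior_policy(prior, q, lam=0.5))  # mass shifts toward high-q actions
```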

Result: RMCTS achieves 40+ times faster search for single root states and 3x faster for batch searches, with RMCTS-trained networks matching MCTS-UCB quality in one-third the training time across Connect-4, Dots-and-Boxes, and Othello games.

Conclusion: RMCTS provides significant speed advantages over MCTS-UCB while maintaining comparable network quality, making it a practical alternative for training AlphaZero-style agents with reduced computational costs.

Abstract: We introduce a recursive AlphaZero-style Monte–Carlo tree search algorithm, “RMCTS”. The advantage of RMCTS over AlphaZero’s MCTS-UCB is speed. In RMCTS, the search tree is explored in a breadth-first manner, so that network inferences naturally occur in large batches. This significantly reduces the GPU latency cost. We find that RMCTS is often more than 40 times faster than MCTS-UCB when searching a single root state, and about 3 times faster when searching a large batch of root states. The recursion in RMCTS is based on computing optimized posterior policies at each game state in the search tree, starting from the leaves and working back up to the root. Here we use the posterior policy explored in “Monte–Carlo tree search as regularized policy optimization” (Grill et al.). Their posterior policy is the unique policy which maximizes the expected reward given estimated action rewards minus a penalty for diverging from the prior policy. The tree explored by RMCTS is not defined in an adaptive manner, as it is in MCTS-UCB. Instead, the RMCTS tree is defined by following prior network policies at each node. This is a disadvantage, but the speedup advantage is more significant, and in practice we find that RMCTS-trained networks match the quality of MCTS-UCB-trained networks in roughly one-third of the training time. We include timing and quality comparisons of RMCTS vs. MCTS-UCB for three games: Connect-4, Dots-and-Boxes, and Othello.

[243] EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning

Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, Yafeng Deng

Main category: cs.AI

TL;DR: EverMemOS is a self-organizing memory operating system for LLMs that implements an engram-inspired lifecycle to maintain coherent behavior over extended interactions through episodic trace formation, semantic consolidation, and reconstructive recollection.

DetailsMotivation: LLMs deployed as long-term interactive agents face limitations due to finite context windows, making it difficult to sustain coherent behavior over extended interactions. Existing memory systems store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts.

Method: EverMemOS implements an engram-inspired lifecycle with three key components: 1) Episodic Trace Formation converts dialogue streams into MemCells capturing episodic traces, atomic facts, and time-bounded Foresight signals; 2) Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles; 3) Reconstructive Recollection performs MemScene-guided agentic retrieval to compose necessary and sufficient context for downstream reasoning.
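
Bare-bones shapes for the lifecycle's data, following the terminology above; the structure is our simplification for illustration, not EverMemOS's actual schema.

```python
# MemCells (episodic traces with facts and foresight) consolidated into
# thematic MemScenes.
from dataclasses import dataclass, field

@dataclass
class MemCell:
    episode: str                  # episodic trace of a dialogue span
    facts: list[str]              # atomic facts extracted from it
    foresight: str | None = None  # time-bounded expectation, if any
    topic: str = "general"

@dataclass
class MemScene:
    topic: str
    cells: list[MemCell] = field(default_factory=list)

def consolidate(cells: list[MemCell]) -> dict[str, MemScene]:
    scenes: dict[str, MemScene] = {}
    for cell in cells:
        scenes.setdefault(cell.topic, MemScene(cell.topic)).cells.append(cell)
    return scenes

cells = [MemCell("user ran a 10k", ["user runs"], topic="fitness"),
         MemCell("user signed up for a marathon", ["race in May"],
                 foresight="ask about training before May", topic="fitness")]
print(len(consolidate(cells)["fitness"].cells))  # -> 2
```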

Result: Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state-of-the-art performance on memory-augmented reasoning tasks. Additional profile study on PersonaMem v2 and qualitative case studies demonstrate chat-oriented capabilities such as user profiling and Foresight.

Conclusion: EverMemOS provides an effective memory operating system for LLMs that enables sustained coherent behavior over extended interactions through its engram-inspired lifecycle approach, outperforming existing methods on memory-augmented reasoning tasks while offering practical chat-oriented capabilities.

Abstract: Large Language Models (LLMs) are increasingly deployed as long-term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems often store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts. We introduce EverMemOS, a self-organizing memory operating system that implements an engram-inspired lifecycle for computational memory. Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time-bounded Foresight signals. Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs MemScene-guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state-of-the-art performance on memory-augmented reasoning tasks. We further report a profile study on PersonaMem v2 and qualitative case studies illustrating chat-oriented capabilities such as user profiling and Foresight. Code is available at https://github.com/EverMind-AI/EverMemOS.

[244] Architecting Agentic Communities using Design Patterns

Zoran Milosevic, Fethi Rabhi

Main category: cs.AI

TL;DR: The paper presents a formal architectural framework for building production-grade AI agent systems using enterprise design patterns, with focus on Agentic Communities where AI agents and humans coordinate through governed ecosystems.

DetailsMotivation: The rapid evolution of LLMs and Agentic AI requires systematic architectural guidance for building sophisticated, production-grade systems that can be deployed in enterprise and industrial applications.

Method: The approach uses design patterns derived from enterprise distributed systems standards, formal methods, and industry practice, classifying them into three tiers: LLM Agents, Agentic AI, and Agentic Communities. The paper focuses on Agentic Communities, grounding patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems.

Result: The framework provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. The approach is validated through a clinical trial matching case study.

Conclusion: The paper provides actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems, bridging the gap between practical implementation needs and formal verification requirements for production-grade AI agent systems.

Abstract: The rapid evolution of Large Language Models (LLM) and subsequent Agentic AI technologies requires systematic architectural guidance for building sophisticated, production-grade systems. This paper presents an approach for architecting such systems using design patterns derived from enterprise distributed systems standards, formal methods, and industry practice. We classify these patterns into three tiers: LLM Agents (task-specific automation), Agentic AI (adaptive goal-seekers), and Agentic Communities (organizational frameworks where AI agents and human participants coordinate through formal roles, protocols, and governance structures). We focus on Agentic Communities - coordination frameworks encompassing LLM Agents, Agentic AI entities, and humans - most relevant for enterprise and industrial applications. Drawing on established coordination principles from distributed systems, we ground these patterns in a formal framework that specifies collaboration agreements where AI agents and humans fill roles within governed ecosystems. This approach provides both practical guidance and formal verification capabilities, enabling expression of organizational, legal, and ethical rules through accountability mechanisms that ensure operational and verifiable governance of inter-agent communication, negotiation, and intent modeling. We validate this framework through a clinical trial matching case study. Our goal is to provide actionable guidance to practitioners while maintaining the formal rigor essential for enterprise deployment in dynamic, multi-agent ecosystems.

[245] AlgBench: To What Extent Do Large Reasoning Models Understand Algorithms?

Henan Sun, Kaichi Yu, Yuyao Wang, Bowen Liu, Xunkai Li, Rong-Hua Li, Nuo Chen, Jia Li

Main category: cs.AI

TL;DR: AlgBench is a new expert-curated benchmark with 3,000+ problems across 27 algorithms that reveals LRMs struggle with globally optimized algorithms like dynamic programming, showing only 49% accuracy compared to 92% on non-optimized tasks.

DetailsMotivation: Existing benchmarks for algorithmic reasoning are limited and fail to answer whether Large Reasoning Models truly master algorithmic reasoning. There's a need for a comprehensive, algorithm-centric evaluation framework.

Method: Created AlgBench - an expert-curated benchmark with over 3,000 original problems spanning 27 algorithms, organized under a comprehensive taxonomy (Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, heuristic-optimized). Evaluated leading LRMs including Gemini-3-Pro, DeepSeek-v3.2-Speciale, and GPT-o3.

Result: Substantial performance heterogeneity: models perform well on non-optimized tasks (up to 92%), but accuracy drops sharply to around 49% on globally optimized algorithms like dynamic programming. Discovered “strategic over-shifts” where models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens.

Conclusion: The findings expose fundamental limitations of problem-centric reinforcement learning and highlight the necessity of an algorithm-centric training paradigm for robust algorithmic reasoning in Large Reasoning Models.

Abstract: Reasoning ability has become a central focus in the advancement of Large Reasoning Models (LRMs). Although notable progress has been achieved on several reasoning benchmarks such as MATH500 and LiveCodeBench, existing benchmarks for algorithmic reasoning remain limited, failing to answer a critical question: Do LRMs truly master algorithmic reasoning? To answer this question, we propose AlgBench, an expert-curated benchmark that evaluates LRMs under an algorithm-centric paradigm. AlgBench consists of over 3,000 original problems spanning 27 algorithms, constructed by ACM algorithmic experts and organized under a comprehensive taxonomy, including Euclidean-structured, non-Euclidean-structured, non-optimized, local-optimized, global-optimized, and heuristic-optimized categories. Empirical evaluations on leading LRMs (e.g., Gemini-3-Pro, DeepSeek-v3.2-Speciale and GPT-o3) reveal substantial performance heterogeneity: while models perform well on non-optimized tasks (up to 92%), accuracy drops sharply to around 49% on globally optimized algorithms such as dynamic programming. Further analysis uncovers strategic over-shifts, wherein models prematurely abandon correct algorithmic designs due to necessary low-entropy tokens. These findings expose fundamental limitations of problem-centric reinforcement learning and highlight the necessity of an algorithm-centric training paradigm for robust algorithmic reasoning.

[246] How to Set the Batch Size for Large-Scale Pre-training?

Yunhua Zhou, Junhao Huang, Shuhao Xing, Yechen Zhang, Runyu Peng, Qiping Guo, Xipeng Qiu

Main category: cs.AI

TL;DR: The paper revises the Critical Batch Size theory for modern WSD learning rate schedulers, deriving new E(S) relationships, identifying B_min and B_opt thresholds, and proposing a dynamic batch size scheduler that improves training efficiency and model quality.

DetailsMotivation: The original Critical Batch Size theory from OpenAI doesn't align with modern pre-training using Warmup-Stable-Decay (WSD) learning rate schedulers, creating a gap between theory and practice that needs to be addressed.

Method: Derived a revised E(S) relationship tailored for WSD schedulers, analyzed theoretical properties to identify B_min (minimum batch size threshold) and B_opt (optimal batch size for data efficiency), and proposed a dynamic Batch Size Scheduler based on these properties.

Result: Extensive experiments show the revised formula accurately captures large-scale pre-training dynamics, and the proposed scheduling strategy significantly enhances both training efficiency and final model quality.

Conclusion: The paper successfully bridges the theory-practice gap for modern pre-training by updating Critical Batch Size theory for WSD schedulers, providing practical tools (B_min, B_opt, dynamic scheduler) that improve training efficiency and model performance.

Abstract: The concept of Critical Batch Size, as pioneered by OpenAI, has long served as a foundational principle for large-scale pre-training. However, with the paradigm shift towards the Warmup-Stable-Decay (WSD) learning rate scheduler, we observe that the original theoretical framework and its underlying mechanisms fail to align with new pre-training dynamics. To bridge this gap between theory and practice, this paper derives a revised E(S) relationship tailored to the WSD scheduler, characterizing the trade-off between training data consumption E and steps S during pre-training. Our theoretical analysis reveals two fundamental properties of WSD-based pre-training: 1) B_min, the minimum batch size threshold required to achieve a target loss, and 2) B_opt, the optimal batch size that maximizes data efficiency by minimizing total tokens. Building upon these properties, we propose a dynamic Batch Size Scheduler. Extensive experiments demonstrate that our revised formula precisely captures the dynamics of large-scale pre-training, and the resulting scheduling strategy significantly enhances both training efficiency and final model quality.
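
For reference, the original critical-batch-size analysis characterizes the data-steps trade-off as a hyperbola. The abstract does not give the revised WSD-specific form, so the sketch below shows only the classic relation that the paper revises:

```latex
% Classic E(S) trade-off from the original critical-batch-size analysis;
% the paper derives a revised, WSD-specific form of this relationship.
\left(\frac{S}{S_{\min}} - 1\right)\left(\frac{E}{E_{\min}} - 1\right) = 1,
\qquad
B_{\mathrm{crit}} \approx \frac{E_{\min}}{S_{\min}}
```

Here E is the total data consumed and S the number of optimization steps; the paper's B_min and B_opt play analogous threshold and optimum roles under the WSD scheduler.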

[247] Large language models can effectively convince people to believe conspiracies

Thomas H. Costello, Kellin Pelrine, Matthew Kowal, Antonio A. Arechar, Jean-François Godbout, Adam Gleave, David Rand, Gordon Pennycook

Main category: cs.AI

TL;DR: GPT-4o is equally effective at increasing or decreasing conspiracy beliefs, with guardrails doing little to prevent misinformation promotion, though corrective conversations and accuracy prompts can mitigate risks.

DetailsMotivation: To investigate whether LLMs' persuasive power advantages truth over falsehood, or if they can promote misbeliefs as easily as refuting them, particularly regarding conspiracy theories.

Method: Three pre-registered experiments with 2,724 American participants discussing uncertain conspiracy theories with GPT-4o, comparing “debunking” (arguing against) vs “bunking” (arguing for) conditions, including jailbroken and standard versions.

Result: Jailbroken GPT-4o was equally effective at increasing and decreasing conspiracy beliefs. Standard GPT-4o produced similar effects despite guardrails. Bunking AI was rated more positively and increased trust more than debunking AI. Corrective conversations reversed induced beliefs, and accuracy prompts dramatically reduced misinformation promotion.

Conclusion: LLMs possess potent abilities to promote both truth and falsehood equally, with current guardrails insufficient, but potential solutions like corrective conversations and accuracy prompts exist to mitigate risks.

Abstract: Large language models (LLMs) have been shown to be persuasive across a variety of contexts. But it remains unclear whether this persuasive power advantages truth over falsehood, or if LLMs can promote misbeliefs just as easily as refuting them. Here, we investigate this question across three pre-registered experiments in which participants (N = 2,724 Americans) discussed a conspiracy theory they were uncertain about with GPT-4o, and the model was instructed to either argue against (“debunking”) or for (“bunking”) that conspiracy. When using a “jailbroken” GPT-4o variant with guardrails removed, the AI was as effective at increasing conspiracy belief as decreasing it. Concerningly, the bunking AI was rated more positively and increased trust in AI more than the debunking AI did. Surprisingly, we found that using standard GPT-4o produced very similar effects, such that the guardrails imposed by OpenAI did little to prevent the LLM from promoting conspiracy beliefs. Encouragingly, however, a corrective conversation reversed these newly induced conspiracy beliefs, and simply prompting GPT-4o to only use accurate information dramatically reduced its ability to increase conspiracy beliefs. Our findings demonstrate that LLMs possess potent abilities to promote both truth and falsehood, but that potential solutions may exist to help mitigate this risk.

[248] MineNPC-Task: Task Suite for Memory-Aware Minecraft Agents

Tamil Sudaravan Mohan Doss, Michael Xu, Sudha Rao, Andrew D. Wilson, Balasaravanan Thoravi Kumaravel

Main category: cs.AI

TL;DR: MineNPC-Task is a user-authored benchmark for evaluating memory-aware LLM agents in Minecraft, using real player-derived tasks with parametric templates and machine-checkable validators under a bounded-knowledge policy.

DetailsMotivation: Need for better evaluation of memory-aware, mixed-initiative LLM agents in open-world environments like Minecraft, moving beyond synthetic prompts to real user-derived tasks with proper validation.

Method: Tasks elicited through formative/summative co-play with expert players, normalized into parametric templates with explicit preconditions and dependencies. Paired with machine-checkable validators under bounded-knowledge policy. Framework captures plan, action, and memory events including plan previews, clarifications, memory operations, and repairs.

Result: Initial evaluation with GPT-4o on 216 subtasks across 8 players showed recurring breakdown patterns in code execution, inventory handling, referencing, and navigation, but successful recoveries via mixed-initiative clarifications and memory use. Participants rated interaction quality positively but noted need for stronger memory persistence.

Conclusion: MineNPC-Task provides a transparent, reproducible evaluation framework for memory-aware embodied agents, with released task suite, validators, logs, and harness to support future research in this area.

Abstract: We present MineNPC-Task, a user-authored benchmark and evaluation harness for testing memory-aware, mixed-initiative LLM agents in open-world Minecraft. Rather than relying on synthetic prompts, tasks are elicited through formative and summative co-play with expert players, then normalized into parametric templates with explicit preconditions and dependency structure. These tasks are paired with machine-checkable validators under a bounded-knowledge policy that forbids out-of-world shortcuts. The harness captures plan, action, and memory events, including plan previews, targeted clarifications, memory reads and writes, precondition checks, and repair attempts, and reports outcomes relative to the total number of attempted subtasks using only in-world evidence. As an initial snapshot, we instantiate the framework with GPT-4o and evaluate 216 subtasks across 8 experienced players. We observe recurring breakdown patterns in code execution, inventory and tool handling, referencing, and navigation, alongside successful recoveries supported by mixed-initiative clarifications and lightweight memory use. Participants rated interaction quality and interface usability positively, while noting the need for stronger memory persistence across tasks. We release the complete task suite, validators, logs, and evaluation harness to support transparent and reproducible evaluation of future memory-aware embodied agents.
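
A minimal sketch, with hypothetical names and fields, of what a parametric task template paired with a machine-checkable validator under a bounded-knowledge policy could look like; the real templates and validators ship with the released suite.

```python
# Hypothetical illustration of a MineNPC-Task-style template: a parametric
# task with explicit preconditions and a validator that inspects only
# in-world evidence (the bounded-knowledge policy).
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class TaskTemplate:
    name: str                          # e.g., "craft_item"
    params: Dict[str, str]             # template parameters
    preconditions: List[str]           # subtasks that must hold beforehand
    validator: Callable[[dict], bool]  # machine-checkable success test

def craft_validator(world_state: dict) -> bool:
    # Bounded knowledge: look only at the agent's observable inventory,
    # never at out-of-world shortcuts such as server internals.
    inventory = world_state.get("inventory", {})
    return inventory.get("oak_planks", 0) >= 4

craft_planks = TaskTemplate(
    name="craft_item",
    params={"item": "oak_planks", "count": "4"},
    preconditions=["collect_item:oak_log"],
    validator=craft_validator,
)

# A harness would run the agent, then score: passed = validator(final_state)
print(craft_planks.validator({"inventory": {"oak_planks": 4}}))  # True
```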

cs.SD

[249] CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models

Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yaxin Han, Mengying Feng, Yong Qin

Main category: cs.SD

TL;DR: CosyEdit is an end-to-end speech editing model adapted from CosyVoice that achieves state-of-the-art performance with only 250 hours of fine-tuning data, outperforming billion-parameter baselines and matching cascade approaches.

DetailsMotivation: Traditional cascade speech editing systems suffer from complex preprocessing pipelines and reliance on explicit external temporal alignment between speech and text, which limits efficiency and performance.

Method: Adapt CosyVoice (zero-shot TTS model) through task-specific fine-tuning and optimized inference procedure; fine-tune on 250 hours of supervised data from curated GigaEdit dataset; 400M-parameter model internalizes speech-text alignment.

Result: Outperforms several billion-parameter language model baselines; matches performance of state-of-the-art cascade approaches on RealEdit benchmark; achieves reliable speech editing with high consistency between original and edited speech.

Conclusion: Robust and efficient speech editing capabilities can be unlocked from zero-shot TTS models through task-specific fine-tuning and inference optimization, providing a novel and cost-effective end-to-end solution for high-quality speech editing.

Abstract: Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems suffer from complex preprocessing pipelines and a reliance on explicit external temporal alignment. Addressing these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific fine-tuning and an optimized inference procedure, which internalizes speech-text alignment while ensuring high consistency between the speech before and after editing. By fine-tuning on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Experiments on the RealEdit benchmark indicate that CosyEdit not only outperforms several billion-parameter language model baselines but also matches the performance of state-of-the-art cascade approaches. These results demonstrate that, with task-specific fine-tuning and inference optimization, robust and efficient speech editing capabilities can be unlocked from a zero-shot TTS model, yielding a novel and cost-effective end-to-end solution for high-quality speech editing.

[250] SPAM: Style Prompt Adherence Metric for Prompt-based TTS

Chanhee Cho, Nayeon Kim, Bugeun Kim

Main category: cs.SD

TL;DR: SPAM is a new automatic metric for evaluating how well text-to-speech systems adhere to style prompts, addressing limitations of prior evaluation methods by ensuring both plausibility (human-like) and faithfulness (grounded to prompt).

DetailsMotivation: Existing prompt-based TTS evaluation methods lack both plausibility (human-like assessment) and faithfulness (grounded to the prompt), making it difficult to properly measure how well synthesized speech adheres to style prompts.

Method: SPAM factorizes speech into acoustic attributes and aligns them with style prompts using CLAP-inspired approach. It’s trained with supervised contrastive loss to better distinguish different semantic meanings in prompts.

Result: SPAM shows strong correlation with human MOS scores (plausibility) and successfully discriminates different prompt semantics (faithfulness), demonstrating it’s both human-like and properly grounded to prompts.

Conclusion: SPAM provides a viable automatic solution for evaluating style prompt adherence in synthesized speech, addressing both plausibility and faithfulness concerns that previous methods lacked.

Abstract: Prompt-based text-to-speech (TTS) aims to generate speech that adheres to fine-grained style cues provided in a text prompt. However, most prior works rely on measures of prompt adherence that are neither plausible nor faithful; that is, they cannot ensure that the evaluation is both grounded in the prompt and similar to human judgment. Thus, we present a new automatic metric, the Style Prompt Adherence Metric (SPAM), which explicitly satisfies both plausibility and faithfulness. Inspired by CLAP, our approach factorizes speech into acoustic attributes and aligns them with the style prompt. We also train the scorer with a supervised contrastive loss, which provides a clearer distinction between different semantics. We conducted experiments from two perspectives. The plausibility experiment showed that SPAM achieves a strong correlation with the mean opinion score (MOS), and the faithfulness experiment demonstrated that SPAM is successfully grounded in the given style prompt, as it can discriminate between different semantics of the prompt. We believe that SPAM can provide a viable automatic solution for evaluating the style prompt adherence of synthesized speech.
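
A minimal sketch of the supervised contrastive objective the scorer is trained with, assuming L2-normalized attribute embeddings and integer semantic labels; shapes and names are illustrative, not the authors' implementation.

```python
# Standard supervised contrastive loss: anchors are pulled toward all
# same-label embeddings and pushed from the rest, which is how different
# prompt semantics get separated.
import torch
import torch.nn.functional as F

def supcon_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.07):
    """z: (N, d) embeddings; labels: (N,) integer semantic classes."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                              # pairwise similarities
    self_mask = torch.eye(len(z), dtype=torch.bool)
    sim = sim.masked_fill(self_mask, float("-inf"))    # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~self_mask
    # Mean log-likelihood of each anchor's positives; anchors without
    # positives are skipped.
    loss = -(log_prob.masked_fill(~pos, 0.0)).sum(1) / pos.sum(1).clamp(min=1)
    return loss[pos.any(1)].mean()

z = torch.randn(8, 128)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supcon_loss(z, labels))
```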

[251] The ICASSP 2026 HumDial Challenge: Benchmarking Human-like Spoken Dialogue Systems in the LLM Era

Zhixian Zhao, Shuiyuan Wang, Guojian Li, Hongfei Xue, Chengyou Wang, Shuai Wang, Longshuai Xiao, Zihan Zhang, Hui Bu, Xin Xu, Xinsheng Wang, Hexin Liu, Eng Siong Chng, Hung-yi Lee, Haizhou Li, Lei Xie

Main category: cs.SD

TL;DR: The paper introduces the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026, focusing on benchmarking emotional intelligence and real-time interaction capabilities in spoken dialogue systems.

DetailsMotivation: The motivation is to advance spoken dialogue systems toward truly human-like communication by addressing two key capabilities: emotional intelligence (perceiving and resonating with users' emotions) and robust interaction mechanisms (handling dynamic conversation flow like real-time turn-taking).

Method: The method involves creating a challenge anchored by a sizable dataset derived from authentic human conversations, establishing a fair evaluation platform with two tracks: (1) Emotional Intelligence for long-term emotion understanding and empathetic generation, and (2) Full-Duplex Interaction for evaluating real-time decision-making under “listening-while-speaking” conditions.

Result: The paper summarizes the dataset, track configurations, and final results of the HumDial challenge, though specific numerical results are not provided in the abstract.

Conclusion: The HumDial challenge represents a significant step toward benchmarking and advancing human-like spoken dialogue systems by systematically evaluating both emotional intelligence and real-time interaction capabilities.

Abstract: Driven by the rapid advancement of Large Language Models (LLMs), particularly Audio-LLMs and Omni-models, spoken dialogue systems have evolved significantly, progressively narrowing the gap between human-machine and human-human interactions. Achieving truly “human-like” communication necessitates a dual capability: emotional intelligence to perceive and resonate with users' emotional states, and robust interaction mechanisms to navigate the dynamic, natural flow of conversation, such as real-time turn-taking. Therefore, we launched the first Human-like Spoken Dialogue Systems Challenge (HumDial) at ICASSP 2026 to benchmark these dual capabilities. Anchored by a sizable dataset derived from authentic human conversations, this initiative establishes a fair evaluation platform across two tracks: (1) Emotional Intelligence, targeting long-term emotion understanding and empathetic generation; and (2) Full-Duplex Interaction, systematically evaluating real-time decision-making under “listening-while-speaking” conditions. This paper summarizes the dataset, track configurations, and the final results.

[252] Detect All-Type Deepfake Audio: Wavelet Prompt Tuning for Enhanced Auditory Perception

Yuankun Xie, Ruibo Fu, Zhiyong Wang, Xiaopeng Wang, Songjun Cao, Long Ma, Haonan Cheng, Long Ye

Main category: cs.SD

TL;DR: This paper addresses the challenge of all-type audio deepfake detection across speech, sound, singing voice, and music. It establishes a comprehensive benchmark, introduces a prompt tuning self-supervised learning paradigm, and proposes wavelet prompt tuning to capture type-invariant auditory features, achieving state-of-the-art performance with minimal trainable parameters.

DetailsMotivation: The rapid advancement of audio generation technologies has increased risks of malicious deepfake audio across multiple domains (speech, sound, singing voice, music), threatening multimedia security. Existing countermeasures perform well on single-type detection but decline in cross-type scenarios, creating a need for universal all-type audio deepfake detection.

Method: 1) Establish comprehensive all-type ADD benchmark for evaluation; 2) Introduce prompt tuning self-supervised learning (PT-SSL) paradigm that optimizes SSL front-end with specialized prompt tokens (458x fewer parameters than fine-tuning); 3) Propose wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from frequency domain without additional parameters; 4) Utilize all types of deepfake audio for co-training to achieve universal countermeasure.

Result: Experimental results show WPT-XLSR-AASIST achieved best performance with average EER of 3.58% across all evaluation sets. The method significantly outperforms fine-tuning approaches while using far fewer trainable parameters (458x reduction).

Conclusion: The proposed WPT-SSL approach effectively addresses the all-type audio deepfake detection challenge by capturing type-invariant auditory features through frequency domain analysis and prompt tuning. The method demonstrates superior performance and efficiency compared to traditional fine-tuning, providing a practical solution for universal audio deepfake detection across diverse audio types.

Abstract: The rapid advancement of audio generation technologies has escalated the risks of malicious deepfake audio across speech, sound, singing voice, and music, threatening multimedia security and trust. While existing countermeasures (CMs) perform well in single-type audio deepfake detection (ADD), their performance declines in cross-type scenarios. This paper is dedicated to studying the all-type ADD task. We are the first to comprehensively establish an all-type ADD benchmark to evaluate current CMs, incorporating cross-type deepfake detection across speech, sound, singing voice, and music. Then, we introduce the prompt tuning self-supervised learning (PT-SSL) training paradigm, which optimizes the SSL front-end by learning specialized prompt tokens for ADD, requiring 458x fewer trainable parameters than fine-tuning (FT). Considering the auditory perception of different audio types, we propose the wavelet prompt tuning (WPT)-SSL method to capture type-invariant auditory deepfake information from the frequency domain without requiring additional training parameters, thereby enhancing performance over FT in the all-type ADD task. To achieve a universal CM, we utilize all types of deepfake audio for co-training. Experimental results demonstrate that WPT-XLSR-AASIST achieved the best performance, with an average EER of 3.58% across all evaluation sets.
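
A minimal sketch of the prompt-tuning idea behind PT-SSL: only a handful of prepended prompt tokens are trained while the SSL front-end stays frozen. The wavelet-domain prompt construction that distinguishes WPT is not reproduced here, and module names are illustrative.

```python
# Prompt tuning on a frozen encoder: learnable prompt tokens are prepended
# to the frame-level features, so gradient updates touch only the prompts.
import torch
import torch.nn as nn

class PromptTunedFrontEnd(nn.Module):
    def __init__(self, ssl_encoder: nn.Module, dim: int, n_prompts: int = 8):
        super().__init__()
        self.ssl = ssl_encoder
        for p in self.ssl.parameters():
            p.requires_grad = False               # front-end stays frozen
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) frame-level SSL features
        B = feats.size(0)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        return self.ssl(torch.cat([prompts, feats], dim=1))

# A transformer encoder stands in for the SSL back half in this sketch.
enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, 4, batch_first=True), 2)
model = PromptTunedFrontEnd(enc, dim=256)
print(model(torch.randn(2, 100, 256)).shape)  # (2, 108, 256)
```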

[253] Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems

Kamel Kamel, Hridoy Sankar Dutta, Keshav Sood, Sunil Aryal

Main category: cs.SD

TL;DR: SMIA attack manipulates inaudible frequencies in AI-generated audio to bypass voice authentication systems and countermeasures with high success rates, exposing critical security vulnerabilities.

DetailsMotivation: Voice authentication systems are increasingly used in high-security sectors but face vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. Current anti-spoofing countermeasures rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap.

Method: Proposed Spectral Masking and Interpolation Attack (SMIA) - a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in imperceptible zones to the human ear, SMIA creates adversarial samples that sound authentic while deceiving countermeasures.

Result: SMIA achieved: at least 82% attack success rate against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. Comprehensive evaluation against state-of-the-art models under simulated real-world conditions.

Conclusion: Current security postures are insufficient against adaptive adversarial attacks. Urgent need for paradigm shift toward next-generation defenses with dynamic, context-aware frameworks capable of evolving with the threat landscape.

Abstract: Voice Authentication Systems (VAS) use unique vocal characteristics for verification. They are increasingly integrated into high-security sectors such as banking and healthcare. Despite their improvements using deep learning, they face severe vulnerabilities from sophisticated threats like deepfakes and adversarial attacks. The emergence of realistic voice cloning complicates detection, as systems struggle to distinguish authentic from synthetic audio. While anti-spoofing countermeasures (CMs) exist to mitigate these risks, many rely on static detection models that can be bypassed by novel adversarial methods, leaving a critical security gap. To demonstrate this vulnerability, we propose the Spectral Masking and Interpolation Attack (SMIA), a novel method that strategically manipulates inaudible frequency regions of AI-generated audio. By altering the voice in imperceptible zones to the human ear, SMIA creates adversarial samples that sound authentic while deceiving CMs. We conducted a comprehensive evaluation of our attack against state-of-the-art (SOTA) models across multiple tasks, under simulated real-world conditions. SMIA achieved a strong attack success rate (ASR) of at least 82% against combined VAS/CM systems, at least 97.5% against standalone speaker verification systems, and 100% against countermeasures. These findings conclusively demonstrate that current security postures are insufficient against adaptive adversarial attacks. This work highlights the urgent need for a paradigm shift toward next-generation defenses that employ dynamic, context-aware frameworks capable of evolving with the threat landscape.

[254] Zero-Day Audio DeepFake Detection via Retrieval Augmentation and Profile Matching

Xuechen Liu, Xin Wang, Junichi Yamagishi

Main category: cs.SD

TL;DR: Training-free retrieval-augmented framework for zero-day audio deepfake detection using knowledge representations and voice profile matching, achieving performance comparable to supervised methods without additional training.

DetailsMotivation: Modern audio deepfake detectors struggle with zero-day attacks from novel synthesis methods not seen in training data. Conventional fine-tuning approaches are problematic when prompt response is needed, requiring a training-free solution.

Method: Proposes a training-free retrieval-augmented framework leveraging knowledge representations and voice profile matching. Includes simple yet effective retrieval and ensemble methods, with ablation studies on voice profile attributes and cross-database fusion strategies.

Result: Achieves performance comparable to supervised baselines and their fine-tuned counterparts on the DeepFake-Eval-2024 benchmark without any additional model training. Demonstrates cross-database generalizability with simple training-free fusion strategies.

Conclusion: The proposed training-free framework effectively addresses zero-day audio deepfake detection challenges, offering practical advantages over conventional fine-tuning approaches while maintaining competitive performance.

Abstract: Modern audio deepfake detectors built on foundation models and large training datasets achieve promising detection performance. However, they struggle with zero-day attacks, where the audio samples are generated by novel synthesis methods that the models have not seen in their training data. Conventional approaches fine-tune the detector, which can be problematic when a prompt response is needed. This paper proposes a training-free retrieval-augmented framework for zero-day audio deepfake detection that leverages knowledge representations and voice profile matching. Within this framework, we propose simple yet effective retrieval and ensemble methods that reach performance comparable to supervised baselines and their fine-tuned counterparts on the DeepFake-Eval-2024 benchmark, without any additional model training. We also conduct ablations on voice profile attributes and demonstrate the cross-database generalizability of the framework by introducing simple, training-free fusion strategies.
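
A minimal sketch of training-free retrieval-based scoring, assuming a bank of labeled exemplar embeddings; the voting rule and names are illustrative rather than the paper's exact retrieval and profile-matching procedure.

```python
# Embed a query utterance, retrieve the nearest labeled exemplars by cosine
# similarity, and vote -- no detector retraining required.
import numpy as np

def retrieval_score(query_emb, bank_embs, bank_labels, k=5):
    """bank_labels: 1 = bona fide, 0 = spoof. Returns P(bona fide)."""
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q                               # cosine similarity to the bank
    top = np.argsort(-sims)[:k]                # k nearest exemplars
    return float(np.mean(bank_labels[top]))    # similarity weighting also works

rng = np.random.default_rng(0)
bank = rng.normal(size=(100, 64))              # exemplar embeddings
labels = rng.integers(0, 2, size=100)
print(retrieval_score(rng.normal(size=64), bank, labels))
```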

[255] IndexTTS 2.5 Technical Report

Yunpei Li, Xun Zhou, Jinchao Wang, Lu Wang, Yong Wu, Siyi Zhou, Yiquan Zhou, Jingchen Shu

Main category: cs.SD

TL;DR: IndexTTS 2.5 enhances the zero-shot TTS foundation model with multilingual support, faster inference, and better quality through semantic compression, architectural upgrades, cross-lingual strategies, and RL optimization.

DetailsMotivation: To improve upon IndexTTS 2 by expanding multilingual coverage, increasing inference speed, and enhancing overall synthesis quality while maintaining zero-shot emotional prosody replication capabilities.

Method: Four key improvements: 1) Semantic codec compression (50Hz→25Hz), 2) Architectural upgrade (U-DiT→Zipformer), 3) Multilingual extension with cross-lingual strategies, 4) Reinforcement learning optimization (GRPO) for T2S module.

Result: Achieves 2.28× RTF improvement while maintaining comparable WER and speaker similarity to IndexTTS 2. Supports Chinese, English, Japanese, Spanish with robust emotion transfer without target-language emotional training data.

Conclusion: IndexTTS 2.5 successfully enhances multilingual coverage, inference speed, and synthesis quality while preserving zero-shot emotional prosody replication capabilities across languages.

Abstract: In prior work, we introduced IndexTTS 2, a zero-shot neural text-to-speech foundation model comprising two core components: a transformer-based Text-to-Semantic (T2S) module and a non-autoregressive Semantic-to-Mel (S2M) module, which together enable faithful emotion replication and establish the first autoregressive duration-controllable generative paradigm. Building upon this, we present IndexTTS 2.5, which significantly enhances multilingual coverage, inference speed, and overall synthesis quality through four key improvements: 1) Semantic Codec Compression: we reduce the semantic codec frame rate from 50 Hz to 25 Hz, halving sequence length and substantially lowering both training and inference costs; 2) Architectural Upgrade: we replace the U-DiT-based backbone of the S2M module with a more efficient Zipformer-based modeling architecture, achieving notable parameter reduction and faster mel-spectrogram generation; 3) Multilingual Extension: We propose three explicit cross-lingual modeling strategies, boundary-aware alignment, token-level concatenation, and instruction-guided generation, establishing practical design principles for zero-shot multilingual emotional TTS that supports Chinese, English, Japanese, and Spanish, and enables robust emotion transfer even without target-language emotional training data; 4) Reinforcement Learning Optimization: we apply GRPO in post-training of the T2S module, improving pronunciation accuracy and naturalness. Experiments show that IndexTTS 2.5 not only supports broader language coverage but also replicates emotional prosody in unseen languages under the same zero-shot setting. IndexTTS 2.5 achieves a 2.28 times improvement in RTF while maintaining comparable WER and speaker similarity to IndexTTS 2.

cs.LG

[256] MoEBlaze: Breaking the Memory Wall for Efficient MoE Training on Modern GPUs

Jiyuan Zhang, Yining Liu, Siqi Yan, Lisen Deng, Jennifer Cao, Shuqi Yang, Min Ni, Bi Xue, Shen Li

Main category: cs.LG

TL;DR: MoEBlaze: A memory-efficient MoE training framework that reduces activation memory overheads and achieves 4x speedups with 50% memory savings compared to existing frameworks.

DetailsMotivation: Modern Mixture-of-Experts (MoE) architectures suffer from amplified "memory wall" bottlenecks due to sparse arithmetic compute and substantial activation memory overheads from large token routing buffers and intermediate tensor materialization, limiting batch size, sequence length, and causing excessive data movements.

Method: Co-designed system approach with: (1) end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materializing, and (2) co-designed kernels with smart activation checkpoint to reduce memory footprint while improving performance.

Result: MoEBlaze achieves over 4x speedups and over 50% memory savings compared to existing MoE frameworks.

Conclusion: MoEBlaze effectively addresses the memory bottleneck in MoE training through a co-designed system approach, enabling more efficient model scaling and better performance.

Abstract: The pervasive “memory wall” bottleneck is significantly amplified in modern large-scale Mixture-of-Experts (MoE) architectures. MoE’s inherent architectural sparsity leads to sparse arithmetic compute and also introduces substantial activation memory overheads – driven by large token routing buffers and the need to materialize and buffer intermediate tensors. This memory pressure limits the maximum batch size and sequence length that can fit on GPUs, and also results in excessive data movements that hinder performance and efficient model scaling. We present MoEBlaze, a memory-efficient MoE training framework that addresses these issues through a co-designed system approach: (i) an end-to-end token dispatch and MoE training method with optimized data structures to eliminate intermediate buffers and activation materializing, and (ii) co-designed kernels with smart activation checkpoint to mitigate memory footprint while simultaneously achieving better performance. We demonstrate that MoEBlaze can achieve over 4x speedups and over 50% memory savings compared to existing MoE frameworks.
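
One ingredient of the described memory savings, activation checkpointing for expert MLPs, can be sketched with the standard PyTorch API; MoEBlaze's fused token-dispatch kernels and optimized data structures are not reproduced here.

```python
# Checkpoint an expert's feed-forward block: the large (tokens, d_ff)
# intermediate activation is recomputed during backward instead of being
# materialized during forward, trading compute for memory.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedExpert(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return checkpoint(self.ff, x, use_reentrant=False)

x = torch.randn(4096, 512, requires_grad=True)   # routed tokens
y = CheckpointedExpert()(x)
y.sum().backward()
print(x.grad.shape)  # gradients flow as usual, with a smaller peak footprint
```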

[257] TIME: Temporally Intelligent Meta-reasoning Engine for Context Triggered Explicit Reasoning

Susmit Das

Main category: cs.LG

TL;DR: TIME framework trains LLMs to use brief, context-sensitive reasoning bursts with temporal awareness instead of long pre-response thinking traces, improving efficiency and temporal reasoning while reducing token usage.

DetailsMotivation: Current reasoning-oriented LLMs use long, turn-global thinking traces that are costly, reduce auditability, cannot be re-triggered, and lack temporal awareness in dialogue contexts.

Method: TIME framework introduces ISO 8601 time tags, tick turns for silent gaps, and short blocks anywhere in replies. Uses 4-phase curriculum with full-batch alignment to train Qwen3 models for context-sensitive reasoning bursts.

Result: Across 4B to 32B scales, TIME improves TIMEBench scores over base Qwen3 in both thinking and no-thinking modes while reducing reasoning tokens by about 10x.

Conclusion: TIME demonstrates that explicit reasoning can be made context-sensitive and temporally aware, improving dialogue coherence and efficiency while maintaining reasoning capabilities.

Abstract: Reasoning oriented large language models often expose explicit “thinking” as long, turn-global traces at the start of every response, either always on or toggled externally at inference time. While useful for arithmetic, programming, and problem solving, this design is costly, blurs claim level auditability, and cannot re-trigger explicit reasoning once the model begins presenting. Dialogue models are also largely blind to temporal structure, treating replies after seconds and replies after weeks as equivalent unless time is stated in text. We introduce TIME, the Temporally Intelligent Meta-reasoning Engine, a behavioral alignment framework that treats explicit reasoning as a context sensitive resource driven by discourse and temporal cues. TIME augments dialogue with optional ISO 8601 time tags, tick turns that mark silent gaps, and short reasoning blocks that may appear anywhere in a reply, and trains Qwen3 models with a four-phase curriculum using full-batch alignment. Across 4B to 32B scales, TIME improves TIMEBench scores over base Qwen3 in both thinking and no-thinking modes while reducing reasoning tokens by roughly 10x.
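
A minimal sketch, with hypothetical tag formats, of how ISO 8601 time tags and tick turns for silent gaps might be rendered into a dialogue history so a model can condition on elapsed time; the exact tags are defined by the TIME framework.

```python
# Render a timestamped dialogue: each turn carries an ISO 8601 tag, and a
# "tick" turn is inserted whenever the silent gap exceeds a threshold.
from datetime import datetime, timedelta

def render_turn(role: str, text: str, ts: datetime) -> str:
    return f"[{ts.isoformat(timespec='seconds')}] {role}: {text}"

def render_dialogue(turns, tick_gap=timedelta(hours=1)):
    lines, prev = [], None
    for role, text, ts in turns:
        if prev is not None and ts - prev > tick_gap:
            lines.append(f"[{ts.isoformat(timespec='seconds')}] <tick gap={ts - prev}>")
        lines.append(render_turn(role, text, ts))
        prev = ts
    return "\n".join(lines)

t0 = datetime(2026, 1, 5, 9, 0)
print(render_dialogue([
    ("user", "Remind me about the demo.", t0),
    ("assistant", "It is at 3pm today.", t0 + timedelta(seconds=20)),
    ("user", "How did it go again?", t0 + timedelta(weeks=2)),
]))
```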

[258] Ontology Neural Networks for Topologically Conditioned Constraint Satisfaction

Jaehong Oh

Main category: cs.LG

TL;DR: Enhanced neuro-symbolic framework integrates topological conditioning with gradient stabilization to maintain semantic coherence while satisfying constraints, achieving 95% success rate and significant energy reduction.

DetailsMotivation: Neuro-symbolic reasoning systems struggle to maintain semantic coherence while satisfying physical and logical constraints. The authors aim to enhance their previous Ontology Neural Networks by addressing these fundamental challenges through better integration of topological structure and gradient stability.

Method: The framework combines three key components: 1) Forman-Ricci curvature to capture graph topology, 2) Deep Delta Learning for stable rank-one perturbations during constraint projection, and 3) Covariance Matrix Adaptation Evolution Strategy (CMA-ES) for parameter optimization.

Result: The method achieves mean energy reduction to 1.15 (compared to baseline 11.68) with 95% success rate in constraint satisfaction tasks. It exhibits seed-independent convergence and scales gracefully up to twenty-node problems.

Conclusion: Topological structure can effectively inform gradient-based optimization in neuro-symbolic systems without sacrificing interpretability or computational efficiency, enabling better constraint satisfaction and semantic coherence.

Abstract: Neuro-symbolic reasoning systems face fundamental challenges in maintaining semantic coherence while satisfying physical and logical constraints. Building upon our previous work on Ontology Neural Networks, we present an enhanced framework that integrates topological conditioning with gradient stabilization mechanisms. The approach employs Forman-Ricci curvature to capture graph topology, Deep Delta Learning for stable rank-one perturbations during constraint projection, and Covariance Matrix Adaptation Evolution Strategy for parameter optimization. Experimental evaluation across multiple problem sizes demonstrates that the method achieves mean energy reduction to 1.15 compared to baseline values of 11.68, with 95 percent success rate in constraint satisfaction tasks. The framework exhibits seed-independent convergence and graceful scaling behavior up to twenty-node problems, suggesting that topological structure can inform gradient-based optimization without sacrificing interpretability or computational efficiency.
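
For reference, the combinatorial Forman-Ricci curvature of an edge (u, v) in an unweighted graph is F(u, v) = 4 - deg(u) - deg(v); the augmented form adds three times the edge's triangle count. A minimal sketch of the curvature computation follows; how the paper feeds this signal into constraint projection is not reproduced here.

```python
# Edge-wise combinatorial Forman-Ricci curvature on an unweighted graph.
import networkx as nx

def forman_curvature(G: nx.Graph) -> dict:
    return {(u, v): 4 - G.degree(u) - G.degree(v) for u, v in G.edges()}

G = nx.cycle_graph(5)
print(forman_curvature(G))  # every edge on a cycle: 4 - 2 - 2 = 0
```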

[259] Indonesian Multimodal Emotion Recognition via Auxiliary-Enhanced LLM Adaptation

Xueming Yan, Boyan Xu, Yaochu Jin, Lixian Xiao, Wenlong Ye, Runyang Cai, Zeqi Zheng, Jingfa Liu, Aimin Yang, Yongduan Song

Main category: cs.LG

TL;DR: Introduces IndoMER, the first Indonesian multimodal emotion recognition benchmark, and OmniMER framework that uses auxiliary modality-specific tasks to improve emotion recognition in low-resource settings.

DetailsMotivation: Indonesian language is widely spoken but underserved in multimodal emotion recognition research, despite its dominance on Southeast Asian social media platforms. There's a need for culturally relevant datasets and methods for this important language.

Method: Proposes OmniMER framework built on Qwen2.5-Omni with three auxiliary modality-specific perception tasks: emotion keyword extraction (text), facial expression analysis (video), and prosody analysis (audio). These tasks help identify emotion-relevant cues before fusion.

Result: OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition on IndoMER, outperforming base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on Chinese CH-SIMS dataset shows generalizability.

Conclusion: The work addresses the gap in Indonesian multimodal emotion recognition with a culturally relevant dataset and effective framework that uses auxiliary tasks to handle cross-modal inconsistency and long-tailed distributions common in real-world Indonesian communication.

Abstract: Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. https://github.com/yanxm01/INDOMER

[260] When the Server Steps In: Calibrated Updates for Fair Federated Learning

Tianrun Yu, Kaixiang Zhao, Cheng Zhang, Anjun Gao, Yueyang Quan, Zhuqing Liu, Minghong Fang

Main category: cs.LG

TL;DR: EquFL is a server-side debiasing method for federated learning that reduces bias without modifying client protocols, using calibrated updates to produce fairer global models.

DetailsMotivation: Federated learning faces fairness challenges across demographic groups, and existing debiasing methods either require client protocol modifications or lack flexible aggregation strategies.

Method: EquFL operates on the server side by generating a single calibrated update after receiving client model updates, then integrating this with aggregated client updates to produce an adjusted global model that reduces bias.

Result: Theoretically, EquFL converges to the optimal global model achieved by FedAvg and reduces fairness loss over training rounds. Empirically, it significantly mitigates bias in the system.

Conclusion: EquFL provides an effective server-side solution for bias mitigation in federated learning that maintains compatibility with existing client protocols while improving fairness.

Abstract: Federated learning (FL) has emerged as a transformative distributed learning paradigm, enabling multiple clients to collaboratively train a global model under the coordination of a central server without sharing their raw training data. While FL offers notable advantages, it faces critical challenges in ensuring fairness across diverse demographic groups. To address these fairness concerns, various fairness-aware debiasing methods have been proposed. However, many of these approaches either require modifications to clients’ training protocols or lack flexibility in their aggregation strategies. In this work, we address these limitations by introducing EquFL, a novel server-side debiasing method designed to mitigate bias in FL systems. EquFL operates by allowing the server to generate a single calibrated update after receiving model updates from the clients. This calibrated update is then integrated with the aggregated client updates to produce an adjusted global model that reduces bias. Theoretically, we establish that EquFL converges to the optimal global model achieved by FedAvg and effectively reduces fairness loss over training rounds. Empirically, we demonstrate that EquFL significantly mitigates bias within the system, showcasing its practical effectiveness.
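
A minimal sketch of the server-side step: client updates are averaged as in FedAvg, then blended with a single calibrated update. How the calibrated update is computed (here assumed to come from, e.g., a gradient step on a small server-held fairness objective) is an assumption of this sketch, not the paper's stated rule.

```python
# Server-side debiasing in the spirit of EquFL: no change to client protocols;
# the server mixes one calibrated update into the aggregated model update.
import numpy as np

def aggregate(client_updates, calibrated_update, alpha=0.1):
    """client_updates: list of (d,) arrays; alpha: calibration weight."""
    fedavg = np.mean(client_updates, axis=0)        # plain FedAvg step
    return (1 - alpha) * fedavg + alpha * calibrated_update

rng = np.random.default_rng(1)
updates = [rng.normal(size=10) for _ in range(5)]   # simulated client updates
calib = rng.normal(size=10)   # e.g., gradient of a group-fairness loss
print(aggregate(updates, calib))
```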

[261] GlyRAG: Context-Aware Retrieval-Augmented Framework for Blood Glucose Forecasting

Shovito Barua Soumma, Hassan Ghasemzadeh

Main category: cs.LG

TL;DR: GlyRAG is a context-aware, retrieval-augmented framework for blood glucose forecasting that uses LLMs to extract clinical context from CGM traces without additional sensors, achieving superior accuracy and clinical reliability.

DetailsMotivation: Current glucose forecasting models treat CGM data as numerical sequences, ignoring clinical context or requiring additional sensors that are difficult to deploy at scale. LLMs show promise for time-series forecasting but their role as contextual agents in diabetes care remains unexplored.

Method: GlyRAG uses an LLM as a contextualization agent to generate clinical summaries from CGM traces. These summaries are embedded and fused with patch-based glucose representations in a multimodal transformer with cross-translation loss. A retrieval module identifies similar historical episodes and uses cross-attention to integrate case-based analogues before forecasting.

Result: GlyRAG outperforms state-of-the-art methods on two T1D cohorts, achieving up to 39% lower RMSE and 1.7% further reduction over baseline. It places 85% predictions in safe zones and achieves 51% improvement in predicting dysglycemic events across both cohorts.

Conclusion: LLM-based contextualization and retrieval over CGM traces can enhance accuracy and clinical reliability of long-horizon glucose forecasting without extra sensors, supporting future agentic decision-support tools for diabetes management.

Abstract: Accurate forecasting of blood glucose from CGM is essential for preventing dysglycemic events, thus enabling proactive diabetes management. However, current forecasting models treat blood glucose readings captured using CGMs as a numerical sequence, either ignoring context or relying on additional sensors/modalities that are difficult to collect and deploy at scale. Recently, LLMs have shown promise for time-series forecasting tasks, yet their role as agentic context extractors in diabetes care remains largely unexplored. To address these limitations, we propose GlyRAG, a context-aware, retrieval-augmented forecasting framework that derives semantic understanding of blood glucose dynamics directly from CGM traces without requiring additional sensor modalities. GlyRAG employs an LLM as a contextualization agent to generate clinical summaries. These summaries are embedded by a language model and fused with patch-based glucose representations in a multimodal transformer architecture with a cross-translation loss aligning textual and physiological embeddings. A retrieval module then identifies similar historical episodes in the learned embedding space and uses cross-attention to integrate these case-based analogues prior to making a forecasting inference. Extensive evaluations on two T1D cohorts show that GlyRAG consistently outperforms state-of-the-art methods, achieving up to 39% lower RMSE and a further 1.7% reduction in RMSE over the baseline. Clinical evaluation shows that GlyRAG places 85% predictions in safe zones and achieves 51% improvement in predicting dysglycemic events across both cohorts. These results indicate that LLM-based contextualization and retrieval over CGM traces can enhance the accuracy and clinical reliability of long-horizon glucose forecasting without the need for extra sensors, thus supporting future agentic decision-support tools for diabetes management.

[262] The Kernel Manifold: A Geometric Approach to Gaussian Process Model Selection

Md Shafiqul Islam, Shakti Prasad Padhy, Douglas Allaire, Raymundo Arróyave

Main category: cs.LG

TL;DR: Bayesian optimization framework for GP kernel selection using kernel-of-kernels geometry and MDS embedding to create continuous kernel manifold for efficient search.

DetailsMotivation: Kernel selection is critical for Gaussian Process regression performance but remains challenging and computationally expensive, requiring an efficient automated approach.

Method: Uses Bayesian optimization with kernel-of-kernels geometry, expected divergence distances between GP priors, and MDS embedding to map discrete kernel library into continuous Euclidean manifold for smooth BO search.

Result: Demonstrated superior predictive accuracy and uncertainty calibration on synthetic benchmarks, real-world time-series datasets, and additive manufacturing case study compared to baselines including LLM-guided search.

Conclusion: Establishes reusable probabilistic geometry for kernel search with direct relevance to GP modeling and deep kernel learning, providing efficient automated kernel selection framework.

Abstract: Gaussian Process (GP) regression is a powerful nonparametric Bayesian framework, but its performance depends critically on the choice of covariance kernel. Selecting an appropriate kernel is therefore central to model quality, yet remains one of the most challenging and computationally expensive steps in probabilistic modeling. We present a Bayesian optimization framework built on kernel-of-kernels geometry, using expected divergence-based distances between GP priors to explore kernel space efficiently. A multidimensional scaling (MDS) embedding of this distance matrix maps a discrete kernel library into a continuous Euclidean manifold, enabling smooth BO. In this formulation, the input space comprises kernel compositions, the objective is the log marginal likelihood, and featurization is given by the MDS coordinates. When the divergence yields a valid metric, the embedding preserves geometry and produces a stable BO landscape. We demonstrate the approach on synthetic benchmarks, real-world time-series datasets, and an additive manufacturing case study predicting melt-pool geometry, achieving superior predictive accuracy and uncertainty calibration relative to baselines including Large Language Model (LLM)-guided search. This framework establishes a reusable probabilistic geometry for kernel search, with direct relevance to GP modeling and deep kernel learning.
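
A minimal sketch of the embedding step, with a Frobenius-norm distance between prior covariance matrices standing in for the paper's expected-divergence distance; the kernel library here is a toy stand-in.

```python
# Embed a discrete library of GP kernels into a continuous "kernel manifold":
# compute pairwise distances between the priors each kernel induces on shared
# inputs, then apply MDS. A BO loop can then search the resulting coordinates.
import numpy as np
from sklearn.manifold import MDS
from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic

X = np.linspace(0, 1, 30).reshape(-1, 1)
kernels = [RBF(0.2), RBF(1.0), Matern(0.5), Matern(2.0), RationalQuadratic()]
K = [k(X) for k in kernels]                    # prior covariance matrices

n = len(K)
D = np.array([[np.linalg.norm(K[i] - K[j]) for j in range(n)]
              for i in range(n)])              # stand-in distance matrix

coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(D)
print(coords)   # continuous coordinates over which BO can operate
```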

[263] Interactive Distillation for Cooperative Multi-Agent Reinforcement Learning

Minwoo Cho, Batuhan Altundas, Matthew Gombolay

Main category: cs.LG

TL;DR: HINT is a hierarchical interactive teacher-based knowledge distillation framework for MARL that addresses key bottlenecks in KD through hierarchical RL for scalable teaching, pseudo off-policy RL for OOD adaptation, and performance-based filtering to reduce observation mismatches.

DetailsMotivation: Knowledge distillation in MARL faces three key bottlenecks: (1) difficulty synthesizing high-performing teaching policies in complex domains, (2) challenges when teachers must reason in out-of-distribution states, and (3) mismatches between decentralized students' and centralized teacher's observation spaces.

Method: HINT uses hierarchical RL to create a scalable, high-performing teacher. It introduces pseudo off-policy RL to update teacher policies using both teacher and student experience for better OOD adaptation. Performance-based filtering retains only outcome-relevant guidance to reduce observation mismatches.

Result: HINT outperforms baselines on challenging cooperative domains (FireCommander for resource allocation, MARINE for tactical combat), achieving improvements of 60% to 165% in success rate.

Conclusion: HINT effectively addresses key bottlenecks in knowledge distillation for MARL through its hierarchical interactive teacher framework, demonstrating significant performance improvements in complex cooperative domains.

Abstract: Knowledge distillation (KD) has the potential to accelerate MARL by employing a centralized teacher for decentralized students but faces key bottlenecks. Specifically, there are (1) challenges in synthesizing high-performing teaching policies in complex domains, (2) difficulties when teachers must reason in out-of-distribution (OOD) states, and (3) mismatches between the decentralized students’ and the centralized teacher’s observation spaces. To address these limitations, we propose HINT (Hierarchical INteractive Teacher-based transfer), a novel KD framework for MARL in a centralized training, decentralized execution setup. By leveraging hierarchical RL, HINT provides a scalable, high-performing teacher. Our key innovation, pseudo off-policy RL, enables the teacher policy to be updated using both teacher and student experience, thereby improving OOD adaptation. HINT also applies performance-based filtering to retain only outcome-relevant guidance, reducing observation mismatches. We evaluate HINT on challenging cooperative domains (e.g., FireCommander for resource allocation, MARINE for tactical combat). Across these benchmarks, HINT outperforms baselines, achieving improvements of 60% to 165% in success rate.

[264] Inverting Non-Injective Functions with Twin Neural Network Regression

Sebastian J. Wetzel

Main category: cs.LG

TL;DR: Twin neural network regression with k-nearest neighbor search provides a deterministic framework for inverting non-injective functions by finding input parameters for given target variables.

DetailsMotivation: Non-injective functions are not invertible in general, but they can be inverted locally on sub-domains where they are injective, or by selecting preferred solutions when multiple solutions exist. There's a need for practical methods to invert such functions in applications like robot arm control.

Method: Uses twin neural network regression trained to predict adjustments to known input variables (x^anchor) to estimate unknown inputs (x^new) when target variables change from y^anchor to y^new. Combines this with k-nearest neighbor search to create a deterministic framework for finding input parameters for given target variables of non-injective functions.

Result: The method successfully inverts non-injective functions in both data-defined and mathematically-defined scenarios, including toy problems and robot arm control applications.

Conclusion: Twin neural network regression with k-nearest neighbor search provides an effective deterministic approach for inverting non-injective functions, handling both data-driven and analytical function inversion problems.

Abstract: Non-injective functions are not invertible. However, non-injective functions can be restricted to sub-domains on which they are locally injective and surjective, and are thus invertible when the input and output spaces have the same dimensionality. Further, even if the dimensionalities do not match, it is often possible to choose a preferred solution from the many possible solutions. Twin neural network regression is naturally capable of incorporating these properties to invert non-injective functions. Twin neural network regression is trained to predict adjustments to well-known input variables $\mathbf{x}^{\text{anchor}}$ to obtain an estimate for an unknown $\mathbf{x}^{\text{new}}$ under a change of the target variable from $\mathbf{y}^{\text{anchor}}$ to $\mathbf{y}^{\text{new}}$. In combination with k-nearest neighbor search, I propose a deterministic framework that finds input parameters for a given target value of non-injective functions. The method is demonstrated by inverting non-injective functions describing toy problems and robot arm control that are (a) defined by data or (b) given as a mathematical formula.
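
One consistent reading of the inversion setup, in the abstract's notation; the trained network G that outputs the input adjustment is an assumption of this sketch, and k-nearest neighbor search over known pairs supplies suitable anchors.

```latex
% G maps an anchor input and a desired change in the target to an input
% adjustment; locality of the adjustment keeps the inverse single-valued.
\mathbf{x}^{\text{new}} \approx \mathbf{x}^{\text{anchor}}
  + G\!\left(\mathbf{x}^{\text{anchor}},\,
             \mathbf{y}^{\text{new}} - \mathbf{y}^{\text{anchor}}\right)
```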

[265] Imitation Learning for Combinatorial Optimisation under Uncertainty

Prakash Gawas, Antoine Legrain, Louis-Martin Rousseau

Main category: cs.LG

TL;DR: Systematic taxonomy of experts for imitation learning in combinatorial optimization under uncertainty, with classification along three dimensions and a generalized DAgger algorithm.

DetailsMotivation: Existing imitation learning approaches for combinatorial optimization use diverse expert constructions without a unifying framework to characterize their assumptions, computational properties, and impact on learning performance.

Method: Proposes a three-dimensional taxonomy of experts (treatment of uncertainty, optimality level, interaction mode) and develops a generalized Dataset Aggregation (DAgger) algorithm supporting multiple expert queries, aggregation, and flexible interaction strategies.

Result: Evaluation on dynamic physician-to-patient assignment shows policies from stochastic experts outperform deterministic/full-information experts, interactive learning improves quality with fewer demonstrations, and aggregated deterministic experts provide effective alternatives when stochastic optimization is computationally challenging.

Conclusion: The taxonomy provides a systematic framework for expert selection in imitation learning for combinatorial optimization, with stochastic experts generally superior but aggregated deterministic experts offering practical alternatives for computationally intensive problems.

Abstract: Imitation learning (IL) provides a data-driven framework for approximating policies for large-scale combinatorial optimisation problems formulated as sequential decision problems (SDPs), where exact solution methods are computationally intractable. A central but underexplored aspect of IL in this context is the role of the “expert” that generates training demonstrations. Existing studies employ a wide range of expert constructions, yet lack a unifying framework to characterise their modelling assumptions, computational properties, and impact on learning performance. This paper introduces a systematic taxonomy of experts for IL in combinatorial optimisation under uncertainty. Experts are classified along three dimensions: (i) their treatment of uncertainty, including myopic, deterministic, full-information, two-stage stochastic, and multi-stage stochastic formulations; (ii) their level of optimality, distinguishing task-optimal and approximate experts; and (iii) their interaction mode with the learner, ranging from one-shot supervision to iterative, interactive schemes. Building on this taxonomy, we propose a generalised Dataset Aggregation (DAgger) algorithm that supports multiple expert queries, expert aggregation, and flexible interaction strategies. The proposed framework is evaluated on a dynamic physician-to-patient assignment problem with stochastic arrivals and capacity constraints. Computational experiments compare learning outcomes across expert types and interaction regimes. The results show that policies learned from stochastic experts consistently outperform those learned from deterministic or full-information experts, while interactive learning improves solution quality using fewer expert demonstrations. Aggregated deterministic experts provide an effective alternative when stochastic optimisation becomes computationally challenging.
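
A skeleton of the generalised DAgger loop described above, with toy stand-ins: a 1-D state space, two hypothetical experts, majority-style aggregation, and a trivial threshold learner. All component names are illustrative, not the paper's implementation.

```python
# Generalized DAgger: roll out the current learner, query several experts on
# the visited states, aggregate their advice, and retrain on all data so far.
import random

def rollout(policy, n=50):
    # Visit states under the current learner (here: states arrive at random).
    return [random.uniform(-1, 1) for _ in range(n)]

def fit(dataset):
    # Trivial learner: threshold halfway between mean positive/negative states.
    pos = [s for s, y in dataset if y == 1]
    neg = [s for s, y in dataset if y == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2 if pos and neg else 0.0
    return lambda s: int(s > thr)

def dagger(experts, aggregate, n_iters=5):
    policy, dataset = (lambda s: 0), []
    for _ in range(n_iters):
        for s in rollout(policy):
            labels = [expert(s) for expert in experts]   # multi-expert queries
            dataset.append((s, aggregate(labels)))       # expert aggregation
        policy = fit(dataset)                            # retrain on all data
    return policy

experts = [lambda s: int(s > 0.0), lambda s: int(s > 0.1)]
policy = dagger(experts, aggregate=lambda ls: int(sum(ls) > len(ls) / 2))
print(policy(0.5), policy(-0.5))  # 1 0
```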

[266] DynaSTy: A Framework for SpatioTemporal Node Attribute Prediction in Dynamic Graphs

Namrata Banerji, Tanya Berger-Wolf

Main category: cs.LG

TL;DR: Dynamic Edge-Biased Spatiotemporal Transformer for node attribute forecasting on evolving graphs with time-varying adjacency matrices.

DetailsMotivation: Existing spatiotemporal GNNs assume static graphs, but many real-world applications (financial networks, biological networks, social systems) have dynamic graphs that evolve over time. Need models that can handle time-varying adjacency matrices and forecast node attributes across multiple future steps.

Method: Transformer-based model that ingests time series of node attributes and adjacency matrices, injecting adjacency as adaptable attention bias at each time step. Uses masked node-time pretraining to reconstruct missing features, scheduled sampling, and horizon-weighted loss to mitigate compounding error.

Result: Consistently outperforms strong baselines on RMSE and MAE metrics across various dynamic graph forecasting tasks.

Conclusion: Proposed dynamic edge-biased spatiotemporal model effectively handles evolving graphs and enables forecasting in multi-system settings with varying graph structures across samples.

Abstract: Accurate multistep forecasting of node-level attributes on dynamic graphs is critical for applications ranging from financial trust networks to biological networks. Existing spatiotemporal graph neural networks typically assume a static adjacency matrix. In this work, we propose an end-to-end dynamic edge-biased spatiotemporal model that ingests a multi-dimensional timeseries of node attributes and a timeseries of adjacency matrices, to predict multiple future steps of node attributes. At each time step, our transformer-based model injects the given adjacency as an adaptable attention bias, allowing the model to focus on relevant neighbors as the graph evolves. We further deploy a masked node-time pretraining objective that primes the encoder to reconstruct missing features, and train with scheduled sampling and a horizon-weighted loss to mitigate compounding error over long horizons. Unlike prior work, our model accommodates dynamic graphs that vary across input samples, enabling forecasting in multi-system settings such as brain networks across different subjects, financial systems in different contexts, or evolving social systems. Empirical results demonstrate that our method consistently outperforms strong baselines on Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE).

[267] Efficient Inference for Noisy LLM-as-a-Judge Evaluation

Yiqun T Chen, Sizhu Lu, Sijia Li, Moran Guo, Shengyi Li

Main category: cs.LG

TL;DR: This paper systematically compares two approaches for debiasing LLM-as-a-judge evaluations: measurement-error correction (Rogan-Gladen-style) and surrogate-outcome methods (prediction-powered inference), deriving efficient estimators and characterizing when PPI-style methods have lower variance.

DetailsMotivation: LLMs are increasingly used as automatic evaluators (LLM-as-a-judge), but they make systematic, non-random errors. Existing debiasing approaches need systematic comparison to understand their relative performance and optimal use cases.

Method: The paper uses semiparametric efficiency theory to unify two classes of estimators: (1) measurement-error correction based on misclassification models, and (2) surrogate-outcome approaches like prediction-powered inference. It derives explicit forms of efficient influence function-based efficient estimators and characterizes conditions under which PPI-style estimators achieve strictly smaller asymptotic variance.
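
For intuition, the two estimator families being compared can be sketched in their textbook forms; the paper's EIF-based efficient estimators refine these, so treat this snippet as a baseline illustration only:

```python
import numpy as np

def rogan_gladen(judge_unlab, judge_lab, y_lab):
    """Measurement-error correction: debias the judge's observed pass rate
    using sensitivity/specificity estimated on the gold-labeled subset."""
    se = judge_lab[y_lab == 1].mean()        # P(judge says 1 | truth is 1)
    sp = 1.0 - judge_lab[y_lab == 0].mean()  # P(judge says 0 | truth is 0)
    return (judge_unlab.mean() + sp - 1.0) / (se + sp - 1.0)

def ppi_mean(judge_unlab, judge_lab, y_lab):
    """Prediction-powered inference: judge mean on unlabeled data plus a
    residual ("rectifier") correction estimated on the labeled subset."""
    return judge_unlab.mean() + (y_lab - judge_lab).mean()

# Toy benchmark: true pass rate 0.6, judge correct 85% of the time.
rng = np.random.default_rng(0)
truth = (rng.random(5000) < 0.6).astype(float)
judge = np.where(rng.random(5000) < 0.85, truth, 1.0 - truth)
y_lab, judge_lab, judge_unlab = truth[:500], judge[:500], judge[500:]
print(rogan_gladen(judge_unlab, judge_lab, y_lab))  # both land near 0.6
print(ppi_mean(judge_unlab, judge_lab, y_lab))
```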

Result: Theoretical analysis shows conditions where PPI-style estimators have strictly smaller asymptotic variance than measurement-error corrections. Results are verified in simulations and demonstrated on real-data examples, with implementation provided in a GitHub repository.

Conclusion: The paper provides a unified theoretical framework for debiasing LLM-as-a-judge evaluations, offering guidance on when to use each approach and providing practical tools for implementation.

Abstract: Large language models (LLMs) are increasingly used as automatic evaluators of generative AI outputs, a paradigm often referred to as “LLM-as-a-judge.” In practice, LLM judges are imperfect predictors of the underlying truth and can exhibit systematic, non-random errors. Two main approaches have recently been proposed to address this issue: (i) direct measurement-error correction based on misclassification models such as Rogan-Gladen-style estimators, and (ii) surrogate-outcome approaches such as prediction-powered inference (PPI), which correct bias by calibrating prediction residuals on a small set of gold-standard human labels. In this paper, we systematically study the performance of these two approaches for estimating mean parameters (e.g., average benchmark scores or pairwise win rates). Leveraging tools from semiparametric efficiency theory, we unify the two classes of estimators by deriving explicit forms of efficient influence function (EIF)-based efficient estimators and characterize conditions under which PPI-style estimators attain strictly smaller asymptotic variance than measurement-error corrections. We verify our theoretical results in simulations and demonstrate the methods on real-data examples. We provide an implementation of the benchmarked methods and comparison utilities at https://github.com/yiqunchen/debias-llm-as-a-judge.

[268] Prediction of Fault Slip Tendency in CO${_2}$ Storage using Data-space Inversion

Xiaowen He, Su Jiang, Louis J. Durlofsky

Main category: cs.LG

TL;DR: VAE-based data-space inversion framework for predicting pressure, stress, strain, and fault slip tendency in CO2 storage projects without generating posterior geomodels.

DetailsMotivation: Conventional model-based history matching methods are challenging for coupled flow-geomechanics problems with faults. Need accurate fault slip assessment for subsurface operations like CO2 storage.

Method: Use variational autoencoder (VAE) with convolutional LSTM layers to represent pressure, strain, stress fields as latent variables. Apply data-space inversion (DSI) with prior simulation results from O(1000) geomodels and observed monitoring data.

Result: DSI-VAE framework provides accurate predictions for pressure, strain, stress fields and fault slip tendency. Reduces uncertainty in key geomechanical and fault parameters.

Conclusion: VAE-based DSI framework enables efficient posterior predictions for coupled flow-geomechanics problems without generating posterior geomodels, improving fault slip assessment in CO2 storage.

Abstract: Accurately assessing the potential for fault slip is essential in many subsurface operations. Conventional model-based history matching methods, which entail the generation of posterior geomodels calibrated to observed data, can be challenging to apply in coupled flow-geomechanics problems with faults. In this work, we implement a variational autoencoder (VAE)-based data-space inversion (DSI) framework to predict pressure, stress and strain fields, and fault slip tendency, in CO${_2}$ storage projects. The main computations required by the DSI workflow entail the simulation of O(1000) prior geomodels. The posterior distributions for quantities of interest are then inferred directly from prior simulation results and observed data, without the need to generate posterior geomodels. The model used here involves a synthetic 3D system with two faults. Realizations of heterogeneous permeability and porosity fields are generated using geostatistical software, and uncertain geomechanical and fault parameters are sampled for each realization from prior distributions. Coupled flow-geomechanics simulations for these geomodels are conducted using GEOS. A VAE with stacked convolutional long short-term memory layers is trained, using the prior simulation results, to represent pressure, strain, effective normal stress and shear stress fields in terms of latent variables. The VAE parameterization is used with DSI for posterior predictions, with monitoring wells providing observed pressure and strain data. Posterior results for synthetic true models demonstrate that the DSI-VAE framework gives accurate predictions for pressure, strain, and stress fields and for fault slip tendency. The framework is also shown to reduce uncertainty in key geomechanical and fault parameters.

[269] RingSQL: Generating Synthetic Data with Schema-Independent Templates for Text-to-SQL Reasoning Models

Marko Sterbentz, Kevin Cushing, Cameron Barrie, Kristian J. Hammond

Main category: cs.LG

TL;DR: RingSQL is a hybrid text-to-SQL data generation framework combining schema-independent query templates with LLM paraphrasing to create high-quality training data that preserves SQL correctness while providing linguistic variety.

DetailsMotivation: Progress in text-to-SQL systems is limited by scarcity of high-quality training data. Manual creation is expensive, and existing synthetic methods trade off reliability vs scalability - template-based approaches ensure correctness but require schema-specific templates, while LLM-based generation scales but lacks quality guarantees.

Method: RingSQL combines schema-independent query templates with LLM-based paraphrasing of natural language questions. This hybrid approach preserves SQL correctness across diverse schemas while providing broad linguistic variety through LLM paraphrasing.
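
A minimal sketch of the template half of the pipeline, with an invented schema, template, and slot names; the instantiated question would then be sent to an LLM for paraphrasing:

```python
import random

# A schema-independent template pairs a SQL skeleton with a canonical
# question; slots are filled from any schema, so the SQL is correct by
# construction, and an LLM paraphrases the question for linguistic variety.
TEMPLATE = {
    "sql": "SELECT {agg}({num_col}) FROM {table} WHERE {cat_col} = '{val}'",
    "question": "What is the {agg_word} {num_col} of {table} rows where {cat_col} is {val}?",
}
AGGS = {"AVG": "average", "MAX": "maximum", "MIN": "minimum"}

def instantiate(template, schema, rng):
    table = rng.choice(list(schema))
    num_col = rng.choice(schema[table]["numeric"])
    cat_col, values = rng.choice(list(schema[table]["categorical"].items()))
    agg, val = rng.choice(list(AGGS)), rng.choice(values)
    fills = dict(table=table, num_col=num_col, cat_col=cat_col,
                 val=val, agg=agg, agg_word=AGGS[agg])
    return template["sql"].format(**fills), template["question"].format(**fills)

schema = {"employees": {"numeric": ["salary", "age"],
                        "categorical": {"dept": ["sales", "hr"]}}}
sql, question = instantiate(TEMPLATE, schema, random.Random(7))
print(sql)       # executable against the sampled schema
print(question)  # would be handed to an LLM for paraphrasing
```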

Result: Models trained using RingSQL-generated data achieve average accuracy gain of +2.3% across six text-to-SQL benchmarks compared to models trained on other synthetic data.

Conclusion: RingSQL provides an effective hybrid solution for generating high-quality text-to-SQL training data that balances correctness and scalability, addressing limitations of existing synthetic data generation methods.

Abstract: Recent advances in text-to-SQL systems have been driven by larger models and improved datasets, yet progress is still limited by the scarcity of high-quality training data. Manual data creation is expensive, and existing synthetic methods trade off reliability and scalability. Template-based approaches ensure correct SQL but require schema-specific templates, while LLM-based generation scales easily but lacks quality and correctness guarantees. We introduce RingSQL, a hybrid data generation framework that combines schema-independent query templates with LLM-based paraphrasing of natural language questions. This approach preserves SQL correctness across diverse schemas while providing broad linguistic variety. In our experiments, we find that models trained using data produced by RingSQL achieve an average gain in accuracy of +2.3% across six text-to-SQL benchmarks when compared to models trained on other synthetic data. We make our code available at https://github.com/nu-c3lab/RingSQL.

[270] Efficient Differentiable Causal Discovery via Reliable Super-Structure Learning

Pingchuan Ma, Qixin Zhang, Shuai Wang, Dacheng Tao

Main category: cs.LG

TL;DR: ALVGL enhances differentiable causal discovery by learning a precision matrix decomposition to construct provably-correct super-structures that guide optimization, improving both accuracy and efficiency across various causal models.

DetailsMotivation: Differentiable causal discovery methods struggle with high-dimensional data and latent confounders due to vast search spaces, complex objectives, and graph constraints. Existing super-structure approaches face challenges in learning appropriate granularity efficiently across settings.

Method: ALVGL uses sparse and low-rank decomposition of the precision matrix, optimized via ADMM, to identify components relevant to causal structure. These components construct a super-structure that’s provably a superset of the true causal graph, which then initializes standard differentiable methods with focused search space.
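
As a rough stand-in for the decomposition step, the sketch below splits a matrix into low-rank plus sparse parts with a generic robust-PCA-style ADMM and reads a candidate super-structure off the sparse support; ALVGL's actual objective, tuning, and theoretical guarantees are not reproduced here:

```python
import numpy as np

def sparse_lowrank_split(M, lam=None, mu=None, iters=200):
    """Generic ADMM splitting M = L + S (low-rank plus sparse); a stand-in
    for ALVGL's precision-matrix decomposition, not the paper's algorithm."""
    n = M.shape[0]
    lam = lam or 1.0 / np.sqrt(n)
    mu = mu or n * n / (4.0 * np.abs(M).sum())
    S, Y = np.zeros_like(M), np.zeros_like(M)
    soft = lambda X, t: np.sign(X) * np.maximum(np.abs(X) - t, 0.0)
    for _ in range(iters):
        U, sig, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * np.maximum(sig - 1.0 / mu, 0.0)) @ Vt  # singular-value shrink
        S = soft(M - L + Y / mu, lam / mu)              # entrywise shrink
        Y = Y + mu * (M - L - S)
    return L, S

# Toy "precision matrix": a sparse graph part plus a rank-1 confounder term.
rng = np.random.default_rng(0)
sparse_true = np.diag(np.full(8, 2.0))
sparse_true[0, 3] = sparse_true[3, 0] = 0.8
u = rng.standard_normal((8, 1))
L_est, S_est = sparse_lowrank_split(sparse_true + 0.5 * u @ u.T)
super_structure = np.abs(S_est - np.diag(np.diag(S_est))) > 0.1
print(super_structure.astype(int))  # candidate edge set guiding the search
```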

Result: ALVGL achieves state-of-the-art accuracy and significantly improves optimization efficiency across Gaussian/non-Gaussian settings with/without unmeasured confounders, demonstrated on synthetic and real-world datasets.

Conclusion: ALVGL provides a reliable and effective enhancement to differentiable causal discovery pipelines by leveraging precision matrix decomposition to construct provably-correct super-structures that guide optimization, addressing key challenges in high-dimensional and latent confounder settings.

Abstract: Recently, differentiable causal discovery has emerged as a promising approach to improve the accuracy and efficiency of existing methods. However, when applied to high-dimensional data or data with latent confounders, these methods, often based on off-the-shelf continuous optimization algorithms, struggle with the vast search space, the complexity of the objective function, and the nontrivial nature of graph-theoretical constraints. As a result, there has been a surge of interest in leveraging super-structures to guide the optimization process. Nonetheless, learning an appropriate super-structure at the right level of granularity, and doing so efficiently across various settings, presents significant challenges. In this paper, we propose ALVGL, a novel and general enhancement to the differentiable causal discovery pipeline. ALVGL employs a sparse and low-rank decomposition to learn the precision matrix of the data. We design an ADMM procedure to optimize this decomposition, identifying components in the precision matrix that are most relevant to the underlying causal structure. These components are then combined to construct a super-structure that is provably a superset of the true causal graph. This super-structure is used to initialize a standard differentiable causal discovery method with a more focused search space, thereby improving both optimization efficiency and accuracy. We demonstrate the versatility of ALVGL by instantiating it across a range of structural causal models, including both Gaussian and non-Gaussian settings, with and without unmeasured confounders. Extensive experiments on synthetic and real-world datasets show that ALVGL not only achieves state-of-the-art accuracy but also significantly improves optimization efficiency, making it a reliable and effective solution for differentiable causal discovery.

[271] MaxCode: A Max-Reward Reinforcement Learning Framework for Automated Code Optimization

Jiefu Ou, Sapana Chaudhary, Kaj Bostrom, Nathaniel Weir, Shuai Zhang, Huzefa Rangwala, George Karypis

Main category: cs.LG

TL;DR: MaxCode is an inference-time search framework that uses reinforcement learning and natural language critiques to help LLMs iteratively optimize code performance through execution feedback.

DetailsMotivation: LLMs struggle with code optimization due to (1) complexity requiring systems/algorithm expertise, and (2) difficulty interpreting performance metrics beyond binary correctness.

Method: MaxCode unifies search methods under max-reward RL framework with modular observation/action-value functions. Integrates natural language critique model to convert execution feedback into diagnostic insights, and uses generative reward-to-go model to rerank solutions.
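
The search loop can be sketched at a high level as follows; every callable here is a hypothetical stub standing in for the LLM proposer, benchmark execution, critique model, and reward-to-go reranker:

```python
import random

def max_reward_search(task, propose, execute, critique, reward_to_go,
                      rounds=3, width=6):
    """Hedged sketch: propose candidates, rerank them with a learned
    reward-to-go model, execute the top-ranked ones for feedback, and fold
    a natural-language critique plus the best reward seen so far back into
    the observation for the next round."""
    best_code, best_reward, obs = None, 0.0, ""
    for _ in range(rounds):
        candidates = sorted((propose(task, obs) for _ in range(width)),
                            key=reward_to_go, reverse=True)
        for code in candidates[:2]:  # expand top-ranked candidates only
            ok, feedback, speedup = execute(code)
            if ok and speedup > best_reward:
                best_code, best_reward = code, speedup
            obs = f"{critique(feedback)}\nbest speedup so far: {best_reward:.2f}x"
    return best_code, best_reward

# Toy stubs so the loop runs end to end.
rng = random.Random(0)
propose = lambda task, obs: f"kernel_v{rng.random():.3f}"
execute = lambda code: (True, f"{code} compiled", float(code.split('v')[1]))
critique = lambda fb: f"diagnosis: {fb}"
reward_to_go = lambda code: float(code.split('v')[1])
print(max_reward_search("speed up matmul", propose, execute, critique, reward_to_go))
```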

Result: On KernelBench (CUDA) and PIE (C++) benchmarks, MaxCode achieves 20.3% relative improvement in absolute speedup value and 10.1% improvement in relative speedup ranking compared to baselines.

Conclusion: MaxCode effectively enhances LLM-based code optimization through inference-time search with execution feedback, natural language critiques, and improved exploration strategies.

Abstract: Large Language Models (LLMs) demonstrate strong capabilities in general coding tasks but encounter two key challenges when optimizing code: (i) the complexity of writing optimized code (such as performant CUDA kernels and competition-level CPU code) requires expertise in systems, algorithms and specific languages and (ii) it requires interpretation of performance metrics like timing and device utilization beyond binary correctness. In this work, we explore inference-time search algorithms that guide the LLM to discover better solutions through iterative refinement based on execution feedback. Our approach, called MaxCode unifies existing search methods under a max-reward reinforcement learning framework, making the observation and action-value functions modular for modification. To enhance the observation space, we integrate a natural language critique model that converts raw execution feedback into diagnostic insights about errors and performance bottlenecks, and the best-discounted reward seen so far. Together, these provide richer input to the code proposal function. To improve exploration during search, we train a generative reward-to-go model using action values from rollouts to rerank potential solutions. Testing on the KernelBench (CUDA) and PIE (C++) optimization benchmarks shows that MaxCode improves optimized code performance compared to baselines, achieving 20.3% and 10.1% relative improvements in absolute speedup value and relative speedup ranking, respectively.

[272] Hi-ZFO: Hierarchical Zeroth- and First-Order LLM Fine-Tuning via Importance-Guided Tensor Selection

Feihu Jin, Ying Tan

Main category: cs.LG

TL;DR: Hi-ZFO: A hierarchical hybrid optimization method for LLM fine-tuning that combines zeroth-order (ZO) and first-order (FO) optimization to balance exploration and precision.

DetailsMotivation: Standard FO optimization drives LLM fine-tuning toward sharp, poorly generalizing minima, while ZO methods offer better exploration but suffer from slow convergence and high variance in generative tasks with vast output spaces.

Method: Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while using ZO optimization for less sensitive layers. ZO serves as “beneficial stochasticity” to help escape local minima rather than just a memory-saving surrogate.
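
A toy sketch of the hybrid rule, using a two-point (SPSA-style) zeroth-order estimate for non-critical layers and exact gradients for critical ones; the importance-profiling procedure itself is assumed here, not shown:

```python
import numpy as np

def zo_grad(loss, theta, mu=1e-3, rng=None):
    """Two-point zeroth-order gradient estimate: perturb the parameters
    along one random direction and difference the losses."""
    rng = rng or np.random.default_rng(0)
    u = rng.standard_normal(theta.shape)
    return (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu) * u

def hybrid_step(layers, grads, importance, loss, lr=0.1, top_k=1):
    """The top-k most important layers get exact first-order updates; the
    rest get noisy ZO updates, which here play the role of the 'beneficial
    stochasticity' the summary describes."""
    critical = set(np.argsort(importance)[-top_k:])
    for i, theta in enumerate(layers):
        g = grads[i] if i in critical else zo_grad(lambda t: loss(i, t), theta)
        layers[i] = theta - lr * g
    return layers

# Toy per-layer quadratic losses; layer 1 is profiled as most important.
layers = [np.array([2.0, -1.0]), np.array([0.5, 0.5])]
loss = lambda i, t: float((t ** 2).sum())
grads = [2 * t for t in layers]  # exact FO gradients for this toy loss
print(hybrid_step(layers, grads, np.array([0.2, 0.9]), loss))
```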

Result: Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing training time compared to pure FO or ZO methods.

Conclusion: Hierarchical hybrid optimization combining ZO and FO methods is effective for LLM fine-tuning, balancing exploration and precision to achieve better generalization and efficiency.

Abstract: Fine-tuning large language models (LLMs) using standard first-order (FO) optimization often drives training toward sharp, poorly generalizing minima. Conversely, zeroth-order (ZO) methods offer stronger exploratory behavior without relying on explicit gradients, yet suffer from slow convergence. More critically, our analysis reveals that in generative tasks, the vast output and search space significantly amplify estimation variance, rendering ZO methods both noisy and inefficient. To address these challenges, we propose \textbf{Hi-ZFO} (\textbf{Hi}erarchical \textbf{Z}eroth- and \textbf{F}irst-\textbf{O}rder optimization), a hybrid framework designed to synergize the precision of FO gradients with the exploratory capability of ZO estimation. Hi-ZFO adaptively partitions the model through layer-wise importance profiling, applying precise FO updates to critical layers while leveraging ZO optimization for less sensitive ones. Notably, ZO in Hi-ZFO is not merely a memory-saving surrogate; it is intentionally introduced as a source of “beneficial stochasticity” to help the model escape the local minima where pure FO optimization tends to stagnate. Validated across diverse generative, mathematical, and code reasoning tasks, Hi-ZFO consistently achieves superior performance while significantly reducing the training time. These results demonstrate the effectiveness of hierarchical hybrid optimization for LLM fine-tuning.

[273] Over-Searching in Search-Augmented Large Language Models

Roy Xie, Deepak Gopinath, David Qiu, Dong Lin, Haitian Sun, Saloni Potdar, Bhuwan Dhingra

Main category: cs.LG

TL;DR: Search-augmented LLMs often over-search unnecessarily, harming efficiency and causing hallucinations. The paper systematically evaluates this problem, introduces TPC metric to measure performance-cost trade-off, and proposes mitigation approaches.

DetailsMotivation: Search-augmented LLMs frequently invoke search tools unnecessarily, leading to computational inefficiency and hallucinations from irrelevant retrieved context. There's a need to systematically understand and quantify this "over-searching" problem.

Method: Conducted systematic evaluation across multiple dimensions: query types, model categories, retrieval conditions, and multi-turn conversations. Introduced Tokens Per Correctness (TPC) metric to quantify over-searching. Investigated mitigation approaches at query and retrieval levels.
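
Under the natural reading of the metric (total tokens spent divided by the number of correct responses; the paper's exact normalization may differ), TPC is straightforward to compute:

```python
def tokens_per_correctness(records):
    """TPC sketch: total tokens spent, including search-related turns,
    divided by the number of correct responses; lower is better."""
    total_tokens = sum(r["tokens"] for r in records)
    n_correct = sum(r["correct"] for r in records)
    return total_tokens / max(n_correct, 1)

# Toy comparison: an over-searching system burns tokens at equal accuracy.
lean = [{"tokens": 120, "correct": True}, {"tokens": 90, "correct": False},
        {"tokens": 110, "correct": True}]
over = [{"tokens": 900, "correct": True}, {"tokens": 700, "correct": False},
        {"tokens": 850, "correct": True}]
print(tokens_per_correctness(lean))  # 160.0
print(tokens_per_correctness(over))  # 1225.0
```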

Result: Search improves accuracy on answerable queries but harms abstention on unanswerable ones. Over-searching is more pronounced in complex reasoning models and deep research systems, exacerbated by noisy retrieval, and compounds across multi-turn conversations. Negative evidence improves abstention.

Conclusion: Over-searching is a significant problem in search-augmented LLMs that requires systematic study. The TPC metric effectively captures performance-cost trade-offs. Mitigation approaches at query and retrieval levels can help, and the released OverSearchQA dataset will foster further research.

Abstract: Search-augmented large language models (LLMs) excel at knowledge-intensive tasks by integrating external retrieval. However, they often over-search – unnecessarily invoking the search tool even when it does not improve response quality, which leads to computational inefficiency and hallucinations by incorporating irrelevant context. In this work, we conduct a systematic evaluation of over-searching across multiple dimensions, including query types, model categories, retrieval conditions, and multi-turn conversations. Our findings show: (i) search generally improves answer accuracy on answerable queries but harms abstention on unanswerable ones; (ii) over-searching is more pronounced in complex reasoning models and deep research systems, is exacerbated by noisy retrieval, and compounds across turns in multi-turn conversations; and (iii) the composition of retrieved evidence is crucial, as the presence of negative evidence improves abstention. To quantify over-searching, we introduce Tokens Per Correctness (TPC), an evaluation metric that captures the performance-cost trade-off for search-augmented LLMs. Lastly, we investigate mitigation approaches at both the query and retrieval levels and release the OverSearchQA dataset to foster continued research into efficient search-augmented LLMs.

[274] Toward an Integrated Cross-Urban Accident Prevention System: A Multi-Task Spatial-Temporal Learning Framework for Urban Safety Management

Jiayu Fang, Zhiqi Shao, Haoning Xi, Boris Choy, Junbin Gao

Main category: cs.LG

TL;DR: MLA-STNet: A unified cross-city accident prediction system using multi-task learning with Mamba attention modules to handle heterogeneous urban data and improve prediction accuracy.

DetailsMotivation: Cross-city accident prevention is challenging due to heterogeneous, inconsistent, noisy urban accident data with fragmented governance and incompatible reporting standards, hindering integrated prevention frameworks.

Method: Proposes MLA-STNet with two modules: STG-MA (suppresses spatio-temporal fluctuations, strengthens long-range dependencies) and STS-MA (mitigates cross-city heterogeneity via shared-parameter design while preserving individual semantic spaces). Formulates as multi-task learning across cities.

Result: Achieves up to 6% lower RMSE, 8% higher Recall, 5% higher MAP than SOTA baselines, with <1% performance variation under 50% input noise. Validated on NYC and Chicago datasets across 75 experiments for full-day and high-frequency periods.

Conclusion: MLA-STNet effectively unifies heterogeneous urban datasets into a scalable, robust, interpretable Cross-City Accident Prevention System, enabling coordinated data-driven urban safety management.

Abstract: The development of a cross-city accident prevention system is particularly challenging due to the heterogeneity, inconsistent reporting, and inherently clustered, sparse, cyclical, and noisy nature of urban accident data. These intrinsic data properties, combined with fragmented governance and incompatible reporting standards, have long hindered the creation of an integrated, cross-city accident prevention framework. To address this gap, we propose the Mamba Local-Attention Spatial-Temporal Network (MLA-STNet), a unified system that formulates accident risk prediction as a multi-task learning problem across multiple cities. MLA-STNet integrates two complementary modules: (i) the Spatio-Temporal Geographical Mamba-Attention (STG-MA), which suppresses unstable spatio-temporal fluctuations and strengthens long-range temporal dependencies; and (ii) the Spatio-Temporal Semantic Mamba-Attention (STS-MA), which mitigates cross-city heterogeneity through a shared-parameter design that jointly trains all cities while preserving individual semantic representation spaces. We validate the proposed framework through 75 experiments under two forecasting scenarios, full-day and high-frequency accident periods, using real-world datasets from New York City and Chicago. Compared with the state-of-the-art baselines, MLA-STNet achieves up to 6% lower RMSE, 8% higher Recall, and 5% higher MAP, while maintaining less than 1% performance variation under 50% input noise. These results demonstrate that MLA-STNet effectively unifies heterogeneous urban datasets within a scalable, robust, and interpretable Cross-City Accident Prevention System, paving the way for coordinated and data-driven urban safety management.

[275] DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis

Rui An, Haohao Qu, Wenqi Fan, Xuequn Shang, Qing Li

Main category: cs.LG

TL;DR: DeMa is a dual-path delay-aware Mamba backbone that improves MTS analysis by addressing vanilla Mamba’s limitations through decomposition of temporal dynamics and cross-variate interactions with linear complexity.

DetailsMotivation: Transformers dominate MTS analysis but suffer from quadratic complexity, while vanilla Mamba lacks explicit cross-variate modeling, struggles with disentangling temporal dynamics from inter-series interactions, and insufficiently models time-lag effects.

Method: DeMa decomposes MTS into intra-series temporal dynamics and inter-series interactions, using a temporal path with Mamba-SSD for long-range dynamics and a variate path with Mamba-DALA integrating delay-aware linear attention for cross-variate dependencies.

Result: DeMa achieves state-of-the-art performance across five MTS tasks (long/short-term forecasting, imputation, anomaly detection, classification) while maintaining remarkable computational efficiency with linear complexity.

Conclusion: DeMa successfully addresses vanilla Mamba’s limitations for MTS analysis, providing an efficient linear-complexity backbone that outperforms existing methods across diverse tasks while preserving computational advantages.

Abstract: Accurate and efficient multivariate time series (MTS) analysis is increasingly critical for a wide range of intelligent applications. Within this realm, Transformers have emerged as the predominant architecture due to their strong ability to capture pairwise dependencies. However, Transformer-based models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment in long-term and large-scale MTS modeling. Recently, Mamba has emerged as a promising linear-time alternative with high expressiveness. Nevertheless, directly applying vanilla Mamba to MTS remains suboptimal due to three key limitations: (i) the lack of explicit cross-variate modeling, (ii) difficulty in disentangling the entangled intra-series temporal dynamics and inter-series interactions, and (iii) insufficient modeling of latent time-lag interaction effects. These issues constrain its effectiveness across diverse MTS tasks. To address these challenges, we propose DeMa, a dual-path delay-aware Mamba backbone. DeMa preserves Mamba’s linear-complexity advantage while substantially improving its suitability for MTS settings. Specifically, DeMa introduces three key innovations: (i) it decomposes the MTS into intra-series temporal dynamics and inter-series interactions; (ii) it develops a temporal path with a Mamba-SSD module to capture long-range dynamics within each individual series, enabling series-independent, parallel computation; and (iii) it designs a variate path with a Mamba-DALA module that integrates delay-aware linear attention to model cross-variate dependencies. Extensive experiments on five representative tasks, long- and short-term forecasting, data imputation, anomaly detection, and series classification, demonstrate that DeMa achieves state-of-the-art performance while delivering remarkable computational efficiency.

[276] Scalable Heterogeneous Graph Learning via Heterogeneous-aware Orthogonal Prototype Experts

Wei Zhou, Hong Huang, Ruize Shi, Bang Liu

Main category: cs.LG

TL;DR: HOPE framework replaces standard linear prediction heads in HGNNs with heterogeneous-aware orthogonal prototype experts to address the linear projection bottleneck caused by contextual diversity and long-tail shifts in heterogeneous graphs.

DetailsMotivation: HGNNs have advanced through better encoders, but their decoding stage still uses a single shared linear head, creating a "Linear Projection Bottleneck" where contextual diversity and long-tail shifts cause the global head to miss fine semantics, overfit hub nodes, and underserve tail nodes.

Method: HOPE (Heterogeneous-aware Orthogonal Prototype Experts) framework uses learnable prototype-based routing to assign instances to experts by similarity, allowing expert usage to follow natural long-tail distribution, and adds expert orthogonalization to encourage diversity and prevent collapse.
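
A minimal sketch of a prototype-routed expert head with an orthogonality penalty; the dimensions, hard argmax routing, and penalty weight are illustrative assumptions rather than HOPE's exact design:

```python
import torch
import torch.nn.functional as F

class PrototypeExperts(torch.nn.Module):
    """Sketch of a plug-in prediction head: route each instance to the
    expert whose learnable prototype is most similar, and keep prototypes
    (hence experts) diverse with an orthogonality penalty."""
    def __init__(self, dim, n_experts, n_classes):
        super().__init__()
        self.prototypes = torch.nn.Parameter(torch.randn(n_experts, dim))
        self.experts = torch.nn.ModuleList(
            torch.nn.Linear(dim, n_classes) for _ in range(n_experts))

    def forward(self, h):
        sim = F.normalize(h, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        route = sim.argmax(dim=-1)  # similarity-based hard routing
        logits = torch.stack([self.experts[int(r)](x) for x, r in zip(h, route)])
        return logits, route

    def ortho_penalty(self):
        p = F.normalize(self.prototypes, dim=-1)
        off = p @ p.T - torch.eye(p.shape[0])  # push prototypes apart
        return (off ** 2).sum()

head = PrototypeExperts(dim=16, n_experts=4, n_classes=3)
h = torch.randn(8, 16)  # node embeddings from any HGNN encoder
logits, route = head(h)
loss = F.cross_entropy(logits, torch.randint(3, (8,))) + 0.1 * head.ortho_penalty()
print(logits.shape, route.tolist())
```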

Result: Experiments on four real datasets show consistent gains across state-of-the-art HGNN backbones with minimal overhead.

Conclusion: HOPE provides a plug-and-play replacement for standard prediction heads that effectively addresses the linear projection bottleneck in heterogeneous graph neural networks by better handling contextual diversity and long-tail distribution challenges.

Abstract: Heterogeneous Graph Neural Networks (HGNNs) have advanced mainly through better encoders, yet their decoding/projection stage still relies on a single shared linear head, assuming it can map rich node embeddings to labels. We call this the Linear Projection Bottleneck: in heterogeneous graphs, contextual diversity and long-tail shifts make a global head miss fine semantics, overfit hub nodes, and underserve tail nodes. While Mixture-of-Experts (MoE) could help, naively applying it clashes with structural imbalance and risks expert collapse. We propose a Heterogeneous-aware Orthogonal Prototype Experts framework named HOPE, a plug-and-play replacement for the standard prediction head. HOPE uses learnable prototype-based routing to assign instances to experts by similarity, letting expert usage follow the natural long-tail distribution, and adds expert orthogonalization to encourage diversity and prevent collapse. Experiments on four real datasets show consistent gains across SOTA HGNN backbones with minimal overhead.

[277] Buffered AUC maximization for scoring systems via mixed-integer optimization

Moe Shiina, Shunnosuke Ikeda, Yuichi Takano

Main category: cs.LG

TL;DR: The paper presents a mixed-integer optimization framework to build scoring systems that directly maximize AUC, creating interpretable classifiers with integer coefficients.

DetailsMotivation: Scoring systems are highly interpretable linear classifiers, but previous MIO approaches haven't focused on directly maximizing AUC, which is crucial for evaluating scoring systems.

Method: Developed a mixed-integer linear optimization (MILO) framework that maximizes buffered AUC (bAUC) as a tight concave lower bound on AUC, with group sparsity constraints to limit the number of variables.
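
For intuition, bAUC can be evaluated by a one-dimensional convex minimization via buffered probability of exceedance, a formulation taken from the bPOE literature rather than from this paper; the paper instead embeds the objective in a MILO with integer coefficients and a group sparsity constraint, which this sketch omits:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def buffered_auc(scores_pos, scores_neg):
    """Sketch: bAUC = 1 - min_{a>=0} E[(a*(s_neg - s_pos) + 1)^+] over all
    positive/negative pairs, a concave lower bound on AUC (assumed from
    the bPOE literature, not this paper's exact MILO model)."""
    diffs = (scores_neg[:, None] - scores_pos[None, :]).ravel()
    obj = lambda a: np.maximum(a * diffs + 1.0, 0.0).mean()
    res = minimize_scalar(obj, bounds=(0.0, 1e4), method="bounded")
    return 1.0 - res.fun

rng = np.random.default_rng(1)
pos = rng.normal(1.0, 1.0, 200)   # classifier scores of positive examples
neg = rng.normal(-1.0, 1.0, 200)  # classifier scores of negative examples
auc = (neg[:, None] < pos[None, :]).mean()
print(f"AUC={auc:.3f}  bAUC={buffered_auc(pos, neg):.3f}")  # bAUC <= AUC
```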

Result: Computational experiments on real-world datasets show the MILO method builds scoring systems with superior AUC values compared to baseline methods using regularization and stepwise regression.

Conclusion: The research advances MIO techniques for developing highly interpretable classification models by directly optimizing AUC in scoring system construction.

Abstract: A scoring system is a linear classifier composed of a small number of explanatory variables, each assigned a small integer coefficient. This system is highly interpretable and allows predictions to be made with simple manual calculations without the need for a calculator. Several previous studies have used mixed-integer optimization (MIO) techniques to develop scoring systems for binary classification; however, they have not focused on directly maximizing AUC (i.e., area under the receiver operating characteristic curve), even though AUC is recognized as an essential evaluation metric for scoring systems. Our goal herein is to establish an effective MIO framework for constructing scoring systems that directly maximize the buffered AUC (bAUC) as the tightest concave lower bound on AUC. Our optimization model is formulated as a mixed-integer linear optimization (MILO) problem that maximizes bAUC subject to a group sparsity constraint for limiting the number of questions in the scoring system. Computational experiments using publicly available real-world datasets demonstrate that our MILO method can build scoring systems with superior AUC values compared to the baseline methods based on regularization and stepwise regression. This research contributes to the advancement of MIO techniques for developing highly interpretable classification models.

[278] Learn to Evolve: Self-supervised Neural JKO Operator for Wasserstein Gradient Flow

Xue Feng, Li Wang, Deanna Needell, Rongjie Lai

Main category: cs.LG

TL;DR: A self-supervised method for learning JKO solution operators without solving JKO subproblems, using a Learn-to-Evolve algorithm that jointly learns operators and trajectories through alternating updates.

DetailsMotivation: The JKO scheme is computationally expensive due to repeatedly solving subproblems. There's a need for efficient methods to compute Wasserstein gradient flows without this computational burden.

Method: Proposes a Learn-to-Evolve algorithm that learns a JKO solution operator mapping input densities to subproblem minimizers. The method alternates between trajectory generation and operator updates, using generated data to approximate true JKO trajectories as training progresses.

Result: Numerical experiments show the method achieves accuracy, stability, and robustness across various energies and initial conditions. The approach enhances generalization through data augmentation from generated trajectories.

Conclusion: The self-supervised Learn-to-Evolve algorithm provides an efficient alternative to traditional JKO computations, overcoming computational limitations while maintaining accuracy and generalization capabilities.

Abstract: The Jordan-Kinderlehrer-Otto (JKO) scheme provides a stable variational framework for computing Wasserstein gradient flows, but its practical use is often limited by the high computational cost of repeatedly solving the JKO subproblems. We propose a self-supervised approach for learning a JKO solution operator without requiring numerical solutions of any JKO trajectories. The learned operator maps an input density directly to the minimizer of the corresponding JKO subproblem, and can be iteratively applied to efficiently generate the gradient-flow evolution. A key challenge is that only a number of initial densities are typically available for training. To address this, we introduce a Learn-to-Evolve algorithm that jointly learns the JKO operator and its induced trajectories by alternating between trajectory generation and operator updates. As training progresses, the generated data increasingly approximates true JKO trajectories. Meanwhile, this Learn-to-Evolve strategy serves as a natural form of data augmentation, significantly enhancing the generalization ability of the learned operator. Numerical experiments demonstrate the accuracy, stability, and robustness of the proposed method across various choices of energies and initial conditions.

[279] Poisson Hyperplane Processes with Rectified Linear Units

Shufei Ge, Shijia Wang, Lloyd Elliott

Main category: cs.LG

TL;DR: The paper establishes a connection between Poisson hyperplane processes (PHP) and two-layer ReLU neural networks, showing PHP with Gaussian prior provides an alternative probabilistic representation to ReLU networks.

DetailsMotivation: To provide a probabilistic foundation for ReLU neural networks by connecting them with Poisson hyperplane processes, enabling scalable Bayesian inference for large-scale problems.

Method: Established theoretical connection between PHP and ReLU networks, showed PHP with Gaussian prior as alternative representation, developed decomposition propositions for scalability, and proposed annealed sequential Monte Carlo algorithm for Bayesian inference.
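
The correspondence can be illustrated by sampling a function from a PHP-style prior: a Poisson number of random hyperplanes with Gaussian output weights yields a random-width two-layer ReLU network (the intensity and priors below are illustrative choices, not the paper's):

```python
import numpy as np

def php_relu_sample(x, rate=50, box=3.0, rng=None):
    """Sample hyperplanes from a Poisson process (Poisson count, random unit
    normals, uniform offsets) and sum Gaussian-weighted ReLU ridge functions:
    structurally a random-width two-layer ReLU network."""
    rng = rng or np.random.default_rng()
    n_planes = rng.poisson(rate)                   # Poisson number of hyperplanes
    w = rng.standard_normal((n_planes, x.shape[-1]))
    w /= np.linalg.norm(w, axis=1, keepdims=True)  # unit normal directions
    b = rng.uniform(-box, box, n_planes)           # offsets within a window
    c = rng.standard_normal(n_planes)              # Gaussian output weights
    return np.maximum(x @ w.T + b, 0.0) @ c        # two-layer ReLU readout

x = np.linspace(-2, 2, 5).reshape(-1, 1)           # 1-D inputs
print(php_relu_sample(x, rng=np.random.default_rng(3)))
```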

Result: Numerical experiments demonstrate the proposed PHP-based method outperforms classic two-layer ReLU neural networks.

Conclusion: PHP provides a probabilistic framework for ReLU networks that enables scalable Bayesian inference and shows improved performance over traditional approaches.

Abstract: Neural networks have shown state-of-the-art performance in various classification and regression tasks. Rectified linear units (ReLU) are often used as activation functions for the hidden layers in a neural network model. In this article, we establish the connection between Poisson hyperplane processes (PHP) and two-layer ReLU neural networks. We show that the PHP with a Gaussian prior is an alternative probabilistic representation to a two-layer ReLU neural network. In addition, we show that a two-layer neural network constructed by PHP is scalable to large-scale problems via the decomposition propositions. Finally, we propose an annealed sequential Monte Carlo algorithm for Bayesian inference. Our numerical experiments demonstrate that our proposed method outperforms the classic two-layer ReLU neural network. The implementation of our proposed model is available at https://github.com/ShufeiGe/Pois_Relu.git.

[280] PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning

Jingcheng Hu, Yinmin Zhang, Shijie Shang, Xiaobo Yang, Yue Peng, Zhewei Huang, Hebin Zhou, Xin Wu, Jie Cheng, Fanqi Wan, Xiangwen Kong, Chengyuan Yao, Kaiwen Yan, Ailin Huang, Hongyu Zhou, Qi Han, Zheng Ge, Daxin Jiang, Xiangyu Zhang, Heung-Yeung Shum

Main category: cs.LG

TL;DR: PaCoRe enables language models to scale test-time compute massively through parallel coordinated reasoning, achieving state-of-the-art math performance with 8B parameters.

DetailsMotivation: Current language models are limited by their inability to scale test-time compute beyond sequential reasoning under fixed context windows, restricting their reasoning capabilities.

Method: Parallel Coordinated Reasoning (PaCoRe) uses a message-passing architecture with multiple rounds: launching parallel reasoning trajectories, compacting findings into context-bounded messages, and synthesizing messages to guide subsequent rounds and produce final answers. Trained end-to-end with large-scale outcome-based reinforcement learning.
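
The round structure can be sketched as follows, with llm() as a hypothetical stub for any chat-completion call; the prompts, message format, and budgets are all assumptions:

```python
def pacore(question, llm, rounds=3, width=16, msg_budget=200):
    """Sketch of the rounds: launch parallel trajectories, compact each into
    a context-bounded message, and feed the messages into the next round
    before a final synthesis step."""
    messages = []
    for _ in range(rounds):
        context = "\n".join(messages)[-4 * msg_budget:]  # keep context bounded
        trajectories = [llm(f"Solve: {question}\nNotes so far:\n{context}")
                        for _ in range(width)]           # parallel exploration
        messages = [llm(f"Compress to <= {msg_budget} chars the key finding of:\n{t}")
                    for t in trajectories]               # compact into messages
    return llm(f"Synthesize a final answer to: {question}\nFindings:\n"
               + "\n".join(messages))

# Toy stub so the loop runs without a model.
stub = lambda prompt: f"[{len(prompt)} chars processed]"
print(pacore("What is 37 * 41?", stub))
```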

Result: Achieves strong improvements across diverse domains. Notably, an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5’s 93.2% by scaling effective test-time compute to roughly two million tokens without exceeding context limits.

Conclusion: PaCoRe successfully overcomes the test-time compute scaling limitation of language models through parallel coordinated reasoning, enabling multi-million-token effective compute while maintaining context bounds, with open-source release to accelerate follow-up work.

Abstract: We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains, and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5’s 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.

[281] Good Allocations from Bad Estimates

Sílvia Casacuberta, Moritz Hardt

Main category: cs.LG

TL;DR: The paper shows that treatment allocation requires far fewer samples than treatment effect estimation - O(M/ε) vs O(M/ε²) - by using coarse estimates for near-optimal allocations.

DetailsMotivation: CATE estimation is the gold standard for targeting treatments but requires O(M/ε²) samples, which is expensive. The authors aim to show that treatment allocation (deciding who gets treatment) can be done with far fewer samples than full treatment effect estimation.

Method: The authors develop an algorithm that uses coarse estimates of treatment effects rather than precise estimates. They leverage the insight that for allocation decisions, approximate rankings suffice. They also incorporate budget flexibility to further reduce sample complexity.

Result: The algorithm achieves the same total treatment effect as CATE with only O(M/ε) samples for natural distributions, a quadratic improvement. Budget flexibility further reduces sample complexity. Real-world RCT evaluations show the algorithm finds nearly optimal allocations with surprisingly few samples.

Conclusion: Treatment allocation requires far fewer samples than treatment effect estimation. Coarse estimates suffice for near-optimal allocations, highlighting a fundamental distinction between estimation and allocation tasks in causal inference.

Abstract: Conditional average treatment effect (CATE) estimation is the de facto gold standard for targeting a treatment to a heterogeneous population. The method estimates treatment effects up to an error $ε> 0$ in each of $M$ different strata of the population, targeting individuals in decreasing order of estimated treatment effect until the budget runs out. In general, this method requires $O(M/ε^2)$ samples. This is best possible if the goal is to estimate all treatment effects up to an $ε$ error. In this work, we show how to achieve the same total treatment effect as CATE with only $O(M/ε)$ samples for natural distributions of treatment effects. The key insight is that coarse estimates suffice for near-optimal treatment allocations. In addition, we show that budget flexibility can further reduce the sample complexity of allocation. Finally, we evaluate our algorithm on various real-world RCT datasets. In all cases, it finds nearly optimal treatment allocations with surprisingly few samples. Our work highlights the fundamental distinction between treatment effect estimation and treatment allocation: the latter requires far fewer samples.

[282] Orchestrating Tokens and Sequences: Dynamic Hybrid Policy Optimization for RLVR

Zijun Min, Bingshuai Liu, Ante Wang, Long Zhang, Anxiang Zeng, Haibo Zhang, Jinsong Su

Main category: cs.LG

TL;DR: DHPO combines token-level and sequence-level policy optimization to improve RL for language models in reasoning tasks, outperforming existing methods on math benchmarks.

DetailsMotivation: Existing RLVR algorithms have complementary limitations: GRPO preserves fine-grained credit assignment but suffers from high variance, while GSPO matches sequence-level rewards better but sacrifices token-wise credit assignment.

Method: Proposes Dynamic Hybrid Policy Optimization (DHPO) that bridges GRPO and GSPO within a single clipped surrogate objective. Uses weighting mechanisms to combine token-level and sequence-level importance ratios, with two mixing variants (averaged and entropy-guided). Employs branch-specific clipping to constrain ratios within separate trust regions before mixing.
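
A minimal sketch of the averaged-mixing variant for a single response; the clipping constants, the entropy-guided variant, and the exact surrogate details are simplifications assumed from the summary:

```python
import torch

def dhpo_surrogate(logp_new, logp_old, adv, alpha=0.5, eps_tok=0.2, eps_seq=0.1):
    """Sketch: clip the token-level ratio (GRPO-style) and the
    length-normalised sequence-level ratio (GSPO-style) in separate trust
    regions, then mix them with weight alpha.

    logp_new, logp_old: (T,) per-token log-probs of one response
    adv: scalar sequence advantage broadcast to all tokens
    """
    log_ratio = logp_new - logp_old.detach()
    r_tok = log_ratio.exp()                          # token-level ratios
    r_seq = log_ratio.mean().exp().expand_as(r_tok)  # sequence-level ratio
    r_tok = r_tok.clamp(1 - eps_tok, 1 + eps_tok)    # branch-specific clipping
    r_seq = r_seq.clamp(1 - eps_seq, 1 + eps_seq)
    r_mix = alpha * r_tok + (1 - alpha) * r_seq      # averaged mixing variant
    return -(r_mix * adv).mean()                     # minimise negative surrogate

logp_old = torch.log_softmax(torch.randn(6, 10), -1)[range(6), 0]
logp_new = (logp_old + 0.05 * torch.randn(6)).requires_grad_()
loss = dhpo_surrogate(logp_new, logp_old, adv=torch.tensor(1.0))
loss.backward()
print(float(loss), logp_new.grad.shape)
```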

Result: Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO.

Conclusion: DHPO effectively combines the strengths of both token-level and sequence-level policy optimization approaches, providing a more stable and effective method for RL with verifiable rewards in language model reasoning tasks.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) offers a promising framework for optimizing large language models in reasoning tasks. However, existing RLVR algorithms focus on different granularities, and each has complementary strengths and limitations. Group Relative Policy Optimization (GRPO) updates the policy with token-level importance ratios, which preserves fine-grained credit assignment but often suffers from high variance and instability. In contrast, Group Sequence Policy Optimization (GSPO) applies single sequence-level importance ratios across all tokens in a response that better matches sequence-level rewards, but sacrifices token-wise credit assignment. In this paper, we propose Dynamic Hybrid Policy Optimization (DHPO) to bridge GRPO and GSPO within a single clipped surrogate objective. DHPO combines token-level and sequence-level importance ratios using weighting mechanisms. We explore two variants of the mixing mechanism, including an averaged mixing and an entropy-guided mixing. To further stabilize training, we employ a branch-specific clipping strategy that constrains token-level and sequence-level ratios within separate trust regions before mixing, preventing outliers in either branch from dominating the update. Across seven challenging mathematical reasoning benchmarks, experiments on both dense and MoE models from the Qwen3 series show that DHPO consistently outperforms GRPO and GSPO. We will release our code upon acceptance of this paper.

[283] PiXTime: A Model for Federated Time Series Forecasting with Heterogeneous Data Structures Across Nodes

Yiming Zhou, Mingyue Cheng, Hao Wang, Enhong Chen

Main category: cs.LG

TL;DR: PiXTime is a federated time series forecasting model that handles multi-granularity and heterogeneous variable sets across nodes using personalized patch embeddings and a global variable embedding table.

DetailsMotivation: Time series data is valuable but rarely shareable across nodes due to privacy concerns. Federated learning is promising but faces challenges with diverse sampling standards leading to different time granularities and variable sets across nodes, which hinders classical federated learning approaches.

Method: PiXTime uses personalized Patch Embedding to map node-specific granularity time series into unified token sequences, a global VE Table to align variable category semantics across nodes, and a transformer-based shared model with cross-attention to enhance target series prediction while handling arbitrary numbers of variables.
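
The personalization idea can be sketched as one linear patch embedding per node, sized to that node's sampling granularity but projecting into a shared token dimension (all sizes below are illustrative):

```python
import torch

class PatchEmbed(torch.nn.Module):
    """Per-node patch embedding sketch: each node has its own patch length
    matched to its granularity; patches are projected to a shared d_model so
    one shared transformer can process all nodes."""
    def __init__(self, patch_len, d_model):
        super().__init__()
        self.patch_len = patch_len
        self.proj = torch.nn.Linear(patch_len, d_model)

    def forward(self, x):  # x: (T,) univariate series at this node
        patches = x.unfold(0, self.patch_len, self.patch_len)  # (n_patch, len)
        return self.proj(patches)  # (n_patch, d_model) unified tokens

hourly = PatchEmbed(patch_len=24, d_model=32)  # node sampled hourly
daily = PatchEmbed(patch_len=7, d_model=32)    # node sampled daily
print(hourly(torch.randn(24 * 14)).shape)      # 14 day-level tokens
print(daily(torch.randn(7 * 14)).shape)        # 14 week-level tokens
```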

Result: Experiments show PiXTime achieves state-of-the-art performance in federated settings and demonstrates superior performance on eight widely used real-world traditional benchmarks.

Conclusion: PiXTime effectively addresses the challenges of federated time series forecasting with heterogeneous granularities and variable sets, enabling practical deployment across distributed nodes while maintaining privacy.

Abstract: Time series are highly valuable and rarely shareable across nodes, making federated learning a promising paradigm to leverage distributed temporal data. However, different sampling standards lead to diverse time granularities and variable sets across nodes, hindering classical federated learning. We propose PiXTime, a novel time series forecasting model designed for federated learning that enables effective prediction across nodes with multi-granularity and heterogeneous variable sets. PiXTime employs a personalized Patch Embedding to map node-specific granularity time series into token sequences of a unified dimension for processing by a subsequent shared model, and uses a global VE Table to align variable category semantics across nodes, thereby enhancing cross-node transferability. With a transformer-based shared model, PiXTime captures representations of auxiliary series with arbitrary numbers of variables and uses cross-attention to enhance the prediction of the target series. Experiments show PiXTime achieves state-of-the-art performance in federated settings and demonstrates superior performance on eight widely used real-world traditional benchmarks.

[284] Dual-Phase LLM Reasoning: Self-Evolved Mathematical Frameworks

ShaoZhen Liu, Xinting Huang, Houwen Peng, Xin Chen, Xinyang Song, Qi Li, Zhenan Sun

Main category: cs.LG

TL;DR: A two-stage SFT framework using self-generated long CoT data with verification and difficulty-aware sampling improves LLMs’ mathematical reasoning without RL.

DetailsMotivation: Existing research on LLMs for mathematical reasoning heavily relies on RL frameworks while overlooking supervised fine-tuning methods, despite SFT's potential for resource-efficient optimization of complex reasoning capabilities.

Method: Two-stage framework: 1) Multi-turn dialogue generates CoT data with verification, backtracking, subgoal decomposition, and backward reasoning, filtered by predefined rules for SFT; 2) Difficulty-aware rejection sampling dynamically optimizes data distribution to handle complex problems.
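
One plausible reading of the second-stage sampler, sketched with invented difficulty buckets and a target mix; the paper's actual acceptance rule is not specified in the summary:

```python
import random

def difficulty_aware_sample(pool, target_mix, n, rng):
    """Sketch: keep a candidate with probability proportional to how
    under-represented its difficulty bucket currently is relative to a
    target mix, so hard problems are not crowded out."""
    counts = {b: 0 for b in target_mix}
    kept = []
    while len(kept) < n and pool:
        ex = pool.pop(rng.randrange(len(pool)))
        want = target_mix[ex["difficulty"]] * n
        accept_p = max(0.0, 1.0 - counts[ex["difficulty"]] / max(want, 1e-9))
        if rng.random() < accept_p:
            kept.append(ex)
            counts[ex["difficulty"]] += 1
    return kept

rng = random.Random(0)
pool = [{"difficulty": rng.choice(["easy", "hard"]), "id": i} for i in range(200)]
batch = difficulty_aware_sample(pool, {"easy": 0.3, "hard": 0.7}, 50, rng)
print({d: sum(e["difficulty"] == d for e in batch) for d in ("easy", "hard")})
```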

Result: Generates reasoning chains 4x longer, improves performance on GSM8K and MATH500 benchmarks, and achieves substantial improvement on competition-level AIME24 problems, demonstrating SFT effectively activates intrinsic reasoning capabilities.

Conclusion: SFT with self-generated long CoT data provides a resource-efficient pathway for enhancing LLMs’ mathematical reasoning, proving SFT can effectively activate models’ intrinsic capabilities without relying on RL frameworks.

Abstract: In recent years, large language models (LLMs) have demonstrated significant potential in complex reasoning tasks like mathematical problem-solving. However, existing research predominantly relies on reinforcement learning (RL) frameworks while overlooking supervised fine-tuning (SFT) methods. This paper proposes a new two-stage training framework that enhances models’ self-correction capabilities through self-generated long chain-of-thought (CoT) data. During the first stage, a multi-turn dialogue strategy guides the model to generate CoT data incorporating verification, backtracking, subgoal decomposition, and backward reasoning, with predefined rules filtering high-quality samples for supervised fine-tuning. The second stage employs a difficulty-aware rejection sampling mechanism to dynamically optimize data distribution, strengthening the model’s ability to handle complex problems. The approach generates reasoning chains over 4 times longer while maintaining strong scalability, proving that SFT effectively activates models’ intrinsic reasoning capabilities and provides a resource-efficient pathway for complex task optimization. Experimental results demonstrate performance improvements on mathematical benchmarks including GSM8K and MATH500, with the fine-tuned model achieving a substantial improvement on competition-level problems like AIME24. Code will be open-sourced.

[285] Continual Learning of Achieving Forgetting-free and Positive Knowledge Transfer

Zhi Wang, Zhongbin Wu, Yanni Li, Bing Liu, Guangxi Li, Yuping Wang

Main category: cs.LG

TL;DR: ETCL is a novel continual learning method that achieves forgetting-free learning with positive forward and backward knowledge transfer through task-specific masks, gradient alignment, and bi-objective optimization.

DetailsMotivation: Existing continual learning research focuses mainly on overcoming catastrophic forgetting, but an ideal CL agent should also enable positive knowledge transfer in both forward (using previous knowledge for new tasks) and backward (improving previous tasks with new knowledge) directions.

Method: ETCL models CL as an optimization problem with positive KT constraints, uses task-specific binary masks to isolate sparse sub-networks, aligns new task gradients with previous similar tasks for positive FKT, and employs bi-objective optimization with orthogonal gradient projection for positive BKT.
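
The orthogonal-projection ingredient can be sketched directly; how the protected directions are collected (e.g., from stored activations or gradients of previous tasks) is assumed here, not taken from the summary:

```python
import numpy as np

def project_orthogonal(grad, basis):
    """Sketch of orthogonal gradient projection: strip from the new task's
    gradient its components along directions important to previous tasks,
    so updating the classifier layer does not disturb them."""
    for u in basis:
        grad = grad - (grad @ u) / (u @ u) * u  # remove component along u
    return grad

g_new = np.array([1.0, 2.0, -1.0])     # gradient of the new task
protected = [np.array([1.0, 0.0, 0.0]),  # orthogonal directions old tasks use
             np.array([0.0, 1.0, 1.0])]
g_safe = project_orthogonal(g_new, protected)
print(g_safe)                                  # [0.  1.5 -1.5]
print([float(g_safe @ u) for u in protected])  # ~0 along every protected dir
```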

Result: Extensive evaluations show ETCL markedly outperforms strong baselines on dissimilar, similar, and mixed task sequences, achieving forgetting-free learning with positive knowledge transfer.

Conclusion: ETCL successfully addresses both catastrophic forgetting and enables positive bidirectional knowledge transfer, representing a more comprehensive solution for continual learning that goes beyond just preventing forgetting to actively improving task performance through knowledge sharing.

Abstract: Existing research on continual learning (CL) of a sequence of tasks focuses mainly on dealing with catastrophic forgetting (CF) to balance the learning plasticity of new tasks and the memory stability of old tasks. However, an ideal CL agent should not only be able to overcome CF, but also encourage positive forward and backward knowledge transfer (KT), i.e., using the learned knowledge from previous tasks for the new task learning (namely FKT), and improving the previous tasks’ performance with the knowledge of the new task (namely BKT). To this end, this paper first models CL as an optimization problem in which each sequential learning task aims to achieve its optimal performance under the constraint that both FKT and BKT should be positive. It then proposes a novel Enhanced Task Continual Learning (ETCL) method, which achieves forgetting-free and positive KT. Furthermore, the bounds that can lead to negative FKT and BKT are estimated theoretically. Based on the bounds, a new strategy for online task similarity detection is also proposed to facilitate positive KT. To overcome CF, ETCL learns a set of task-specific binary masks to isolate a sparse sub-network for each task while preserving the performance of a dense network for the task. At the beginning of a new task learning, ETCL tries to align the new task’s gradient with that of the sub-network of the previous most similar task to ensure positive FKT. By using a new bi-objective optimization strategy and an orthogonal gradient projection method, ETCL updates only the weights of previous similar tasks at the classification layer to achieve positive BKT. Extensive evaluations demonstrate that the proposed ETCL markedly outperforms strong baselines on dissimilar, similar, and mixed task sequences.

[286] Transformer Is Inherently a Causal Learner

Xinyue Wang, Stephen Wang, Biwei Huang

Main category: cs.LG

TL;DR: Transformers trained autoregressively naturally learn time-delayed causal structures from multivariate time series, enabling causal graph discovery without explicit causal objectives.

DetailsMotivation: To show that transformers inherently capture causal relationships during autoregressive training, providing a new approach to causal discovery that leverages foundation models rather than traditional causal algorithms.

Method: Analyze gradient sensitivities of transformer outputs with respect to past inputs to recover causal graphs, develop practical extraction method using aggregated gradient attributions, and prove theoretical connection under standard identifiability conditions.
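
A minimal sketch of the extraction procedure: aggregate output-input gradient magnitudes over a batch and threshold them into time-lagged edges; the paper's exact aggregation and identifiability conditions are not reproduced here:

```python
import torch

def gradient_causal_graph(model, x, threshold=0.05):
    """Score each (lag, cause, effect) edge by the batch-averaged magnitude
    of d y_j / d x_{t-lag,i}, then threshold into a candidate causal graph.

    x: (batch, T, d) multivariate series; model predicts the next step (d,).
    """
    x = x.clone().requires_grad_(True)
    y = model(x)  # (batch, d) next-step forecast
    scores = torch.zeros(x.shape[1], x.shape[2], y.shape[1])
    for j in range(y.shape[1]):  # sensitivity of each output variable
        g, = torch.autograd.grad(y[:, j].sum(), x, retain_graph=True)
        scores[:, :, j] = g.abs().mean(dim=0)  # aggregate over the batch
    return scores > threshold * scores.max()   # (lag, cause, effect) edges

# Toy system: variable 1 is driven by variable 0 one step earlier.
model = lambda x: torch.stack([x[:, -1, 0], 0.9 * x[:, -2, 0]], dim=-1)
x = torch.randn(64, 3, 2)
print(gradient_causal_graph(model, x).int())
```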

Result: The approach significantly outperforms state-of-the-art causal discovery algorithms on challenging cases (nonlinear dynamics, long-term dependencies, non-stationary systems), especially with increasing data heterogeneity, showing scaling potential where accuracy improves with data volume.

Conclusion: This establishes a unifying framework where causal discovery can operate through foundation models, and foundation models gain interpretability through causality, suggesting a future paradigm shift in both fields.

Abstract: We reveal that transformers trained in an autoregressive manner naturally encode time-delayed causal structures in their learned representations. When predicting future values in multivariate time series, the gradient sensitivities of transformer outputs with respect to past inputs directly recover the underlying causal graph, without any explicit causal objectives or structural constraints. We prove this connection theoretically under standard identifiability conditions and develop a practical extraction method using aggregated gradient attributions. On challenging cases such as nonlinear dynamics, long-term dependencies, and non-stationary systems, this approach greatly surpasses the performance of state-of-the-art discovery algorithms, especially as data heterogeneity increases, exhibiting scaling potential where causal accuracy improves with data volume and heterogeneity, a property traditional methods lack. This unifying view lays the groundwork for a future paradigm where causal discovery operates through the lens of foundation models, and foundation models gain interpretability and enhancement through the lens of causality.

[287] From Global to Local: Cluster-Aware Learning for Wi-Fi Fingerprinting Indoor Localisation

Miguel Matey-Sanz, Joaquín Torres-Sospedra, Joaquín Huerta, Sergio Trilles

Main category: cs.LG

TL;DR: Clustering-based method structures Wi-Fi fingerprint datasets before localization to improve accuracy by reducing data heterogeneity and ambiguity in large multi-floor environments.

DetailsMotivation: Wi-Fi fingerprinting suffers from performance limitations due to large heterogeneous datasets, strong RSSI variability, and ambiguity in multi-floor environments, which degrade localization accuracy when using global models without structural constraints.

Method: Clustering-based approach that groups fingerprints using spatial or radio features at building or floor level. During localization, clustering estimation based on strongest access points assigns unseen fingerprints to relevant clusters, then localization is performed only within selected clusters using learning models on reduced, coherent data subsets.
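
A minimal sketch of the cluster-then-localise idea, assuming fingerprints are RSSI vectors with one column per access point; the strongest-AP assignment rule and per-cluster kNN regressor are illustrative stand-ins for the models evaluated in the paper:

```python
# Cluster fingerprints, then localise only inside the assigned cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-100, -30, size=(500, 20))  # RSSI, one column per AP
y_train = rng.uniform(0, 50, size=(500, 2))       # (x, y) positions in metres

k = 4
clusterer = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
labels = clusterer.labels_

# Dominant access point and a dedicated localiser for each cluster.
strongest_ap = {c: int(np.argmax(X_train[labels == c].mean(axis=0)))
                for c in range(k)}
models = {c: KNeighborsRegressor(n_neighbors=3).fit(X_train[labels == c],
                                                    y_train[labels == c])
          for c in range(k)}

def localise(fp):
    """Route an unseen fingerprint by its strongest AP, then localise."""
    matches = [c for c, ap in strongest_ap.items() if ap == int(np.argmax(fp))]
    c = matches[0] if matches else int(clusterer.predict(fp[None])[0])
    return models[c].predict(fp[None])[0]

print(localise(X_train[0]))
```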

Result: Evaluation on three public datasets with several ML models shows consistent reduction in localization errors, especially under building-level strategies, but with reduced floor detection accuracy.

Conclusion: Explicitly structuring datasets through clustering is an effective and flexible approach for scalable indoor positioning, though there’s a trade-off between localization accuracy and floor detection performance.

Abstract: Wi-Fi fingerprinting remains one of the most practical solutions for indoor positioning; however, its performance is often limited by the size and heterogeneity of fingerprint datasets, strong Received Signal Strength Indicator variability, and the ambiguity introduced in large and multi-floor environments. These factors significantly degrade localisation accuracy, particularly when global models are applied without considering structural constraints. This paper introduces a clustering-based method that structures the fingerprint dataset prior to localisation. Fingerprints are grouped using either spatial or radio features, and clustering can be applied at the building or floor level. In the localisation phase, a clustering estimation procedure based on the strongest access points assigns unseen fingerprints to the most relevant cluster. Localisation is then performed only within the selected clusters, allowing learning models to operate on reduced and more coherent subsets of data. The effectiveness of the method is evaluated on three public datasets and several machine learning models. Results show a consistent reduction in localisation errors, particularly under building-level strategies, but at the cost of reducing the floor detection accuracy. These results demonstrate that explicitly structuring datasets through clustering is an effective and flexible approach for scalable indoor positioning.

[288] Do Sparse Autoencoders Identify Reasoning Features in Language Models?

George Ma, Zhongyuan Liang, Irene Y. Chen, Somayeh Sojoudi

Main category: cs.LG

TL;DR: SAE features identified by contrastive methods don’t capture genuine reasoning computations but rather linguistic correlates; they’re easily triggered by token-level interventions and fail falsification tests.

DetailsMotivation: To determine whether sparse autoencoders (SAEs) actually identify genuine reasoning features in LLMs, or if they're capturing superficial linguistic patterns instead.

Method: Falsification framework combining causal token injection experiments (injecting feature-associated tokens into non-reasoning text) and LLM-guided falsification (generating counterexamples) across 20 configurations spanning multiple models, layers, and reasoning datasets.
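
A toy illustration of the token-injection probe, with random tensors standing in for model hidden states and SAE weights; all names and sizes here are hypothetical:

```python
# Toy token-injection probe; tensors stand in for model states and SAE weights.
import torch

torch.manual_seed(0)
d_model, n_features, feat = 64, 512, 42
W_enc = torch.randn(n_features, d_model)  # hypothetical SAE encoder weight
b_enc = torch.zeros(n_features)

def sae_activation(hidden, feat_idx):
    """Mean ReLU activation of one SAE feature over a token sequence."""
    return torch.relu(hidden @ W_enc.T + b_enc)[:, feat_idx].mean()

# hidden_plain: states for non-reasoning text; hidden_injected: the same text
# with a few feature-associated tokens spliced in, simulated here by shifting
# three positions along the feature's encoder direction.
hidden_plain = torch.randn(20, d_model)
hidden_injected = hidden_plain.clone()
hidden_injected[5:8] += 2.0 * W_enc[feat] / W_enc[feat].norm()

delta = sae_activation(hidden_injected, feat) - sae_activation(hidden_plain, feat)
# A large jump from a purely lexical edit suggests the feature tracks surface
# tokens rather than reasoning computation.
print(f"activation increase after injection: {delta.item():.3f}")
```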

Result: 59-94% of features were easily triggered by token injections, showing lexical artifact reliance. Remaining features failed LLM-guided falsification tests: no feature satisfied the criteria for genuine reasoning behavior. Steering features yielded minimal or negative benchmark performance changes.

Conclusion: SAE features identified by contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves.

Abstract: We investigate whether sparse autoencoders (SAEs) identify genuine reasoning features in large language models (LLMs). Starting from features selected using standard contrastive activation methods, we introduce a falsification-oriented framework that combines causal token injection experiments and LLM-guided falsification to test whether feature activation reflects reasoning processes or superficial linguistic correlates. Across 20 configurations spanning multiple model families, layers, and reasoning datasets, we find that identified reasoning features are highly sensitive to token-level interventions. Injecting a small number of feature-associated tokens into non-reasoning text is sufficient to elicit strong activation for 59% to 94% of features, indicating reliance on lexical artifacts. For the remaining features that are not explained by simple token triggers, LLM-guided falsification consistently produces non-reasoning inputs that activate the feature and reasoning inputs that do not, with no analyzed feature satisfying our criteria for genuine reasoning behavior. Steering these features yields minimal changes or slight degradations in benchmark performance. Together, these results suggest that SAE features identified by contrastive approaches primarily capture linguistic correlates of reasoning rather than the underlying reasoning computations themselves.

[289] Towards Realistic Guarantees: A Probabilistic Certificate for SmoothLLM

Adarsh Kumarappan, Ayushi Mehrotra

Main category: cs.LG

TL;DR: The paper introduces a more realistic probabilistic framework called (k, ε)-unstable to improve SmoothLLM’s certification against jailbreaking attacks, addressing limitations of the original strict k-unstable assumption.

DetailsMotivation: The SmoothLLM defense provides certification against jailbreaking attacks but relies on a strict k-unstable assumption that rarely holds in practice, limiting the trustworthiness of its safety certificates.

Method: Introduces a probabilistic framework called (k, ε)-unstable to certify defenses against diverse jailbreaking attacks (from gradient-based GCG to semantic PAIR). Derives a new data-informed lower bound on SmoothLLM’s defense probability by incorporating empirical models of attack success.
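
As a back-of-envelope companion, the sketch below computes a majority-vote defense probability when each perturbed copy independently stays jailbroken with probability p, the kind of empirical attack-success quantity such a framework plugs in; the i.i.d. assumption and vote rule are simplifications, not the paper's bound:

```python
# Hypothetical helper: probability that fewer than half of N perturbed copies
# remain jailbroken, given a per-copy survival probability p (assumed i.i.d.).
from math import comb

def defense_probability(p: float, N: int) -> float:
    return sum(comb(N, m) * p**m * (1 - p)**(N - m) for m in range((N + 1) // 2))

print(defense_probability(p=0.3, N=11))  # more copies -> stronger certificate
```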

Result: Provides a more trustworthy and practical safety certificate that better reflects real-world LLM behavior, enabling practitioners to set more realistic certification thresholds.

Conclusion: The work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to exploitation of their safety alignments, addressing a critical challenge in secure AI deployment.

Abstract: The SmoothLLM defense provides a certification guarantee against jailbreaking attacks, but it relies on a strict 'k-unstable' assumption that rarely holds in practice. This strong assumption can limit the trustworthiness of the provided safety certificate. In this work, we address this limitation by introducing a more realistic probabilistic framework, '(k, $\varepsilon$)-unstable', to certify defenses against diverse jailbreaking attacks, from gradient-based (GCG) to semantic (PAIR). We derive a new, data-informed lower bound on SmoothLLM’s defense probability by incorporating empirical models of attack success, providing a more trustworthy and practical safety certificate. By introducing the notion of (k, $\varepsilon$)-unstable, our framework provides practitioners with actionable safety guarantees, enabling them to set certification thresholds that better reflect the real-world behavior of LLMs. Ultimately, this work contributes a practical and theoretically-grounded mechanism to make LLMs more resistant to the exploitation of their safety alignments, a critical challenge in secure AI deployment.

[290] AGDC: Autoregressive Generation of Variable-Length Sequences with Joint Discrete and Continuous Spaces

Yeonsang Shin, Insoo Kim, Bongkeun Kim, Keonwoo Bae, Bohyung Han

Main category: cs.LG

TL;DR: AGDC is a unified framework for generating hybrid discrete-continuous sequences that combines categorical prediction for discrete values with diffusion modeling for continuous values, addressing precision limitations of token-based approaches in high-precision domains like semiconductor layouts.

DetailsMotivation: Transformer-based autoregressive models are limited by discretized tokens that can't represent continuous values with high precision, especially problematic in domains like semiconductor circuit design where precision loss causes functional failures. Existing discretization approaches don't scale well for hybrid discrete-continuous sequences.

Method: AGDC jointly models discrete and continuous values using a hybrid approach: categorical prediction for discrete values and diffusion-based modeling for continuous values. Key innovations include: 1) EOS logit adjustment mechanism using an MLP to dynamically adjust end-of-sequence token logits based on context, and 2) length regularization term in the loss function. Also introduced ContLayNet benchmark with 334K semiconductor layout samples.
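
A hypothetical sketch of the EOS logit-adjustment idea: a small MLP reads a pooled summary of the sequence and shifts the end-of-sequence logit; the layer sizes and mean pooling are assumptions:

```python
# Hypothetical EOS logit adjustment: an MLP on pooled context shifts the
# end-of-sequence logit on top of the usual next-token head.
import torch
import torch.nn as nn

class EOSAdjuster(nn.Module):
    def __init__(self, d_model, vocab_size, eos_id):
        super().__init__()
        self.eos_id = eos_id
        self.head = nn.Linear(d_model, vocab_size)
        self.adjust = nn.Sequential(nn.Linear(d_model, 64), nn.ReLU(),
                                    nn.Linear(64, 1))

    def forward(self, hidden):                        # hidden: (B, T, d_model)
        logits = self.head(hidden[:, -1])             # next-token logits (B, V)
        delta = self.adjust(hidden.mean(dim=1)).squeeze(-1)  # context-aware shift
        bump = torch.zeros_like(logits)
        bump[:, self.eos_id] = delta                  # applied to EOS only
        return logits + bump

out = EOSAdjuster(d_model=32, vocab_size=100, eos_id=2)(torch.randn(4, 10, 32))
print(out.shape)  # torch.Size([4, 100])
```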

Result: AGDC outperforms discretization-based and fixed-schema baselines on semiconductor layouts (ContLayNet), graphic layouts, and SVGs, achieving superior high-fidelity hybrid vector representation generation and scalable high-precision generation across domains.

Conclusion: AGDC provides a unified framework for scalable high-precision generation of hybrid discrete-continuous sequences, overcoming limitations of token-based approaches and enabling reliable generation in precision-critical domains like semiconductor design.

Abstract: Transformer-based autoregressive models excel in data generation but are inherently constrained by their reliance on discretized tokens, which limits their ability to represent continuous values with high precision. We analyze the scalability limitations of existing discretization-based approaches for generating hybrid discrete-continuous sequences, particularly in high-precision domains such as semiconductor circuit designs, where precision loss can lead to functional failure. To address the challenge, we propose AGDC, a novel unified framework that jointly models discrete and continuous values for variable-length sequences. AGDC employs a hybrid approach that combines categorical prediction for discrete values with diffusion-based modeling for continuous values, incorporating two key technical components: an end-of-sequence (EOS) logit adjustment mechanism that uses an MLP to dynamically adjust EOS token logits based on sequence context, and a length regularization term integrated into the loss function. Additionally, we present ContLayNet, a large-scale benchmark comprising 334K high-precision semiconductor layout samples with specialized evaluation metrics that capture functional correctness where precision errors significantly impact performance. Experiments on semiconductor layouts (ContLayNet), graphic layouts, and SVGs demonstrate AGDC’s superior performance in generating high-fidelity hybrid vector representations compared to discretization-based and fixed-schema baselines, achieving scalable high-precision generation across diverse domains.

[291] Automating Deception: Scalable Multi-Turn LLM Jailbreaks

Adarsh Kumarappan, Ananya Mujoo

Main category: cs.LG

TL;DR: Automated pipeline creates large-scale psychological multi-turn jailbreak datasets using Foot-in-the-Door techniques, revealing GPT models are vulnerable to conversational history while Gemini shows exceptional resilience.

DetailsMotivation: Multi-turn conversational attacks using psychological principles like Foot-in-the-Door pose persistent threats to LLMs, but defense progress is hindered by manual, hard-to-scale dataset creation.

Method: Novel automated pipeline for generating large-scale psychologically-grounded multi-turn jailbreak datasets. Systematically operationalizes FITD techniques into reproducible templates, creating benchmark of 1,500 scenarios across illegal activities and offensive content.

Result: GPT family shows significant vulnerability to conversational history (ASR increases up to 32 percentage points). Gemini 2.5 Flash exhibits exceptional resilience (nearly immune). Claude 3 Haiku shows strong but imperfect resistance.

Conclusion: Critical divergence in how current safety architectures handle conversational context, highlighting need for defenses that can resist narrative-based manipulation.

Abstract: Multi-turn conversational attacks, which leverage psychological principles like Foot-in-the-Door (FITD), where a small initial request paves the way for a more significant one, to bypass safety alignments, pose a persistent threat to Large Language Models (LLMs). Progress in defending against these attacks is hindered by a reliance on manual, hard-to-scale dataset creation. This paper introduces a novel, automated pipeline for generating large-scale, psychologically-grounded multi-turn jailbreak datasets. We systematically operationalize FITD techniques into reproducible templates, creating a benchmark of 1,500 scenarios across illegal activities and offensive content. We evaluate seven models from three major LLM families under both multi-turn (with history) and single-turn (without history) conditions. Our results reveal stark differences in contextual robustness: models in the GPT family demonstrate a significant vulnerability to conversational history, with Attack Success Rates (ASR) increasing by as much as 32 percentage points. In contrast, Google’s Gemini 2.5 Flash exhibits exceptional resilience, proving nearly immune to these attacks, while Anthropic’s Claude 3 Haiku shows strong but imperfect resistance. These findings highlight a critical divergence in how current safety architectures handle conversational context and underscore the need for defenses that can resist narrative-based manipulation.

[292] FLRQ: Faster LLM Quantization with Flexible Low-Rank Matrix Sketching

Hongyaoxing Gul, Lijuan Hu, Shuzi Niu, Fangfang Liu

Main category: cs.LG

TL;DR: FLRQ is a flexible low-rank quantization method for LLMs that quickly identifies optimal ranks per layer using R1-Sketch and minimizes quantization error through iterative clipping, achieving SOTA performance without costly fine-tuning.

DetailsMotivation: Existing low-rank PTQ methods require expensive fine-tuning to find compromise ranks for different layers and data, failing to exploit full potential. Current SVD-based approaches also add computational overhead.

Method: FLRQ has two components: 1) R1-FLR uses R1-Sketch with Gaussian projection for fast low-rank approximation and outlier-aware rank extraction per layer, 2) BLC minimizes low-rank quantization error under scaling/clipping through iterative optimization.
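
The fast low-rank step can be pictured with the generic Gaussian-sketching recipe below; the paper's R1-Sketch and outlier-aware rank selection are more involved, and this is only the textbook randomized approximation:

```python
# Textbook randomized low-rank approximation via a Gaussian test matrix.
import numpy as np

def sketch_low_rank(W, r, oversample=8, seed=0):
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((W.shape[1], r + oversample))
    Q, _ = np.linalg.qr(W @ omega)          # orthonormal basis for the range
    U, s, Vt = np.linalg.svd(Q.T @ W, full_matrices=False)
    return (Q @ U[:, :r]) * s[:r], Vt[:r]   # rank-r factors A, B with W ~ A @ B

W = np.random.default_rng(1).standard_normal((512, 512))
A, B = sketch_low_rank(W, r=16)
print(np.linalg.norm(W - A @ B) / np.linalg.norm(W))  # relative residual
```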

Result: FLRQ achieves state-of-the-art performance in both quantization quality and algorithm efficiency, demonstrating strong effectiveness and robustness in comprehensive experiments.

Conclusion: FLRQ provides a novel solution for flexible low-rank quantization that quickly identifies optimal ranks per layer and minimizes storage combinations, outperforming existing methods without requiring costly fine-tuning.

Abstract: Traditional post-training quantization (PTQ) is considered an effective approach to reduce model size and accelerate inference of large-scale language models (LLMs). However, existing low-rank PTQ methods require costly fine-tuning to determine a compromise rank for diverse data and layers in large models, failing to exploit their full potential. Additionally, the current SVD-based low-rank approximation compounds the computational overhead. In this work, we thoroughly analyze the varying effectiveness of low-rank approximation across different layers in representative models. Accordingly, we introduce Flexible Low-Rank Quantization (FLRQ), a novel solution designed to quickly identify the accuracy-optimal ranks and aggregate them to achieve minimal storage combinations. FLRQ comprises two powerful components, Rank1-Sketch-based Flexible Rank Selection (R1-FLR) and Best Low-rank Approximation under Clipping (BLC). R1-FLR applies the R1-Sketch with Gaussian projection for the fast low-rank approximation, enabling outlier-aware rank extraction for each layer. Meanwhile, BLC aims at minimizing the low-rank quantization error under the scaling and clipping strategy through an iterative method. FLRQ demonstrates strong effectiveness and robustness in comprehensive experiments, achieving state-of-the-art performance in both quantization quality and algorithm efficiency.

[293] Tiny Recursive Models on ARC-AGI-1: Inductive Biases, Identity Conditioning, and Test-Time Compute

Antonio Roye-Azar, Santiago Vargas-Naranjo, Dhruv Ghai, Nithin Balamurugan, Rayan Amir

Main category: cs.LG

TL;DR: TRM’s strong performance on ARC-AGI-1 stems mainly from test-time augmentation/voting, task ID dependence, shallow recursion, and efficiency advantages rather than deep reasoning capabilities.

DetailsMotivation: To understand what factors contribute to TRM's reported strong performance on ARC tasks, distinguishing between architectural reasoning capabilities vs. test-time compute, task priors, and efficiency factors.

Method: Empirical analysis of ARC Prize TRM checkpoint on ARC-AGI-1 through: 1) test-time augmentation/voting ablation, 2) puzzle-identity ablation, 3) recursion trajectory analysis, 4) training augmentation experiments, and 5) efficiency comparison with Llama 3 8B QLoRA fine-tune.
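
The augmentation-plus-voting pipeline behind the first ablation follows the pattern below; `predict`, `augment`, and `invert` are hypothetical callables standing in for checkpoint inference, ARC grid augmentation, and its inverse back to canonical form:

```python
# Majority-vote ensembling over augmented task views; all three callables are
# hypothetical placeholders, and predictions are assumed hashable (e.g., grids
# represented as tuples of tuples).
from collections import Counter

def vote(predict, augment, invert, task, n_samples=1000):
    ballots = Counter()
    for seed in range(n_samples):
        view = augment(task, seed=seed)           # e.g., rotate / permute colors
        ballots[invert(predict(view), seed=seed)] += 1
    return ballots.most_common(1)[0][0]           # canonical answer with most votes
```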

Result: 1) 1000-sample voting improves Pass@1 by ~11pp over single inference; 2) Zero accuracy without correct puzzle IDs; 3) Most accuracy achieved at first recursion step, saturation after few updates; 4) Heavy augmentation broadens solution distribution; 5) TRM has much higher throughput and lower memory than Llama 3 8B.

Conclusion: TRM’s performance arises from efficiency, task-specific conditioning, and aggressive test-time compute rather than deep internal reasoning. The architecture enables parameter-efficient processing but doesn’t demonstrate sophisticated recursive reasoning.

Abstract: Tiny Recursive Models (TRM) were proposed as a parameter-efficient alternative to large language models for solving Abstraction and Reasoning Corpus (ARC) style tasks. The original work reports strong performance and suggests that recursive latent updates enable non-trivial reasoning, but it remains unclear how much of this performance stems from architecture, test-time compute, or task-specific priors. In this technical note, we empirically analyze the ARC Prize TRM checkpoint on ARC-AGI-1 and report four behavioral findings and an efficiency comparison. First, we show that test-time augmentation and majority-vote ensembling account for a substantial fraction of reported performance: the 1000-sample voting pipeline improves Pass@1 by about 11 percentage points over single-pass canonical inference. Second, a puzzle-identity ablation reveals strict dependence on task identifiers: replacing the correct puzzle ID with a blank or random token yields zero accuracy. Third, a recursion trajectory analysis shows that most of the final accuracy is achieved at the first recursion step and that performance saturates after few latent updates, indicating shallow effective recursion. Fourth, early-stage training experiments under canonical versus heavy augmentation regimes suggest that heavy augmentation broadens the distribution of candidate solutions and improves multi-sample success. Finally, we compare TRM with a naive QLoRA fine-tune of Llama 3 8B on canonical ARC-AGI-1, finding that TRM’s non-autoregressive design achieves much higher throughput and substantially lower memory usage in this setting. Overall, TRM’s ARC-AGI-1 performance appears to arise from an interaction between efficiency, task-specific conditioning, and aggressive test-time compute rather than deep internal reasoning.

[294] mHC-lite: You Don’t Need 20 Sinkhorn-Knopp Iterations

Yongyi Yang, Jianyang Gao

Main category: cs.LG

TL;DR: mHC-lite replaces iterative Sinkhorn-Knopp normalization with explicit doubly stochastic matrix construction via convex combinations of permutations, ensuring exact doubly stochasticity and better portability while matching performance.

DetailsMotivation: Address limitations of mHC: (1) finite SK iterations don't guarantee exact doubly stochasticity, causing approximation gaps that accumulate through depth and undermine stability; (2) SK requires specialized CUDA kernels, creating engineering barriers and reducing portability.

Method: mHC-lite uses Birkhoff-von Neumann theorem to explicitly construct doubly stochastic matrices as convex combinations of permutation matrices. This guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations.
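
A minimal sketch of the Birkhoff-von Neumann reparameterization: learn mixing logits over a fixed bank of permutation matrices whose softmax-weighted sum is exactly doubly stochastic by construction (how mHC-lite selects and sizes its permutation bank is not shown here):

```python
# Exactly doubly stochastic mixing via convex combinations of permutations.
import itertools
import torch
import torch.nn as nn

class BirkhoffMix(nn.Module):
    def __init__(self, n_streams):
        super().__init__()
        perms = list(itertools.permutations(range(n_streams)))
        eye = torch.eye(n_streams)
        self.register_buffer("P", torch.stack([eye[list(p)] for p in perms]))
        self.logits = nn.Parameter(torch.zeros(len(perms)))

    def forward(self):
        w = torch.softmax(self.logits, dim=0)        # convex weights
        return torch.einsum("k,kij->ij", w, self.P)  # doubly stochastic by construction

M = BirkhoffMix(4)()
print(M.sum(0), M.sum(1))  # both all-ones, with zero Sinkhorn iterations
```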

Result: mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with naive implementation and eliminating residual instabilities observed in both HC and mHC.

Conclusion: mHC-lite provides a simpler, more portable alternative to mHC that guarantees exact doubly stochasticity, improves training stability, and maintains performance while being easier to implement.

Abstract: Hyper-Connections (HC) generalizes residual connections by introducing dynamic residual matrices that mix information across multiple residual streams, accelerating convergence in deep neural networks. However, unconstrained residual matrices can compromise training stability. To address this, DeepSeek’s Manifold-Constrained Hyper-Connections (mHC) approximately projects these matrices onto the Birkhoff polytope via iterative Sinkhorn–Knopp (SK) normalization. We identify two limitations of this approach: (i) finite SK iterations do not guarantee exact doubly stochasticity, leaving an approximation gap that can accumulate through network depth and undermine stability; (ii) efficient SK implementation requires highly specialized CUDA kernels, raising engineering barriers and reducing portability. Motivated by the Birkhoff–von Neumann theorem, we propose mHC-lite, a simple reparameterization that explicitly constructs doubly stochastic matrices as convex combinations of permutation matrices. This approach guarantees exact doubly stochasticity by construction and can be implemented using only native matrix operations. Extensive experiments demonstrate that mHC-lite matches or exceeds mHC in performance while achieving higher training throughput with a naive implementation and eliminating the residual instabilities observed in both HC and mHC. The code is publicly available at https://github.com/FFTYYY/mhc-lite.

[295] Variational Autoencoders for P-wave Detection on Strong Motion Earthquake Spectrograms

Turkan Simge Ispak, Salih Tileylioglu, Erdem Akagunduz

Main category: cs.LG

TL;DR: Self-supervised anomaly detection using VAE architectures shows attention mechanisms outperform skip connections for P-wave detection in earthquake early warning by prioritizing global context over local reconstruction fidelity.

DetailsMotivation: P-wave detection is critical for earthquake early warning but faces challenges from high noise levels, limited labeled data, and complex waveforms in strong-motion records.

Method: Reframed P-wave detection as self-supervised anomaly detection task. Conducted comprehensive grid search of 492 VAE configurations to evaluate architectural variations (skip connections vs attention mechanisms) regulating trade-off between reconstruction fidelity and anomaly discrimination.
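
In sketch form, the anomaly-detection framing reduces to scoring spectrogram windows by reconstruction error and flagging the first spike; `vae_reconstruct` is a hypothetical trained-model call and the threshold rule is an assumption:

```python
# Score windows by reconstruction error; flag the first spike as the arrival.
import numpy as np

def anomaly_scores(windows, vae_reconstruct):
    """Per-window mean absolute reconstruction error (windows: (N, H, W))."""
    recon = vae_reconstruct(windows)        # hypothetical trained-VAE call
    return np.abs(windows - recon).mean(axis=(1, 2))

def detect_arrival(scores, k=3.0):
    """Index of the first window exceeding mean + k*std of the scores."""
    hits = np.flatnonzero(scores > scores.mean() + k * scores.std())
    return int(hits[0]) if hits.size else None
```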

Result: Skip connections minimize reconstruction error (MAE ~0.0012) but cause “overgeneralization” that reconstructs noise and masks detection signal. Attention mechanisms prioritize global context over local detail and yield highest detection performance (AUC 0.875). Attention-based VAE achieves AUC 0.91 in 0-40km near-source range.

Conclusion: Architectural constraints favoring global context over pixel-perfect reconstruction are essential for robust, self-supervised P-wave detection. Attention-based VAEs demonstrate high suitability for immediate early warning applications.

Abstract: Accurate P-wave detection is critical for earthquake early warning, yet strong-motion records pose challenges due to high noise levels, limited labeled data, and complex waveform characteristics. This study reframes P-wave arrival detection as a self-supervised anomaly detection task to evaluate how architectural variations regulate the trade-off between reconstruction fidelity and anomaly discrimination. Through a comprehensive grid search of 492 Variational Autoencoder configurations, we show that while skip connections minimize reconstruction error (Mean Absolute Error approximately 0.0012), they induce “overgeneralization”, allowing the model to reconstruct noise and masking the detection signal. In contrast, attention mechanisms prioritize global context over local detail and yield the highest detection performance with an area-under-the-curve of 0.875. The attention-based Variational Autoencoder achieves an area-under-the-curve of 0.91 in the 0 to 40-kilometer near-source range, demonstrating high suitability for immediate early warning applications. These findings establish that architectural constraints favoring global context over pixel-perfect reconstruction are essential for robust, self-supervised P-wave detection.

[296] Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer

Yifan Zhang, Wei Bi, Kechi Zhang, Dongming Jin, Jie Fu, Zhi Jin

Main category: cs.LG

TL;DR: Discrete Transformer bridges continuous representations and discrete symbolic logic through functional disentanglement and temperature-annealed sampling, enabling extraction of human-readable programs from trained models without human-written code.

DetailsMotivation: Algorithm extraction from Transformer models is hindered by superposition (entangled features in overlapping directions), which obstructs extraction of symbolic expressions. The goal is to enable de novo algorithm discovery without relying on human-written code.

Method: Proposes Discrete Transformer with strict functional disentanglement (Numerical Attention for information routing, Numerical MLP for element-wise arithmetic) and temperature-annealed sampling to facilitate program extraction.
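
A toy sketch of temperature-annealed discrete sampling: a high early temperature explores program choices while a low late temperature commits to one, matching the exploration-to-exploitation transition the paper reports; the geometric schedule and its constants are assumptions:

```python
# Geometric temperature decay for discrete sampling over candidate operations.
import torch

def annealed_sample(logits, step, total_steps, tau_hi=5.0, tau_lo=0.1):
    tau = tau_hi * (tau_lo / tau_hi) ** (step / total_steps)  # decaying temperature
    probs = torch.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, 1).item(), tau

logits = torch.randn(10)                 # scores over 10 candidate operations
print(annealed_sample(logits, step=0, total_steps=100))   # exploratory
print(annealed_sample(logits, step=99, total_steps=100))  # near-greedy
```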

Result: Achieves performance comparable to RNN-based baselines, extends interpretability to continuous variable domains, shows clear phase transition in annealing process, and enables fine-grained control over synthesized programs via inductive biases.

Conclusion: Discrete Transformer establishes a robust framework for demonstration-free algorithm discovery and offers a rigorous pathway toward Transformer interpretability by bridging continuous representations and discrete symbolic logic.

Abstract: Algorithm extraction aims to synthesize executable programs directly from models trained on specific algorithmic tasks, enabling de novo algorithm discovery without relying on human-written code. However, extending this paradigm to Transformer is hindered by superposition, where entangled features encoded in overlapping directions obstruct the extraction of symbolic expressions. In this work, we propose the Discrete Transformer, an architecture explicitly engineered to bridge the gap between continuous representations and discrete symbolic logic. By enforcing a strict functional disentanglement, which constrains Numerical Attention to information routing and Numerical MLP to element-wise arithmetic, and employing temperature-annealed sampling, our method effectively facilitates the extraction of human-readable programs. Empirically, the Discrete Transformer not only achieves performance comparable to RNN-based baselines but crucially extends interpretability to continuous variable domains. Moreover, our analysis of the annealing process shows that the efficient discrete search undergoes a clear phase transition from exploration to exploitation. We further demonstrate that our method enables fine-grained control over synthesized programs by imposing inductive biases. Collectively, these findings establish the Discrete Transformer as a robust framework for demonstration-free algorithm discovery, offering a rigorous pathway toward Transformer interpretability.

[297] Tensor-DTI: Enhancing Biomolecular Interaction Prediction with Contrastive Embedding Learning

Manel Gil-Sorribes, Júlia Vilalta-Mor, Isaac Filella-Mercè, Robert Soliva, Álvaro Ciudad, Víctor Guallar, Alexis Molina

Main category: cs.LG

TL;DR: Tensor-DTI is a contrastive learning framework that integrates multimodal embeddings (molecular graphs, protein language models, binding-site predictions) to improve drug-target interaction prediction accuracy and virtual screening performance.

DetailsMotivation: Existing DTI prediction models rely on single-modality predefined molecular descriptors or sequence-based embeddings with limited representativeness, which restricts their ability to accurately model drug-target interactions.

Method: Tensor-DTI uses a contrastive learning framework with siamese dual-encoder architecture that integrates multimodal embeddings from molecular graphs, protein language models, and binding-site predictions to capture both chemical and structural interaction features while distinguishing interacting from non-interacting pairs.
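
The dual-encoder contrastive objective can be sketched as an in-batch InfoNCE-style loss over matched drug/target embeddings; the temperature and symmetric form are assumptions, not necessarily Tensor-DTI's exact loss:

```python
# In-batch contrastive loss over matched (drug, target) embedding pairs.
import torch
import torch.nn.functional as F

def contrastive_loss(drug_emb, target_emb, tau=0.07):
    d = F.normalize(drug_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)
    logits = d @ t.T / tau                        # (B, B) similarities
    labels = torch.arange(d.size(0))              # diagonal pairs are positives
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss)
```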

Result: Tensor-DTI outperforms existing sequence-based and graph-based models on multiple DTI benchmarks, produces chemically plausible hit distributions in large-scale inference experiments on CDK2, remains competitive with docking methods in enrichment studies, and shows applicability to protein-RNA and peptide-protein interactions.

Conclusion: Integrating multimodal information with contrastive objectives enhances interaction-prediction accuracy and provides more interpretable and reliability-aware models for virtual screening, demonstrating the benefits of comprehensive molecular representation.

Abstract: Accurate drug-target interaction (DTI) prediction is essential for computational drug discovery, yet existing models often rely on single-modality predefined molecular descriptors or sequence-based embeddings with limited representativeness. We propose Tensor-DTI, a contrastive learning framework that integrates multimodal embeddings from molecular graphs, protein language models, and binding-site predictions to improve interaction modeling. Tensor-DTI employs a siamese dual-encoder architecture, enabling it to capture both chemical and structural interaction features while distinguishing interacting from non-interacting pairs. Evaluations on multiple DTI benchmarks demonstrate that Tensor-DTI outperforms existing sequence-based and graph-based models. We also conduct large-scale inference experiments on CDK2 across billion-scale chemical libraries, where Tensor-DTI produces chemically plausible hit distributions even when CDK2 is withheld from training. In enrichment studies against Glide docking and Boltz-2 co-folder, Tensor-DTI remains competitive on CDK2 and improves the screening budget required to recover moderate fractions of high-affinity ligands on out-of-family targets under strict family-holdout splits. Additionally, we explore its applicability to protein-RNA and peptide-protein interactions. Our findings highlight the benefits of integrating multimodal information with contrastive objectives to enhance interaction-prediction accuracy and to provide more interpretable and reliability-aware models for virtual screening.

[298] Fusion Matters: Length-Aware Analysis of Positional-Encoding Fusion in Transformers

Mohamed Amine Hallam, Kuo-Kun Tseng

Main category: cs.LG

TL;DR: Positional encoding fusion mechanisms (addition vs. concatenation vs. gating) matter for long-sequence Transformers but not short texts, with learnable fusion strategies showing consistent gains on long documents.

DetailsMotivation: Most prior work focuses on designing new positional encodings rather than examining how positional information is fused with token embeddings. The paper investigates whether the fusion mechanism itself affects performance, especially in long-sequence settings.

Method: Controlled empirical study comparing three canonical fusion strategies (element-wise addition, concatenation with projection, and scalar gated fusion) under identical Transformer architectures, data splits, and random seeds. Experiments on three text classification datasets spanning short (AG News), medium (IMDB), and long (ArXiv) sequences. Additional experiments include paired-seed analysis, cross-dataset comparison, and exploration of a lightweight convolutional gating mechanism for long documents.
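
The three canonical fusion strategies compared in the paper, in minimal PyTorch form; the gate's exact parameterization and initialization are assumptions:

```python
# Element-wise addition, concat-with-projection, and scalar gated fusion.
import torch
import torch.nn as nn

class PEFusion(nn.Module):
    def __init__(self, d_model, mode="add"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            self.proj = nn.Linear(2 * d_model, d_model)
        elif mode == "gate":
            self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable scalar gate

    def forward(self, tok, pos):                  # both (B, T, d_model)
        if self.mode == "add":
            return tok + pos                      # element-wise addition
        if self.mode == "concat":
            return self.proj(torch.cat([tok, pos], dim=-1))
        g = torch.sigmoid(self.alpha)             # scalar gated fusion
        return g * tok + (1 - g) * pos

tok, pos = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
print(PEFusion(64, "gate")(tok, pos).shape)  # torch.Size([2, 16, 64])
```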

Result: Fusion choice has negligible impact on short texts but produces consistent gains on long documents. The benefits are structural rather than stochastic, as verified by paired-seed analysis. Learnable fusion generalizes across multiple positional encoding families. The lightweight convolutional gating mechanism shows promise for long documents.

Conclusion: Positional-encoding fusion is a non-trivial design choice for long-sequence Transformers and should be treated as an explicit modeling decision rather than a fixed default.

Abstract: Transformers require positional encodings to represent sequence order, yet most prior work focuses on designing new positional encodings rather than examining how positional information is fused with token embeddings. In this paper, we study whether the fusion mechanism itself affects performance, particularly in long-sequence settings. We conduct a controlled empirical study comparing three canonical fusion strategies–element-wise addition, concatenation with projection, and scalar gated fusion–under identical Transformer architectures, data splits, and random seeds. Experiments on three text classification datasets spanning short (AG News), medium (IMDB), and long (ArXiv) sequences show that fusion choice has negligible impact on short texts but produces consistent gains on long documents. To verify that these gains are structural rather than stochastic, we perform paired-seed analysis and cross-dataset comparison across sequence-length regimes. Additional experiments on the ArXiv dataset indicate that the benefit of learnable fusion generalizes across multiple positional encoding families. Finally, we explore a lightweight convolutional gating mechanism that introduces local inductive bias at the fusion level, evaluated on long documents only. Our results indicate that positional-encoding fusion is a non-trivial design choice for long-sequence Transformers and should be treated as an explicit modeling decision rather than a fixed default.

[299] Learning Reconstructive Embeddings in Reproducing Kernel Hilbert Spaces via the Representer Theorem

Enrique Feito-Casares, Francisco M. Melgarejo-Meseguer, José-Luis Rojo-Álvarez

Main category: cs.LG

TL;DR: Proposes new RKHS-based manifold learning algorithms using autorepresentation with operator-valued kernels and kernel alignment for dimensionality reduction.

DetailsMotivation: Growing interest in representation learning approaches that uncover latent structure of high-dimensional data, particularly for manifold learning in RKHS frameworks.

Method: 1) Reconstruct observations as linear combinations of other samples in RKHS using vector form of Representer Theorem; 2) Use separable operator-valued kernel for vector-valued data; 3) Kernel-alignment task projects data to lower-dimensional latent space whose Gram matrix matches high-dimensional reconstruction kernel.
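
A rough sketch of the autorepresentation step: each sample is reconstructed in feature space from the others by solving a regularized kernel system with the sample itself excluded; the ridge regularizer is an assumption:

```python
# Autorepresentation weights: column i reconstructs phi(x_i) from the others.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def auto_weights(K, lam=1e-2):
    n = K.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.delete(np.arange(n), i)
        # argmin_w ||phi(x_i) - sum_j w_j phi(x_j)||^2 + lam ||w||^2 in the RKHS
        A = K[np.ix_(idx, idx)] + lam * np.eye(n - 1)
        W[idx, i] = np.linalg.solve(A, K[idx, i])
    return W

X = np.random.default_rng(0).normal(size=(50, 3))
W = auto_weights(rbf_kernel(X))
print(W.shape, np.allclose(np.diag(W), 0))  # zero self-reconstruction
```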

Result: Numerical experiments on simulated (concentric circles, swiss-roll) and real (cancer molecular activity, IoT network intrusions) datasets show practical effectiveness of the approach.

Conclusion: The algorithms represent an extended approach to autorepresentation property using Kernel Learning Theory, transferring auto-reconstruction geometry of RKHS to embeddings for effective manifold learning.

Abstract: Motivated by the growing interest in representation learning approaches that uncover the latent structure of high-dimensional data, this work proposes new algorithms for reconstruction-based manifold learning within Reproducing-Kernel Hilbert Spaces (RKHS). Each observation is first reconstructed as a linear combination of the other samples in the RKHS, by optimizing a vector form of the Representer Theorem for their autorepresentation property. A separable operator-valued kernel extends the formulation to vector-valued data while retaining the simplicity of a single scalar similarity function. A subsequent kernel-alignment task projects the data into a lower-dimensional latent space whose Gram matrix aims to match the high-dimensional reconstruction kernel, thus transferring the auto-reconstruction geometry of the RKHS to the embedding. Therefore, the proposed algorithms represent an extended approach to the autorepresentation property, exhibited by many natural data, by using and adapting well-known results of Kernel Learning Theory. Numerical experiments on both simulated (concentric circles and swiss-roll) and real (cancer molecular activity and IoT network intrusions) datasets provide empirical evidence of the practical effectiveness of the proposed approach.

[300] Detecting Autism Spectrum Disorder with Deep Eye Movement Features

Zhanpei Huang, Taochen Chen, Fangqing Gu, Yiqun Zhang

Main category: cs.LG

TL;DR: A novel discrete short-term sequential (DSTS) modeling framework with class-aware representation and imbalance-aware mechanisms outperforms existing methods for ASD detection using eye movement data.

DetailsMotivation: Eye movement data provides non-invasive diagnostic potential for Autism Spectrum Disorder (ASD) detection, but traditional Transformer models with global attention mechanisms are inefficient for capturing the discrete, short-term temporal dependencies in gaze patterns.

Method: Proposes a discrete short-term sequential (DSTS) modeling framework with two key components: Class-aware Representation mechanism to learn discriminative features for ASD vs. TD classification, and Imbalance-aware Mechanism to handle dataset imbalances.

Result: DSTS outperforms both traditional machine learning techniques and sophisticated deep learning models across multiple eye movement datasets, demonstrating superior ASD detection performance.

Conclusion: The DSTS framework effectively captures subtle eye movement patterns for ASD detection by focusing on discrete short-term sequential modeling, addressing the limitations of global attention mechanisms for this specific data type.

Abstract: Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by deficits in social communication and behavioral patterns. Eye movement data offers a non-invasive diagnostic tool for ASD detection, as it is inherently discrete and exhibits short-term temporal dependencies, reflecting localized gaze focus between fixation points. These characteristics enable the data to provide deeper insights into subtle behavioral markers, distinguishing ASD-related patterns from typical development. Eye movement signals mainly contain short-term and localized dependencies. However, despite the widespread application of stacked attention layers in Transformer-based models for capturing long-range dependencies, our experimental results indicate that this approach yields only limited benefits when applied to eye movement data. This may be because discrete fixation points and short-term dependencies in gaze focus reduce the utility of global attention mechanisms, making them less efficient than architectures focusing on local temporal patterns. To efficiently capture subtle and complex eye movement patterns, distinguishing ASD from typically developing (TD) individuals, a discrete short-term sequential (DSTS) modeling framework is designed with Class-aware Representation and Imbalance-aware Mechanisms. Through extensive experiments on several eye movement datasets, DSTS outperforms both traditional machine learning techniques and more sophisticated deep learning models.

[301] A Dual Pipeline Machine Learning Framework for Automated Multi Class Sleep Disorder Screening Using Hybrid Resampling and Ensemble Learning

Md Sultanul Islam Ovi, Muhsina Tarannum Munfa, Miftahul Alam Adib, Syed Sabbir Hasan

Main category: cs.LG

TL;DR: Dual pipeline ML framework achieves 98.67% accuracy for sleep disorder screening using statistical and wrapper-based pipelines with SMOTETomek resampling.

DetailsMotivation: Clinical sleep studies are resource-intensive and difficult to scale for population-level screening, creating a need for automated, non-invasive screening methods for sleep disorders like insomnia and sleep apnea.

Method: Dual pipeline framework with: 1) statistical pipeline using Mutual Information and Linear Discriminant Analysis for linear separability, and 2) wrapper-based pipeline using Boruta feature selection with autoencoder for non-linear representation learning. Hybrid SMOTETomek resampling addresses class imbalance.
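
The resampling step maps directly onto imbalanced-learn's SMOTETomek; a minimal sketch on synthetic stand-in data (the study's features, splits, and tuning differ):

```python
# Hybrid resampling with SMOTETomek on synthetic imbalanced data.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                           weights=[0.7, 0.2, 0.1], random_state=0)
X_res, y_res = SMOTETomek(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority classes rebalanced
```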

Result: Extra Trees and K Nearest Neighbors achieved 98.67% accuracy, outperforming recent baselines. Statistical testing (Wilcoxon Signed Rank Test) shows significant improvement, with inference latency below 400 milliseconds.

Conclusion: The dual pipeline design supports accurate and efficient automated screening for non-invasive sleep disorder risk stratification, offering a scalable solution for population-level screening.

Abstract: Accurate classification of sleep disorders, particularly insomnia and sleep apnea, is important for reducing long-term health risks and improving patient quality of life. However, clinical sleep studies are resource-intensive and are difficult to scale for population-level screening. This paper presents a Dual Pipeline Machine Learning Framework for multi-class sleep disorder screening using the Sleep Health and Lifestyle dataset. The framework consists of two parallel processing streams: a statistical pipeline that targets linear separability using Mutual Information and Linear Discriminant Analysis, and a wrapper-based pipeline that applies Boruta feature selection with an autoencoder for non-linear representation learning. To address class imbalance, we use the hybrid SMOTETomek resampling strategy. In experiments, Extra Trees and K Nearest Neighbors achieved an accuracy of 98.67%, outperforming recent baselines on the same dataset. Statistical testing using the Wilcoxon Signed Rank Test indicates that the improvement over baseline configurations is significant, and inference latency remains below 400 milliseconds. These results suggest that the proposed dual pipeline design supports accurate and efficient automated screening for non-invasive sleep disorder risk stratification.

[302] A New Family of Poisson Non-negative Matrix Factorization Methods Using the Shifted Log Link

Eric Weine, Peter Carbonetto, Rafael A. Irizarry, Matthew Stephens

Main category: cs.LG

TL;DR: Poisson NMF with shifted-log link function relaxes the additive combination assumption of standard Poisson NMF, allowing parts to combine more multiplicatively, with computational efficiency for sparse datasets.

DetailsMotivation: Standard Poisson NMF assumes parts combine additively, which may not be appropriate for all settings. The authors aim to relax this restrictive assumption to improve model flexibility and interpretability.

Method: Introduce Poisson NMF with shifted-log link function that has a single tuning parameter controlling the transition from additive to multiplicative combination of parts. Provide maximum likelihood fitting algorithm and approximation for computational efficiency on sparse datasets.
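
To make the link concrete, a minimal sketch assuming the parameterization lambda = a(exp(eta/a) - 1), the inverse of g(lambda) = a log(1 + lambda/a): as the shift a grows, the model tends to the additive link lambda ≈ eta, while small a makes the response increasingly multiplicative; the exact form used in the paper may differ:

```python
# Rates under a shifted-log link: lam = a*(exp(eta/a) - 1), eta = (L F)_{ij}.
import numpy as np

def rates(Lmat, F, a):
    return a * np.expm1(Lmat @ F / a)

rng = np.random.default_rng(0)
Lmat, F = rng.uniform(size=(5, 2)), rng.uniform(size=(2, 7))
print(np.allclose(rates(Lmat, F, 1e6), Lmat @ F, atol=1e-4))  # additive limit
```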

Result: The method is illustrated on various real datasets, showing that choice of link function substantively impacts results, and shifted-log link can improve interpretability compared to standard additive link.

Conclusion: The shifted-log link function provides a flexible extension to Poisson NMF that relaxes the additive combination assumption, offering improved interpretability in some settings while maintaining computational efficiency for sparse data.

Abstract: Poisson non-negative matrix factorization (NMF) is a widely used method to find interpretable “parts-based” decompositions of count data. While many variants of Poisson NMF exist, existing methods assume that the “parts” in the decomposition combine additively. This assumption may be natural in some settings, but not in others. Here we introduce Poisson NMF with the shifted-log link function to relax this assumption. The shifted-log link function has a single tuning parameter, and as this parameter varies the model changes from assuming that parts combine additively (i.e., standard Poisson NMF) to assuming that parts combine more multiplicatively. We provide an algorithm to fit this model by maximum likelihood, and also an approximation that substantially reduces computation time for large, sparse datasets (computations scale with the number of non-zero entries in the data matrix). We illustrate these new methods on a variety of real datasets. Our examples show how the choice of link function in Poisson NMF can substantively impact the results, and how in some settings the use of a shifted-log link function may improve interpretability compared with the standard, additive link.

[303] IIB-LPO: Latent Policy Optimization via Iterative Information Bottleneck

Huilin Deng, Hongchen Luo, Yue Zhu, Long Li, Zhuoyue Chen, Xinghao Zhao, Ming Li, Jihai Zhang, Mengchang Wang, Yang Cao, Yu Kang

Main category: cs.LG

TL;DR: IIB-LPO addresses exploration collapse in RLVR for LLM reasoning by shifting from token-level perturbations to topological branching of reasoning trajectories, using Information Bottleneck for trajectory filtering and self-reward.

DetailsMotivation: Existing RLVR methods for LLM reasoning suffer from exploration collapse due to semantic homogeneity of random rollouts. Current exploration techniques like global entropy regularization cause reward hacking and verbosity, while local token-selective updates struggle with pre-trained model biases.

Method: IIB-LPO (Latent Policy Optimization via Iterative Information Bottleneck) shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. It triggers latent branching at high-entropy states to diversify reasoning paths and employs the Information Bottleneck principle both as a trajectory filter and self-reward mechanism.

Result: Empirical results across four mathematical reasoning benchmarks show IIB-LPO achieves state-of-the-art performance, surpassing prior methods by up to 5.3% in accuracy and 7.4% in diversity metrics.

Conclusion: IIB-LPO effectively addresses exploration collapse in RLVR for LLM reasoning by introducing topological branching and Information Bottleneck principles, leading to improved accuracy and diversity in mathematical reasoning tasks.

Abstract: Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Model (LLM) reasoning have been hindered by a persistent challenge: exploration collapse. The semantic homogeneity of random rollouts often traps models in narrow, over-optimized behaviors. While existing methods leverage policy entropy to encourage exploration, they face inherent limitations. Global entropy regularization is susceptible to reward hacking, which can induce meaningless verbosity, whereas local token-selective updates struggle with the strong inductive bias of pre-trained models. To address this, we propose Latent Policy Optimization via Iterative Information Bottleneck (IIB-LPO), a novel approach that shifts exploration from statistical perturbation of token distributions to topological branching of reasoning trajectories. IIB-LPO triggers latent branching at high-entropy states to diversify reasoning paths and employs the Information Bottleneck principle both as a trajectory filter and a self-reward mechanism, ensuring concise and informative exploration. Empirical results across four mathematical reasoning benchmarks demonstrate that IIB-LPO achieves state-of-the-art performance, surpassing prior methods by margins of up to 5.3% in accuracy and 7.4% in diversity metrics.

[304] GlueNN: gluing patchwise analytic solutions with neural networks

Doyoung Kim, Donghee Lee, Hye-Sung Lee, Jiheon Lee, Jaeok Yi

Main category: cs.LG

TL;DR: A learning framework that replaces constant integration coefficients with scale-dependent functions to smoothly interpolate between asymptotic regimes, eliminating the need for arbitrary boundary matching in differential equations.

DetailsMotivation: Traditional patching methods for solving complex differential equations with scale-dependent terms often fail because approximate solutions break down near matching boundaries, requiring arbitrary matching procedures that may not reproduce correct solutions.

Method: Promote integration constants of asymptotic analytic solutions to scale-dependent functions, then constrain these coefficient functions with the original differential equation over the entire domain using a learning framework to obtain globally valid solutions.
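
Conceptually, the framework resembles the sketch below: the integration constant C of an asymptotic solution y(t) ≈ C e^{-t} is promoted to a small network C(t) trained against the full equation's residual over the whole domain; the toy ODE and training setup are illustrative only:

```python
# Toy "glue" training: promote the constant in y ~ C*exp(-t) to C(t) = net(t)
# and fit the full equation y' + y = 1/(1+t) with boundary condition y(0) = 1.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

def y(t):
    return net(t) * torch.exp(-t)   # ansatz with scale-dependent coefficient

for step in range(2000):
    t = (10 * torch.rand(256, 1)).requires_grad_()   # collocation points
    yt = y(t)
    dy, = torch.autograd.grad(yt.sum(), t, create_graph=True)
    residual = dy + yt - 1.0 / (1.0 + t)             # full-equation residual
    bc = (y(torch.zeros(1, 1)) - 1.0).pow(2).mean()  # boundary condition
    loss = residual.pow(2).mean() + bc
    opt.zero_grad()
    loss.backward()
    opt.step()
```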

Result: The framework accurately reproduces global solutions and outperforms conventional matching procedures in representative problems from chemical kinetics and cosmology.

Conclusion: The proposed learning approach provides a more robust alternative to traditional patching methods by enabling smooth interpolation between asymptotic regimes without arbitrary boundary matching.

Abstract: In many problems in physics and engineering, one encounters complicated differential equations with strongly scale-dependent terms for which exact analytical or numerical solutions are not available. A common strategy is to divide the domain into several regions (patches) and simplify the equation in each region. When approximate analytic solutions can be obtained in each patch, they are then matched at the interfaces to construct a global solution. However, this patching procedure can fail to reproduce the correct solution, since the approximate forms may break down near the matching boundaries. In this work, we propose a learning framework in which the integration constants of asymptotic analytic solutions are promoted to scale-dependent functions. By constraining these coefficient functions with the original differential equation over the domain, the network learns a globally valid solution that smoothly interpolates between asymptotic regimes, eliminating the need for arbitrary boundary matching. We demonstrate the effectiveness of this framework in representative problems from chemical kinetics and cosmology, where it accurately reproduces global solutions and outperforms conventional matching procedures.

[305] Auditing Fairness under Model Updates: Fundamental Complexity and Property-Preserving Updates

Ayoub Ajarra, Debabrota Basu

Main category: cs.LG

TL;DR: The paper studies group fairness auditing under adaptive model updates, proposing a PAC auditing framework with distribution-free bounds based on a novel combinatorial measure called SP dimension.

DetailsMotivation: Real-world machine learning models are frequently updated adaptively in response to changing environments, which complicates auditing for bias since updates can alter the model class while preserving certain properties of interest. This raises questions about what can be reliably audited under such shifts.

Method: The authors propose a generic framework for PAC auditing based on an Empirical Property Optimization (EPO) oracle. For statistical parity, they establish distribution-free auditing bounds characterized by the SP dimension, a novel combinatorial measure that captures the complexity of admissible strategic updates.

Result: The framework provides efficient estimation of auditing properties like group fairness using minimal labeled samples, with distribution-free auditing bounds for statistical parity based on the SP dimension. The approach naturally extends to other auditing objectives including prediction error and robust risk.

Conclusion: The work addresses the challenge of auditing machine learning models under adaptive updates by developing a principled framework that characterizes the information complexity of allowable updates and enables efficient auditing with theoretical guarantees, applicable to various fairness and performance metrics.

Abstract: As machine learning models become increasingly embedded in societal infrastructure, auditing them for bias is of growing importance. However, in real-world deployments, auditing is complicated by the fact that model owners may adaptively update their models in response to changing environments, such as financial markets. These updates can alter the underlying model class while preserving certain properties of interest, raising fundamental questions about what can be reliably audited under such shifts. In this work, we study group fairness auditing under arbitrary updates. We consider general shifts that modify the pre-audit model class while maintaining invariance of the audited property. Our goals are two-fold: (i) to characterize the information complexity of allowable updates, by identifying which strategic changes preserve the property under audit; and (ii) to efficiently estimate auditing properties, such as group fairness, using a minimal number of labeled samples. We propose a generic framework for PAC auditing based on an Empirical Property Optimization (EPO) oracle. For statistical parity, we establish distribution-free auditing bounds characterized by the SP dimension, a novel combinatorial measure that captures the complexity of admissible strategic updates. Finally, we demonstrate that our framework naturally extends to other auditing objectives, including prediction error and robust risk.

[306] Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces

Pattarawat Chormai, Ali Hashemi, Klaus-Robert Müller, Grégoire Montavon

Main category: cs.LG

TL;DR: SubDistill: A new distillation algorithm that selectively transfers only relevant components from teacher to student models for specific subtasks, outperforming existing layer-wise distillation methods.

DetailsMotivation: In practice, only a few classes and their intermediate concepts are often relevant for distillation, but existing distillation methods don't explicitly focus on these relevant subtasks. There's a gap in methods that selectively distill only the necessary components.

Method: SubDistill algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer, focusing on specific subtasks rather than the entire model.
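
In outline, subspace-restricted layer-wise distillation looks like the sketch below, with plain PCA on relevant-class activations as a stand-in for SubDistill's relevance-based component selection:

```python
# Match student to teacher only inside a task-relevant activation subspace.
import torch

def relevant_basis(teacher_acts, k):
    """Top-k principal directions of teacher activations on the subtask
    (a PCA stand-in for relevance-based component selection)."""
    centered = teacher_acts - teacher_acts.mean(0)
    _, _, V = torch.pca_lowrank(centered, q=k)
    return V[:, :k]                                    # (d, k) basis

def subspace_distill_loss(student_acts, teacher_acts, V):
    return ((student_acts @ V - teacher_acts @ V) ** 2).mean()

teacher = torch.randn(256, 64)
student = torch.randn(256, 64)
V = relevant_basis(teacher, k=8)
print(subspace_distill_loss(student, teacher, V))
```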

Result: Experiments on CIFAR-100 and ImageNet with Convolutional and Transformer models show SubDistill outperforms existing layer-wise distillation techniques on representative subtasks. Explainable AI analyses confirm distilled students more closely match teacher decision structures.

Conclusion: SubDistill effectively addresses the practical need for selective distillation of relevant components, demonstrating superior performance and better alignment with teacher model decision patterns compared to existing methods.

Abstract: Knowledge distillation involves transferring the predictive capabilities of large, high-performing AI models (teachers) to smaller models (students) that can operate in environments with limited computing power. In this paper, we address the scenario in which only a few classes and their associated intermediate concepts are relevant to distill. This scenario is common in practice, yet few existing distillation methods explicitly focus on the relevant subtask. To address this gap, we introduce ‘SubDistill’, a new distillation algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer. Experiments on CIFAR-100 and ImageNet with Convolutional and Transformer models demonstrate that SubDistill outperforms existing layer-wise distillation techniques on a representative set of subtasks. Our benchmark evaluations are complemented by Explainable AI analyses showing that our distilled student models more closely match the decision structure of the original teacher model.

[307] Prophet as a Reproducible Forecasting Framework: A Methodological Guide for Business and Financial Analytics

Sidney Shapiro, Burhanuddin Panvelwala

Main category: cs.LG

TL;DR: Prophet is evaluated as a reproducibility-enabling forecasting framework that balances interpretability, standardized workflows, and accessibility, compared against ARIMA variants and Random Forest.

DetailsMotivation: Reproducibility is a persistent challenge in forecasting research, especially in business/financial analytics where high-stakes decisions depend on forecasts. Traditional methods require manual tuning and are hard to replicate, while ML approaches lack interpretability and introduce stochasticity issues.

Method: Evaluates Prophet’s additive structure, open-source implementation, and standardized workflow for transparent forecasting. Uses public financial/retail datasets to compare Prophet with multiple ARIMA specifications (auto-selected, manual, seasonal) and Random Forest under controlled, documented experimental design. Includes concrete Python examples demonstrating workflow integration.
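
For concreteness, a minimal Prophet fit-and-forecast in Python, in the spirit of the workflow the paper documents; the CSV path is a placeholder, while `ds`/`y` are the column names Prophet requires.

```python
import pandas as pd
from prophet import Prophet

# Prophet expects a two-column frame: 'ds' (dates) and 'y' (values).
df = pd.read_csv("retail_sales.csv")  # placeholder dataset

m = Prophet(yearly_seasonality=True, weekly_seasonality=True)
m.fit(df)

future = m.make_future_dataframe(periods=90)  # forecast 90 days ahead
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```

Because the model is additive and the fit is largely deterministic given the data, the same script yields the same forecast across environments, which is the reproducibility property the paper emphasizes.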

Result: Prophet facilitates efficient forecasting workflows and integration with analytical pipelines. The multi-model comparison provides robust assessment of Prophet’s relative performance and reproducibility advantages.

Conclusion: Prophet serves as a methodological building block that supports verification, auditability, and methodological rigor in reproducible forecasting. Provides practical reference framework for reproducible forecasting in Python-based research workflows.

Abstract: Reproducibility remains a persistent challenge in forecasting research and practice, particularly in business and financial analytics where forecasts inform high-stakes decisions. Traditional forecasting methods, while theoretically interpretable, often require extensive manual tuning and are difficult to replicate in proprietary environments. Machine learning approaches offer predictive flexibility but introduce challenges related to interpretability, stochastic training procedures, and cross-environment reproducibility. This paper examines Prophet, an open-source forecasting framework developed by Meta, as a reproducibility-enabling solution that balances interpretability, standardized workflows, and accessibility. Rather than proposing a new algorithm, this study evaluates how Prophet’s additive structure, open-source implementation, and standardized workflow contribute to transparent and replicable forecasting practice. Using publicly available financial and retail datasets, we compare Prophet’s performance and interpretability with multiple ARIMA specifications (auto-selected, manually specified, and seasonal variants) and Random Forest under a controlled and fully documented experimental design. This multi-model comparison provides a robust assessment of Prophet’s relative performance and reproducibility advantages. Through concrete Python examples, we demonstrate how Prophet facilitates efficient forecasting workflows and integration with analytical pipelines. The study positions Prophet within the broader context of reproducible research. It highlights Prophet’s role as a methodological building block that supports verification, auditability, and methodological rigor. This work provides researchers and practitioners with a practical reference framework for reproducible forecasting in Python-based research workflows.

[308] On the Robustness of Age for Learning-Based Wireless Scheduling in Unknown Environments

Juaren Steiger, Bin Li

Main category: cs.LG

TL;DR: The paper proposes a new learning-based scheduling policy for constrained combinatorial multi-armed bandits that uses head-of-line age instead of virtual queue length, making it more robust to abrupt channel changes and constraint infeasibility.

DetailsMotivation: Existing algorithms for wireless scheduling under unknown channel conditions use virtual queue lengths to track constraint violations, but these can grow unbounded when channel conditions change abruptly, making constraints infeasible. The authors observe that head-of-line age dynamics are more robust for algorithm design.

Method: Design a learning-based scheduling policy that replaces virtual queue length with head-of-line age (the age of the oldest packet in the virtual queue) in the algorithm design for constrained combinatorial multi-armed bandit problems.
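
A toy illustration of the two statistics being contrasted; the bandit learning and scheduling decisions are omitted, and "packets" here abstractly represent units of constraint-violation debt.

```python
from collections import deque

class VirtualQueue:
    def __init__(self):
        self.arrivals = deque()  # arrival time of each unserved unit of debt

    def update(self, t, arrived, served):
        if arrived:
            self.arrivals.append(t)
        if served and self.arrivals:
            self.arrivals.popleft()  # oldest unit is served first

    def length(self):
        # Classical statistic: grows without bound while the constraint
        # is infeasible, since debt accumulates faster than it is served.
        return len(self.arrivals)

    def hol_age(self, t):
        # Head-of-line age: the wait of the single oldest unit. It resets
        # whenever that unit is served, so it recovers quickly once the
        # constraint becomes feasible again.
        return t - self.arrivals[0] if self.arrivals else 0
```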

Result: The proposed policy matches state-of-the-art performance under i.i.d. network conditions while maintaining system stability even under abrupt channel condition changes. It can rapidly recover from periods of constraint infeasibility where traditional approaches would fail.

Conclusion: Using head-of-line age instead of virtual queue length in learning-based scheduling algorithms provides superior robustness to network dynamics, preventing unbounded growth during constraint infeasibility while maintaining performance under normal conditions.

Abstract: The constrained combinatorial multi-armed bandit model has been widely employed to solve problems in wireless networking and related areas, including the problem of wireless scheduling for throughput optimization under unknown channel conditions. Most work in this area uses an algorithm design strategy that combines a bandit learning algorithm with the virtual queue technique to track the throughput constraint violation. These algorithms seek to minimize the virtual queue length in their algorithm design. However, in networks where channel conditions change abruptly, the resulting constraints may become infeasible, leading to unbounded growth in virtual queue lengths. In this paper, we make the key observation that the dynamics of the head-of-line age, i.e. the age of the oldest packet in the virtual queue, make it more robust when used in algorithm design compared to the virtual queue length. We therefore design a learning-based scheduling policy that uses the head-of-line age in place of the virtual queue length. We show that our policy matches state-of-the-art performance under i.i.d. network conditions. Crucially, we also show that the system remains stable even under abrupt changes in channel conditions and can rapidly recover from periods of constraint infeasibility.

[309] Community-Based Model Sharing and Generalisation: Anomaly Detection in IoT Temperature Sensor Networks

Sahibzada Saadoon Hammad, Joaquín Huerta Guijarro, Francisco Ramos, Michael Gould Carlson, Sergio Trilles Oliver

Main category: cs.LG

TL;DR: An anomaly detection framework for IoT sensor networks using Communities of Interest (CoIs) with fused similarity metrics and autoencoder models trained on representative stations.

DetailsMotivation: The rapid growth of IoT devices creates large-scale sensor networks that need efficient organization and anomaly detection. Communities of Interest provide a way to group heterogeneous IoT sensors with similar characteristics to enable scalable monitoring.

Method: 1. Group sensors into communities using fused similarity matrix (temporal correlations via Spearman coefficients, spatial proximity via Gaussian distance decay, elevation similarities). 2. Select representative stations based on best silhouette scores. 3. Train three autoencoder architectures (BiLSTM, LSTM, MLP) using Bayesian hyperparameter optimization with expanding window cross-validation. 4. Detect anomalies through reconstruction error analysis of normal temperature patterns.
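
A compact sketch of step 1 under illustrative choices; the fusion weights, length scale `sigma`, and elevation scale `h` are assumptions, not the paper's values.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import cdist

def fused_similarity(series, coords, elevation, w=(1/3, 1/3, 1/3),
                     sigma=10.0, h=100.0):
    # series: (n, T) temperature series; coords: (n, 2); elevation: (n,)
    rho, _ = spearmanr(series, axis=1)   # temporal rank correlations
    temporal = np.abs(rho)
    spatial = np.exp(-cdist(coords, coords) ** 2 / (2 * sigma ** 2))
    elev = np.exp(-np.abs(elevation[:, None] - elevation[None, :]) / h)
    return w[0] * temporal + w[1] * spatial + w[2] * elev
```

Communities can then be obtained by clustering this matrix, with silhouette scores guiding the choice of representative stations.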

Result: Robust within-community performance across evaluated configurations, with variations observed across different communities. The framework supports community-based model sharing to reduce computational overhead while analyzing model generalizability across IoT sensor networks.

Conclusion: The CoI-based anomaly detection framework is applicable for efficient IoT sensor network monitoring, enabling computational efficiency through model sharing within communities while providing insights into model generalization across different sensor groupings.

Abstract: The rapid deployment of Internet of Things (IoT) devices has led to large-scale sensor networks that monitor environmental and urban phenomena in real time. Communities of Interest (CoIs) provide a promising paradigm for organising heterogeneous IoT sensor networks by grouping devices with similar operational and environmental characteristics. This work presents an anomaly detection framework based on the CoI paradigm by grouping sensors into communities using a fused similarity matrix that incorporates temporal correlations via Spearman coefficients, spatial proximity using Gaussian distance decay, and elevation similarities. For each community, representative stations are selected based on the best silhouette scores, and three autoencoder architectures (BiLSTM, LSTM, and MLP) are trained using Bayesian hyperparameter optimization with expanding window cross-validation and tested on stations from the same cluster and the best representative stations of other clusters. The models are trained on normal temperature patterns of the data and anomalies are detected through reconstruction error analysis. Experimental results show robust within-community performance across the evaluated configurations, while variations across communities are observed. Overall, the results support the applicability of community-based model sharing in reducing computational overhead and in analysing model generalisability across IoT sensor networks.

[310] LookAroundNet: Extending Temporal Context with Transformers for Clinically Viable EEG Seizure Detection

Þór Sverrisson, Steinn Guðmundsson

Main category: cs.LG

TL;DR: LookAroundNet is a transformer-based seizure detector that uses extended temporal context (EEG signals before and after the segment of interest) to improve seizure detection performance across diverse clinical settings.

DetailsMotivation: Automated seizure detection from EEG is challenging due to large variability across patients, recording conditions, and clinical settings. Current methods often lack the contextual understanding that clinicians use when interpreting EEG recordings.

Method: Transformer-based seizure detector that incorporates EEG signals before and after the segment of interest, mimicking clinical practice. Evaluated on multiple EEG datasets including routine clinical EEG, long-term ambulatory recordings, and home EEG recordings to study performance across varying data distributions.
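
The "look around" amounts to widening each training window with pre- and post-segment context; a minimal slicing sketch follows (array shapes are assumptions, and the transformer itself is not shown).

```python
import numpy as np

def context_windows(eeg, seg_len, ctx_len):
    # eeg: (n_channels, n_samples). Each window carries ctx_len samples of
    # past and future context around the seg_len segment to be classified.
    windows = []
    for s in range(ctx_len, eeg.shape[1] - ctx_len - seg_len + 1, seg_len):
        windows.append(eeg[:, s - ctx_len : s + seg_len + ctx_len])
    return np.stack(windows)  # (n_windows, n_channels, seg_len + 2 * ctx_len)
```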

Result: LookAroundNet achieves strong performance across datasets, generalizes well to unseen recording conditions, and operates with computational costs compatible with real-world clinical deployment. Extended temporal context, increased training data diversity, and model ensembling are key factors for improved performance.

Conclusion: The work contributes to moving automatic seizure detection models toward clinically viable solutions by demonstrating the importance of extended temporal context and diverse training data for robust performance across different clinical environments.

Abstract: Automated seizure detection from electroencephalography (EEG) remains difficult due to the large variability of seizure dynamics across patients, recording conditions, and clinical settings. We introduce LookAroundNet, a transformer-based seizure detector that uses a wider temporal window of EEG data to model seizure activity. The seizure detector incorporates EEG signals before and after the segment of interest, reflecting how clinicians use surrounding context when interpreting EEG recordings. We evaluate the proposed method on multiple EEG datasets spanning diverse clinical environments, patient populations, and recording modalities, including routine clinical EEG and long-term ambulatory recordings, in order to study performance across varying data distributions. The evaluation includes publicly available datasets as well as a large proprietary collection of home EEG recordings, providing complementary views of controlled clinical data and unconstrained home-monitoring conditions. Our results show that LookAroundNet achieves strong performance across datasets, generalizes well to previously unseen recording conditions, and operates with computational costs compatible with real-world clinical deployment. The results indicate that extended temporal context, increased training data diversity, and model ensembling are key factors for improving performance. This work contributes to moving automatic seizure detection models toward clinically viable solutions.

[311] Utilising physics-guided deep learning to overcome data scarcity

Jinshuai Bai, Laith Alzubaidi, Qingxia Wang, Ellen Kuhl, Mohammed Bennamoun, Yuantong Gu

Main category: cs.LG

TL;DR: Physics-guided deep learning (PGDL) integrates physics laws with neural networks to overcome data scarcity challenges in fields like structural engineering and medical diagnosis, achieving better accuracy and generalization with limited data.

DetailsMotivation: Deep learning requires high-quality annotated datasets, but obtaining such data is challenging in real-world applications like structural risk estimation and medical diagnosis. This data scarcity creates a barrier to practical DL implementation in these fields.

Method: PGDL integrates physics laws into neural network training, allowing the models to leverage physical principles as additional information. This approach can be applied to any system governed by physics laws, including mechanics, finance, and medical applications.
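
A minimal PINN-style instance of the idea, assuming a toy 1-D PDE u''(x) = 0; the review covers far more general PGDL formulations than this single example.

```python
import torch

def physics_guided_loss(model, x_data, y_data, x_colloc, lam=1.0):
    # Supervised term on the scarce labeled data.
    data_loss = torch.mean((model(x_data) - y_data) ** 2)

    # Physics term: PDE residual at unlabeled collocation points, which
    # injects the extra information that compensates for data scarcity.
    x = x_colloc.clone().requires_grad_(True)
    u = model(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    return data_loss + lam * torch.mean(d2u ** 2)  # residual of u'' = 0
```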

Result: PGDL achieves high accuracy and generalization even with limited data, as demonstrated across various fields. The physics laws provide additional information that compensates for data scarcity.

Conclusion: PGDL offers a promising solution to data scarcity challenges in DL applications. The review provides a structured overview of PGDL applications, identifies current limitations and opportunities, and discusses future prospects for this approach.

Abstract: Deep learning (DL) relies heavily on data, and the quality of data influences its performance significantly. However, obtaining high-quality, well-annotated datasets can be challenging or even impossible in many real-world applications, such as structural risk estimation and medical diagnosis. This presents a significant barrier to the practical implementation of DL in these fields. Physics-guided deep learning (PGDL) is a novel type of DL that can integrate physics laws to train neural networks. This can be applied to any system that is controlled or governed by physics laws, such as mechanics, finance, and medical applications. It has been demonstrated that, with the additional information provided by physics laws, PGDL achieves high accuracy and generalisation in the presence of data scarcity. This review provides a detailed examination of PGDL and offers a structured overview of its use in addressing data scarcity across various fields, including physics, engineering and medical applications. Moreover, the review identifies the current limitations and opportunities for PGDL in relation to data scarcity and offers a thorough discussion on the future prospects of PGDL.

[312] Simple Mechanisms for Representing, Indexing and Manipulating Concepts

Yuanzhi Li, Raghu Meka, Rina Panigrahy, Kulin Shah

Main category: cs.LG

TL;DR: The paper proposes a mathematical framework for representing concepts as zero sets of polynomials, using moment statistics to create unique concept signatures that can discover hierarchical structures in data.

DetailsMotivation: Current deep learning approaches lack a formal mathematical framework for defining and operating on concepts, despite the hierarchical nature of latent generative processes in data. There's a need for a principled way to characterize concepts mathematically and discover their hierarchical relationships.

Method: Represent simple primitive concepts as zero sets of polynomial collections, use moment statistics of data to create unique concept signatures, maintain a dictionary of concepts, and recursively combine lower-level concept signatures to form higher-level concept signatures.
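
A bare-bones reading of "moment statistics as concept signatures"; the degree bound and distance-based matching below are illustrative choices, not the paper's construction.

```python
import numpy as np
from itertools import combinations_with_replacement

def moment_signature(samples, max_degree=2):
    # samples: (N, d) points drawn from a concept. The signature collects
    # the empirical monomial moments E[x_{i1} * ... * x_{ik}] up to the
    # degree bound, giving a vector that identifies the concept.
    n, d = samples.shape
    sig = []
    for deg in range(1, max_degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            sig.append(np.prod(samples[:, list(idx)], axis=1).mean())
    return np.array(sig)

# Concepts in a dictionary can then be matched by signature distance, e.g.
# np.linalg.norm(moment_signature(a) - moment_signature(b))
```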

Result: The proposed method can learn different types of hierarchical structures in data by discovering common structures across concepts and recursively building higher-level concepts from lower-level ones.

Conclusion: The framework provides a mathematical foundation for representing and operating on concepts, enabling systematic discovery of hierarchical structures in data through concept signatures and recursive composition.

Abstract: Supervised and unsupervised learning using deep neural networks typically aims to exploit the underlying structure in the training data; this structure is often explained using a latent generative process that produces the data, and the generative process is often hierarchical, involving latent concepts. Despite the significant work on understanding the learning of the latent structure and underlying concepts using theory and experiments, a framework that mathematically captures the definition of a concept and provides ways to operate on concepts is missing. In this work, we propose to characterize a simple primitive concept by the zero set of a collection of polynomials and use moment statistics of the data to uniquely represent the concepts; we show how this view can be used to obtain a signature of the concept. These signatures can be used to discover a common structure across the set of concepts and could recursively produce the signature of higher-level concepts from the signatures of lower-level concepts. To exploit these properties, we propose a method that maintains a dictionary of concepts and show that it can learn different types of hierarchical structure in the data.

[313] Dynamic and Adaptive Feature Generation with LLM

Xinhao Zhang, Jinghan Zhang, Banafsheh Rekabdar, Yuanchun Zhou, Pengfei Wang, Kunpeng Liu

Main category: cs.LG

TL;DR: LLM-based feature generation method improves interpretability, applicability, and flexibility over existing automated feature engineering approaches.

DetailsMotivation: Current automated feature engineering methods suffer from three key issues: lack of explainability, limited applicability across data types/tasks, and inflexible strategies, which hinder ML model deployment in varied scenarios.

Method: Novel approach using large language models (LLMs) with feature-generating prompts to create a dynamic and adaptive feature generation method that enhances interpretability.
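
One way such a loop could look; `call_llm` is a placeholder for any chat-completion client, and the prompt is invented for illustration rather than taken from the paper.

```python
import pandas as pd

PROMPT = """You are a feature-engineering assistant.
Dataset columns: {columns}. Prediction task: {task}.
Propose ONE new feature as a pandas DataFrame.eval expression over the
existing column names, answering in the form: EXPRESSION | EXPLANATION"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def generate_feature(df: pd.DataFrame, task: str) -> pd.DataFrame:
    reply = call_llm(PROMPT.format(columns=list(df.columns), task=task))
    expression, explanation = (p.strip() for p in reply.split("|", 1))
    out = df.copy()
    name = f"llm_feat_{len(out.columns)}"
    out[name] = out.eval(expression)
    # Keep the natural-language rationale as documentation; this is where
    # the interpretability benefit of LLM-generated features comes from.
    out.attrs[name] = explanation
    return out
```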

Result: The proposed approach significantly outperforms existing methods in experiments, demonstrating superior performance across various data types and tasks.

Conclusion: LLM-based feature generation addresses fundamental limitations of current automated feature engineering by improving explainability, broadening applicability, and offering strategic flexibility.

Abstract: The representation of feature space is a crucial environment where data points get vectorized and embedded for subsequent modeling. Thus the efficacy of machine learning (ML) algorithms is closely related to the quality of feature engineering. As one of the most important techniques, feature generation transforms raw data into an optimized feature space conducive to model training and further refines the space. Despite the advancements in automated feature engineering and feature generation, current methodologies often suffer from three fundamental issues: lack of explainability, limited applicability, and inflexible strategy. These shortcomings frequently hinder and limit the deployment of ML models across varied scenarios. Our research introduces a novel approach adopting large language models (LLMs) and feature-generating prompts to address these challenges. We propose a dynamic and adaptive feature generation method that enhances the interpretability of the feature generation process. Our approach broadens the applicability across various data types and tasks and offers advantages in strategic flexibility. A broad range of experiments showcases that our approach is significantly superior to existing methods.

[314] Explainable AI needs formalization

Stefan Haufe, Rick Wilming, Benedict Clark, Rustam Zhumagambetov, Ahcène Boubekki, Jörg Martin, Danny Panknin

Main category: cs.LG

TL;DR: Current XAI methods systematically fail to provide reliable explanations because they don’t address well-defined problems or use objective correctness criteria, limiting their utility for practical applications.

DetailsMotivation: The paper critiques the current state of explainable AI (XAI), arguing that despite addressing the need for human-understandable ML decisions, XAI methods themselves need scrutiny because they cannot reliably answer relevant questions about models, training data, or test inputs.

Method: The paper proposes a conceptual framework rather than a technical method: researchers should formally define the explanation problems they intend to solve and design methods accordingly, leading to use-case-dependent notions of explanation correctness and objective performance metrics.

Result: The analysis reveals that popular XAI methods systematically attribute importance to input features independent of prediction targets, fundamentally limiting their utility for diagnosing/correcting models, scientific discovery, and identifying intervention targets.

Conclusion: XAI needs a paradigm shift toward formally defined problems and objective evaluation criteria to become truly useful. Researchers should move beyond current flawed approaches by establishing clear problem definitions and validation metrics tailored to specific use cases.

Abstract: The field of “explainable artificial intelligence” (XAI) seemingly addresses the desire that decisions of machine learning systems should be human-understandable. However, in its current state, XAI itself needs scrutiny. Popular methods cannot reliably answer relevant questions about ML models, their training data, or test inputs, because they systematically attribute importance to input features that are independent of the prediction target. This limits the utility of XAI for diagnosing and correcting data and models, for scientific discovery, and for identifying intervention targets. The fundamental reason for this is that current XAI methods do not address well-defined problems and are not evaluated against targeted criteria of explanation correctness. Researchers should formally define the problems they intend to solve and design methods accordingly. This will lead to diverse use-case-dependent notions of explanation correctness and objective metrics of explanation performance that can be used to validate XAI algorithms.

[315] FedScalar: A Communication-Efficient Federated Learning

M. Rostami, S. S. Kia

Main category: cs.LG

TL;DR: FedScalar reduces FL communication overhead by having agents send only 2 scalars instead of high-dimensional vectors, using random projection encoding.

DetailsMotivation: Federated learning preserves privacy but suffers from high communication costs when agents send large model updates to the central server, limiting scalability.

Method: Agents encode model updates into a scalar via inner product with a random vector, send scalar + random seed to server. Server averages scalars and projects onto regenerated random vectors using seeds.
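
A numpy sketch of the encode/decode round trip; it follows one plausible reading of the description (server-side decoding via rank-1 projections) and is not necessarily the paper's exact aggregation.

```python
import numpy as np

def agent_message(local_delta, seed):
    # Each agent sends just two scalars: <delta, v> and its random seed.
    v = np.random.default_rng(seed).standard_normal(local_delta.shape[0])
    return float(local_delta @ v), seed

def server_aggregate(messages, dim):
    estimate = np.zeros(dim)
    for s, seed in messages:
        # Regenerate the agent's random vector from its seed and project.
        v = np.random.default_rng(seed).standard_normal(dim)
        estimate += s * v  # E[<d, v> v] = d for v ~ N(0, I), so unbiased
    return estimate / len(messages)
```

The variance of this rank-1 estimate is what the paper's Rademacher variant reduces relative to Gaussian vectors.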

Result: Achieves O(d/√K) convergence rate for smooth non-convex functions. Rademacher distribution reduces variance vs Gaussian. Numerical simulations confirm communication efficiency.

Conclusion: FedScalar significantly reduces FL communication overhead while maintaining convergence guarantees, with Rademacher distribution further improving performance.

Abstract: Federated learning (FL) has gained considerable popularity for distributed machine learning due to its ability to preserve the privacy of participating agents by eliminating the need for data aggregation. Nevertheless, communication costs between agents and the central server in FL are substantial in large-scale problems and remain a limiting factor for this algorithm. This paper introduces an innovative algorithm, called FedScalar, within the FL framework aimed at improving communication efficiency. Unlike traditional FL methods that require agents to send high-dimensional vectors to the server, FedScalar enables agents to communicate updates using only two scalar values. Each agent encodes its updated model parameters into a scalar via the inner product between its local update difference and a random vector, which is transmitted to the server along with the agent’s local random seed value. The server then averages the received scalar values and decodes the information by projecting the averaged scalar onto the regenerated random vector using the corresponding agent seed values. Our method thereby significantly reduces communication overhead. Technically, we demonstrate that the proposed algorithm achieves a convergence rate of $O(d/\sqrt{K})$ to a stationary point for smooth, non-convex loss functions. Additionally, our analysis shows that changing the underlying distribution of the random vector generated by the server from Gaussian to Rademacher distribution reduces the variance during the aggregation step of the algorithm. Finally, we validate the performance and communication efficiency of our algorithm with numerical simulations.

[316] Communication-Efficient Stochastic Distributed Learning

Xiaoxing Ren, Nicola Bastianello, Karl H. Johansson, Thomas Parisini

Main category: cs.LG

TL;DR: Novel distributed ADMM algorithm with local training steps and stochastic gradients for efficient distributed learning over networks, achieving convergence to neighborhood of stationary/optimal points with variance reduction for exact convergence.

DetailsMotivation: Address challenges in distributed learning over undirected networks: high communication costs and large datasets that make traditional distributed optimization inefficient.

Method: Design distributed ADMM algorithm with two key features: 1) multiple local training steps between communication rounds, 2) use of stochastic gradients for local computations. Also propose variance reduction variant for exact convergence.
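
A generic consensus-ADMM round sketching the two design features; step sizes and the stochastic gradient oracle `grad_batch` are placeholders, and the network-wide averaging stands in for the decentralized consensus step over the communication graph.

```python
import numpy as np

def local_update(x, z, u, grad_batch, steps=5, lr=0.01, rho=1.0):
    # Features (i) and (ii): several local steps, each using a stochastic
    # gradient of the local loss, approximate the ADMM x-minimization.
    for _ in range(steps):
        x = x - lr * (grad_batch(x) + rho * (x - z + u))
    return x

def admm_round(xs, us, z, grad_batches, rho=1.0):
    xs = [local_update(x, z, u, g, rho=rho)
          for x, u, g in zip(xs, us, grad_batches)]
    z = np.mean([x + u for x, u in zip(xs, us)], axis=0)  # one communication
    us = [u + x - z for x, u in zip(xs, us)]              # dual ascent
    return xs, us, z
```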

Result: Algorithm converges to neighborhood of stationary point for nonconvex problems and optimal point for convex problems. Variance reduction variant achieves exact convergence to stationary/optimal points. Local training accelerates convergence.

Conclusion: Proposed algorithm effectively addresses communication and computational challenges in distributed learning, outperforms state-of-the-art methods both theoretically and empirically through numerical comparisons.

Abstract: We address distributed learning problems, both nonconvex and convex, over undirected networks. In particular, we design a novel algorithm based on the distributed Alternating Direction Method of Multipliers (ADMM) to address the challenges of high communication costs, and large datasets. Our design tackles these challenges i) by enabling the agents to perform multiple local training steps between each round of communications; and ii) by allowing the agents to employ stochastic gradients while carrying out local computations. We show that the proposed algorithm converges to a neighborhood of a stationary point, for nonconvex problems, and of an optimal point, for convex problems. We also propose a variant of the algorithm to incorporate variance reduction thus achieving exact convergence. We show that the resulting algorithm indeed converges to a stationary (or optimal) point, and moreover that local training accelerates convergence. We thoroughly compare the proposed algorithms with the state of the art, both theoretically and through numerical results.

[317] LEKA: LLM-Enhanced Knowledge Augmentation

Xinhao Zhang, Jinghan Zhang, Fengran Mo, Dongjie Wang, Yanjie Fu, Kunpeng Liu

Main category: cs.LG

TL;DR: LEKA is a knowledge augmentation method that actively searches for suitable external knowledge sources to enrich target domains, enabling models to autonomously retrieve and transfer relevant knowledge rather than passively acquiring it.

DetailsMotivation: Humans excel at analogical learning and identifying appropriate knowledge sources for transfer. Current models lack this ability - they passively acquire knowledge but can't actively decide which knowledge to transfer. The challenge is teaching models which knowledge can be analogized and transferred, not just filling them with more information.

Method: LEKA extracts key information from target domain text, retrieves pertinent data from external knowledge libraries, and harmonizes the retrieved data with target domain data in both feature space and marginal probability measures. This creates an active knowledge retrieval and transfer system.
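
The harmonization step can be pictured with a CORAL-style moment-matching sketch; the paper's alignment of feature space and marginal probability measures is richer than this two-moment version.

```python
import numpy as np

def harmonize(retrieved, target, eps=1e-5):
    # Whiten the retrieved external data, then re-color it with the target
    # domain's covariance and mean, so both share first and second moments.
    def cov(a):
        return np.cov(a, rowvar=False) + eps * np.eye(a.shape[1])

    w = np.linalg.cholesky(np.linalg.inv(cov(retrieved)))  # whitening map
    c = np.linalg.cholesky(cov(target))                    # re-coloring map
    centered = retrieved - retrieved.mean(axis=0)
    return centered @ w @ c.T + target.mean(axis=0)
```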

Result: Extensive experiments across various domains show significant improvements over traditional methods in reducing computational costs, automating data alignment, and optimizing transfer learning outcomes.

Conclusion: LEKA successfully enables models to actively search for and transfer relevant knowledge, moving beyond passive knowledge acquisition to more human-like analogical learning capabilities.

Abstract: Humans excel in analogical learning and knowledge transfer and, more importantly, possess a unique understanding of identifying appropriate sources of knowledge. From a model’s perspective, this presents an interesting challenge. If models could autonomously retrieve knowledge useful for transfer or decision-making to solve problems, they would transition from passively acquiring to actively accessing and learning from knowledge. However, filling models with knowledge is relatively straightforward – it simply requires more training and accessible knowledge bases. The more complex task is teaching models about which knowledge can be analogized and transferred. Therefore, we design a knowledge augmentation method, LEKA, for knowledge transfer that actively searches for suitable knowledge sources that can enrich the target domain’s knowledge. This LEKA method extracts key information from the target domain’s textual information, retrieves pertinent data from external data libraries, and harmonizes retrieved data with the target domain data in feature space and marginal probability measures. We validate the effectiveness of our approach through extensive experiments across various domains and demonstrate significant improvements over traditional methods in reducing computational costs, automating data alignment, and optimizing transfer learning outcomes.

[318] Shortcuts and Identifiability in Concept-based Models from a Neuro-Symbolic Lens

Samuele Bortolotti, Emanuele Marconato, Paolo Morettin, Andrea Passerini, Stefano Teso

Main category: cs.LG

TL;DR: Concept-based Models suffer from reasoning shortcuts where they learn low-quality concepts despite having correct inference layers, undermining interpretability and OOD reliability.

DetailsMotivation: Concept-based Models aim to provide interpretable AI by learning high-level concepts and inference rules, but they often fail to produce reliable concepts that generalize out-of-distribution, raising concerns about their true interpretability and robustness.

Method: The authors establish a novel connection between Concept-based Models and reasoning shortcuts, extending RS theory to this complex setting. They derive theoretical conditions for identifying both concepts and inference layers, then empirically test existing methods with multiple mitigation strategies.

Result: Empirical results show reasoning shortcuts significantly impact Concept-based Models, and existing methods often fail to meet the theoretical conditions for reliable concept learning, even when combined with multiple mitigation strategies.

Conclusion: Current Concept-based Models are vulnerable to reasoning shortcuts that compromise their interpretability and out-of-distribution reliability, highlighting the need for new approaches that satisfy the theoretical conditions for robust concept learning.

Abstract: Concept-based Models are neural networks that learn a concept extractor to map inputs to high-level concepts and an inference layer to translate these into predictions. Ensuring these modules produce interpretable concepts and behave reliably in out-of-distribution settings is crucial, yet the conditions for achieving this remain unclear. We study this problem by establishing a novel connection between Concept-based Models and reasoning shortcuts (RSs), a common issue where models achieve high accuracy by learning low-quality concepts, even when the inference layer is fixed and provided upfront. Specifically, we extend RSs to the more complex setting of Concept-based Models and derive theoretical conditions for identifying both the concepts and the inference layer. Our empirical results highlight the impact of RSs and show that existing methods, even combined with multiple natural mitigation strategies, often fail to meet these conditions in practice.

[319] There are no Champions in Supervised Long-Term Time Series Forecasting

Lorenzo Brigato, Rafael Morand, Knut Strømmen, Maria Panagiotou, Markus Schmidt, Stavroula Mougiakakou

Main category: cs.LG

TL;DR: The paper critiques inconsistent benchmarking in long-term time series forecasting, showing that minor experimental changes can overturn claims of state-of-the-art performance, and calls for standardized evaluation practices.

DetailsMotivation: To address concerns about inconsistent benchmarking and reporting practices in long-term time series forecasting research, where the apparent rapid progress of increasingly complex models may rest on unreliable comparisons.

Method: Conducted a broad, thorough, and reproducible evaluation of top-performing supervised models on popular benchmarks and additional baselines, assessing 8 models on 14 datasets with ~5,000 trained networks for hyperparameter searches.

Result: Found that slight changes to experimental setups or evaluation metrics drastically shift the common belief that newly published results are advancing the state of the art, revealing benchmarking inconsistencies.

Conclusion: Research should shift focus from pursuing increasingly complex models toward enhancing benchmarking practices through rigorous, standardized evaluations with reproducible hyperparameter setups and statistical testing.

Abstract: Recent advances in long-term time series forecasting have introduced numerous complex supervised prediction models that consistently outperform previously published architectures. However, this rapid progression raises concerns regarding inconsistent benchmarking and reporting practices, which may undermine the reliability of these comparisons. In this study, we first perform a broad, thorough, and reproducible evaluation of the top-performing supervised models on the most popular benchmark and additional baselines representing the most active architecture families. This extensive evaluation assesses eight models on 14 datasets, encompassing $\sim$5,000 trained networks for the hyperparameter (HP) searches. Then, through a comprehensive analysis, we find that slight changes to experimental setups or current evaluation metrics drastically shift the common belief that newly published results are advancing the state of the art. Our findings emphasize the need to shift focus away from pursuing ever-more complex models, towards enhancing benchmarking practices through rigorous and standardized evaluations that enable more substantiated claims, including reproducible HP setups and statistical testing. We offer recommendations for future research.

[320] Variance Reduction Methods Do Not Need to Compute Full Gradients: Improved Efficiency through Shuffling

Daniil Medyakov, Gleb Molodtsov, Savelii Chezhegov, Alexey Rebrikov, Aleksandr Beznosikov

Main category: cs.LG

TL;DR: Proposes a memory-efficient variance reduction method that eliminates expensive full gradient computations using shuffling and SAG/SAGA techniques, achieving competitive convergence rates for non-convex objectives and improved rates under strong convexity.

DetailsMotivation: Stochastic optimization algorithms suffer from non-vanishing variance, while existing variance reduction methods like SVRG and SARAH require periodic full gradient computations which create computational bottlenecks and memory inefficiency.

Method: Combines shuffling heuristic with SAG/SAGA concepts to eliminate full gradient computations, creating a memory-efficient variance reduction approach that avoids the periodic full gradient bottleneck of traditional VR methods.
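
A SAGA-with-shuffling sketch showing why no full gradient pass is needed: a running table of last-seen per-sample gradients replaces the periodic snapshots of SVRG/SARAH. Details of the paper's variant may differ.

```python
import numpy as np

def saga_shuffled(grad_i, x0, n, lr=0.01, epochs=10):
    # grad_i(x, i): stochastic gradient of sample i at x.
    x = x0.copy()
    table = np.zeros((n, x0.shape[0]))   # last gradient seen per sample
    table_mean = table.mean(axis=0)
    for _ in range(epochs):
        for i in np.random.permutation(n):            # shuffling heuristic
            g = grad_i(x, i)
            x = x - lr * (g - table[i] + table_mean)  # variance-reduced step
            table_mean = table_mean + (g - table[i]) / n
            table[i] = g
    return x
```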

Result: For non-convex objectives, convergence rates match standard shuffling methods; under strong convexity, rates show improvement. Empirical validation demonstrates efficiency and scalability on large-scale tasks including CIFAR-10 and CIFAR-100 image classification.

Conclusion: The proposed approach successfully eliminates the expensive full gradient computation bottleneck of traditional variance reduction methods while maintaining competitive convergence performance, offering a memory-efficient solution for large-scale machine learning tasks.

Abstract: Stochastic optimization algorithms are widely used for machine learning with large-scale data. However, their convergence often suffers from non-vanishing variance. Variance Reduction (VR) methods, such as SVRG and SARAH, address this issue but introduce a bottleneck by requiring periodic full gradient computations. In this paper, we explore popular VR techniques and propose an approach that eliminates the necessity for expensive full gradient calculations. To avoid these computations and make our approach memory-efficient, we employ two key techniques: the shuffling heuristic and the concept of SAG/SAGA methods. For non-convex objectives, our convergence rates match those of standard shuffling methods, while under strong convexity, they demonstrate an improvement. We empirically validate the efficiency of our approach and demonstrate its scalability on large-scale machine learning tasks, including image classification on the CIFAR-10 and CIFAR-100 datasets.

[321] HiQ-Lip: A Hierarchical Quantum-Classical Method for Global Lipschitz Constant Estimation of ReLU Networks

Haoqi He, Yan Xiao, Wenzhi Xu, Ruoying Liu, Xiaokai Lin, Kai Wen

Main category: cs.LG

TL;DR: HiQ-Lip: A hybrid quantum-classical method that uses quantum computing to estimate neural network Lipschitz constants faster and more accurately than existing methods.

DetailsMotivation: Estimating global Lipschitz constants is important for understanding neural network robustness and generalization, but current methods (SDP-based) are computationally expensive, memory-intensive, and slow. There's a need for more efficient approaches.

Method: HiQ-Lip converts Lipschitz constant estimation into a Quadratic Unconstrained Binary Optimization (QUBO) problem, then uses a hybrid quantum-classical hierarchical approach with multilevel graph coarsening and refinement to adapt to current quantum hardware limitations.
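
The QUBO target has the standard form: minimize x^T Q x over binary x. Constructing Q from the network is the paper's contribution and is not reproduced here; the brute-force reference solver below only illustrates the objective that the quantum hardware (with graph coarsening and refinement) targets at larger scales.

```python
import numpy as np
from itertools import product

def solve_qubo(Q):
    # Minimize x^T Q x over x in {0, 1}^n; exhaustive search is a sanity
    # check that is only feasible for tiny instances.
    n = Q.shape[0]
    best_x, best_val = None, np.inf
    for bits in product((0, 1), repeat=n):
        x = np.array(bits)
        val = float(x @ Q @ x)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val
```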

Result: On fully connected neural networks, HiQ-Lip achieves comparable estimates to state-of-the-art methods while significantly accelerating computation. For two-layer networks with 256 hidden neurons, it doubles solving speed and provides more accurate upper bounds than LiPopt.

Conclusion: Small-scale quantum devices show promising utility for advancing neural network robustness estimation, with HiQ-Lip demonstrating practical quantum advantage in Lipschitz constant estimation tasks.

Abstract: Estimating the global Lipschitz constant of neural networks is crucial for understanding and improving their robustness and generalization capabilities. However, precise calculations are NP-hard, and current semidefinite programming (SDP) methods face challenges such as high memory usage and slow processing speeds. In this paper, we propose HiQ-Lip, a hybrid quantum-classical hierarchical method that leverages quantum computing to estimate the global Lipschitz constant. We tackle the estimation by converting it into a Quadratic Unconstrained Binary Optimization problem and implement a multilevel graph coarsening and refinement strategy to adapt to the constraints of contemporary quantum hardware. Our experimental evaluations on fully connected neural networks demonstrate that HiQ-Lip not only provides estimates comparable to state-of-the-art methods but also significantly accelerates the computation process. In specific tests involving two-layer neural networks with 256 hidden neurons, HiQ-Lip doubles the solving speed and offers more accurate upper bounds than the existing best method, LiPopt. These findings highlight the promising utility of small-scale quantum devices in advancing the estimation of neural network robustness.

[322] DynaMo: Runtime Switchable Quantization for MoE with Cross-Dataset Adaptation

Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, Xiang Chen

Main category: cs.LG

TL;DR: DynaMo is an end-to-end MoE quantization framework that uses expert-level mixed-precision quantization and channel-level dynamic switching to adapt quantized MoE models to multiple datasets with improved performance and speed.

DetailsMotivation: Existing quantization methods overlook expert dynamics in MoE architectures across multiple datasets, and static quantization cannot adapt MoE models to various data change scenarios as MoE models grow larger.

Method: Performs multi-level analysis of MoE dynamics, defines significance of channels/experts, then implements expert-level mixed-precision baseline quantization for multi-dataset compatibility, plus channel-level dynamic switching for adaptation to novel datasets.

Result: Achieves 2.78-4.54 PPL decrease and 1.85%-3.77% accuracy improvement across various datasets, with ~3x inference speedup and negligible overhead.

Conclusion: DynaMo effectively addresses MoE quantization challenges by combining expert-level mixed-precision quantization with dynamic channel switching, enabling efficient adaptation to multiple datasets while maintaining performance and speed.

Abstract: As the Mixture-of-Experts (MoE) architecture increases the number of parameters in large models, there is an even greater need for model quantization. However, existing quantization methods overlook the expert dynamics of MoE across multiple datasets. Moreover, the existing static quantization cannot adapt MoE to various data change scenarios. In this paper, we perform a multi-level analysis to reveal MoE dynamics and define the significance of each channel/each expert. Based on the analysis results, we propose \textit{DynaMo}, an end-to-end MoE quantization framework. DynaMo adopts an expert-level mixed-precision baseline quantization strategy, which ensures the quantized MoEs are compatible with multiple existing datasets. Furthermore, DynaMo incorporates a channel-level dynamic switching mechanism to adapt these quantized MoE models to novel datasets. Experiments show that DynaMo achieves a 2.78–4.54 PPL decrease and a 1.85%–3.77% accuracy improvement across various datasets, with ~3x inference speedup and negligible overhead.

[323] Evaluating machine learning models for predicting pesticide toxicity to honey bees

Jakub Adamczyk, Jakub Poziemski, Pawel Siedlecki

Main category: cs.LG

TL;DR: This paper introduces ApisTox, the most comprehensive dataset of experimentally validated chemical toxicity to honey bees, and evaluates various machine learning approaches for modeling agrochemical toxicity, revealing limitations of models trained only on biomedical data.

DetailsMotivation: There is a scarcity of agrochemical toxicity data, particularly species-specific toxicity data for ecologically important pollinators like honey bees, while current machine learning models are primarily trained on biomedical datasets and may not generalize well to agrochemical domains.

Method: The study uses the ApisTox dataset and evaluates diverse machine learning approaches including molecular fingerprints, graph kernels, graph neural networks, and pretrained models. It performs comparative analysis with medicinal datasets from the MoleculeNet benchmark to assess chemical space differences and model generalizability.
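
One of the evaluated model families, molecular fingerprints with a classical learner, can be sketched as follows; the SMILES strings and labels below are placeholders, not ApisTox records.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def featurize(smiles_list, radius=2, n_bits=2048):
    # ECFP-style Morgan fingerprints as binary feature vectors.
    X = np.zeros((len(smiles_list), n_bits), dtype=np.int8)
    for row, smi in enumerate(smiles_list):
        mol = Chem.MolFromSmiles(smi)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        DataStructs.ConvertToNumpyArray(fp, X[row])
    return X

smiles = ["CCO", "CC(=O)Oc1ccccc1C(=O)O", "c1ccccc1", "CCN(CC)CC"] * 10
labels = [0, 1, 0, 1] * 10  # placeholder toxicity labels

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(clf, featurize(smiles), labels, scoring="roc_auc").mean())
```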

Result: ApisTox represents a distinct chemical space compared to medicinal datasets. Performance degradation on non-medicinal datasets like ApisTox demonstrates limited generalizability of current state-of-the-art algorithms trained solely on biomedical data.

Conclusion: The study highlights the need for more diverse datasets and targeted model development specifically geared toward the agrochemical domain, as models trained on biomedical data alone do not adequately generalize to agrochemical toxicity prediction.

Abstract: Small molecules play a critical role in the biomedical, environmental, and agrochemical domains, each with distinct physicochemical requirements and success criteria. Although biomedical research benefits from extensive datasets and established benchmarks, agrochemical data remain scarce, particularly with respect to species-specific toxicity. This work focuses on ApisTox, the most comprehensive dataset of experimentally validated chemical toxicity to the honey bee (\textit{Apis mellifera}), an ecologically vital pollinator. The primary goal of this study was to determine the suitability of diverse machine learning approaches for modeling such toxicity, including molecular fingerprints, graph kernels, and graph neural networks, as well as pretrained models. Comparative analysis with medicinal datasets from the MoleculeNet benchmark reveals that ApisTox represents a distinct chemical space. Performance degradation on non-medicinal datasets, such as ApisTox, demonstrates the limited generalizability of current state-of-the-art algorithms trained solely on biomedical data. Our study highlights the need for more diverse datasets and for targeted model development geared toward the agrochemical domain.

[324] Advanced Long-term Earth System Forecasting

Hao Wu, Yuan Gao, Ruijian Gou, Xian Wu, Chuhan Wu, Huahui Yi, Johannes Brandstetter, Fan Xu, Kun Wang, Penghao Zhao, Hao Jia, Qi Song, Xinliang Liu, Juncai He, Shuhao Cao, Huanshuo Dong, Yanfei Xiang, Fan Zhang, Haixin Wang, Xingjian Shi, Qiufeng Wang, Shuaipeng Li, Ruobing Xie, Feng Tao, Yuxu Lu, Yu Guo, Yuntian Chen, Yuxuan Liang, Qingsong Wen, Wanli Ouyang, Deliang Chen, Niklas Boers, Xiaomeng Huang

Main category: cs.LG

TL;DR: TritonCast is a novel AI architecture for Earth system forecasting that uses a nested latent dynamical core to achieve unprecedented long-term stability and cross-resolution generalization.

DetailsMotivation: Current AI models for Earth system forecasting suffer from instability during extended autoregressive simulations due to spectral bias, which leads to inadequate representation of high-frequency, small-scale processes and uncontrolled error amplification.

Method: TritonCast uses a nested grid architecture inspired by numerical models. It features a dedicated latent dynamical core that ensures long-term stability of macro-evolution at coarse scales, with an outer structure that fuses this stable trend with fine-grained local details to mitigate spectral bias from cross-scale interactions.

Result: Achieves state-of-the-art accuracy on WeatherBench 2, demonstrates exceptional long-term stability with year-long autoregressive global forecasts and multi-year climate simulations spanning 2500 days without drift. In oceanography, extends skillful eddy forecast to 120 days and shows unprecedented zero-shot cross-resolution generalization.

Conclusion: TritonCast offers a promising pathway toward trustworthy AI-driven Earth system simulations, with potential to accelerate discovery in climate science through more reliable long-term forecasting and deeper insights into complex geophysical dynamics.

Abstract: Reliable long-term forecasting of Earth system dynamics is fundamentally limited by instabilities in current artificial intelligence (AI) models during extended autoregressive simulations. These failures often originate from inherent spectral bias, leading to inadequate representation of critical high-frequency, small-scale processes and subsequent uncontrolled error amplification. Inspired by the nested grids in numerical models used to resolve small scales, we present TritonCast. At the core of its design is a dedicated latent dynamical core, which ensures the long-term stability of the macro-evolution at a coarse scale. An outer structure then fuses this stable trend with fine-grained local details. This design effectively mitigates the spectral bias caused by cross-scale interactions. In atmospheric science, it achieves state-of-the-art accuracy on the WeatherBench 2 benchmark while demonstrating exceptional long-term stability: executing year-long autoregressive global forecasts and completing multi-year climate simulations that span the entire available $2500$-day test period without drift. In oceanography, it extends skillful eddy forecast to $120$ days and exhibits unprecedented zero-shot cross-resolution generalization. Ablation studies reveal that this performance stems from the synergistic interplay of the architecture’s core components. TritonCast thus offers a promising pathway towards a new generation of trustworthy, AI-driven simulations. This significant advance has the potential to accelerate discovery in climate and Earth system science, enabling more reliable long-term forecasting and deeper insights into complex geophysical dynamics.

[325] Let Me Think! A Long Chain-of-Thought Can Be Worth Exponentially Many Short Ones

Parsa Mirtaheri, Ezra Edelman, Samy Jelassi, Eran Malach, Enric Boix-Adsera

Main category: cs.LG

TL;DR: The paper shows that sequential scaling (longer chains of thought) can provide exponential advantages over parallel scaling (majority voting across multiple short chains) for certain graph connectivity reasoning tasks.

DetailsMotivation: While inference-time computation is promising for improving LLM reasoning, optimal allocation between sequential vs parallel scaling remains poorly understood. The paper aims to illuminate this landscape by identifying reasoning settings where sequential scaling offers exponential advantages.

Method: Theoretical analysis using graph connectivity problems on challenging graph distributions to demonstrate exponential advantages of sequential scaling. Experimental validation across various language models including models trained from scratch for graph connectivity with different chain of thought strategies and large reasoning models.

Result: The paper demonstrates the existence of reasoning settings where sequential scaling offers exponential advantages over parallel scaling. Comprehensive experiments validate these theoretical findings across multiple language models.

Conclusion: Sequential scaling (longer chains of thought) can provide exponential benefits over parallel scaling for certain reasoning tasks, particularly graph connectivity problems, highlighting the importance of understanding optimal inference-time computation allocation.

Abstract: Inference-time computation has emerged as a promising scaling axis for improving large language model reasoning. However, despite yielding impressive performance, the optimal allocation of inference-time computation remains poorly understood. A central question is whether to prioritize sequential scaling (e.g., longer chains of thought) or parallel scaling (e.g., majority voting across multiple short chains of thought). In this work, we seek to illuminate the landscape of test-time scaling by demonstrating the existence of reasoning settings where sequential scaling offers an exponential advantage over parallel scaling. These settings are based on graph connectivity problems in challenging distributions of graphs. We validate our theoretical findings with comprehensive experiments across a range of language models, including models trained from scratch for graph connectivity with different chain of thought strategies as well as large reasoning models.

[326] Machine learning for in-situ composition mapping in a self-driving magnetron sputtering system

Sanna Jarl, Jens Sjölund, Robert J. W. Frost, Anders Holst, Jonathan J. S. Scragg

Main category: cs.LG

TL;DR: ML-driven self-driving lab for magnetron co-sputtering uses Gaussian processes and active learning to predict composition maps of multi-element thin films without calibration, accelerating materials discovery.

DetailsMotivation: Current self-driving labs in thin film science are limited to solution-based methods, which cannot access the broad chemical space of inorganic materials. Magnetron sputtering offers access to diverse materials but requires time-consuming ex-situ analysis prone to errors.

Method: Developed an ML approach using Gaussian processes with active learning to predict deposition rates from quartz-crystal microbalance sensors. Combines learned sensor readings with geometric flux distribution models to interpolate deposition rates across the sample. Tested multiple acquisition functions, with Bayesian active learning MacKay (BALM) performing best.
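
A simplified uncertainty-sampling stand-in for the active-learning loop; the paper's best acquisition, BALM, is fully Bayesian over kernel hyperparameters, whereas plain maximum predictive variance is used here for brevity.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def next_setting(X_seen, y_seen, candidates):
    # X: (pressure, power) settings; y: QCM deposition-rate readings.
    gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                                  normalize_y=True)
    gp.fit(X_seen, y_seen)
    _, std = gp.predict(candidates, return_std=True)
    return candidates[np.argmax(std)]  # query where the GP is least certain
```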

Result: BALM achieved best performance, learning deposition rates for a single source in only 10 experiments. Prediction accuracy for co-sputtering composition distributions was experimentally verified. The framework dramatically increases throughput by eliminating extensive characterization or calibration.

Conclusion: The ML-guided self-driving lab demonstrates potential to accelerate materials exploration by enabling rapid, calibration-free composition mapping for magnetron co-sputtering, overcoming limitations of current SDLs and expanding accessible chemical space.

Abstract: Self-driving labs (SDLs), employing automation and machine learning (ML) to accelerate experimental procedures, have enormous potential in the discovery of new materials. However, in thin film science, SDLs are mainly restricted to solution-based synthetic methods, which are easier to automate but cannot access the broad chemical space of inorganic materials. This work presents an SDL based on magnetron co-sputtering. We use combinatorial frameworks to obtain accurate composition maps of multi-element, compositionally graded thin films, which normally requires time-consuming ex-situ analysis prone to systematic errors. We present a rapid and calibration-free in-situ, ML-driven approach to produce composition maps for arbitrary source combinations and sputtering conditions. We develop a method to predict the composition distribution in a multi-element combinatorial thin film, using in-situ measurements from quartz-crystal microbalance sensors placed in a sputter chamber. For a given source, the sensor readings are learned as a function of the sputtering pressure and magnetron power, through active learning using Gaussian processes (GPs). The final GPs are combined with a geometric model of the deposition flux distribution in the chamber, which allows interpolation of the deposition rates from each source, at any position across the sample. We investigate several acquisition functions for the ML procedure. A fully Bayesian GP, BALM (Bayesian active learning MacKay), achieved the best performance, learning the deposition rates for a single source in 10 experiments. Prediction accuracy for co-sputtering composition distributions was verified experimentally. Our framework dramatically increases throughput by avoiding the need for extensive characterisation or calibration, thus demonstrating the potential of ML-guided SDLs to accelerate materials exploration.

[327] Generative or Discriminative? Revisiting Text Classification in the Era of Transformers

Siva Rajesh Kasa, Karan Gupta, Sumegh Roychowdhury, Ashutosh Kumar, Yaswanth Biruduraju, Santhosh Kumar Kasa, Nikhil Priyatam Pattisapu, Arindam Bhattacharya, Shailendra Agarwal, Vijay huddar

Main category: cs.LG

TL;DR: This paper compares modern generative vs discriminative classifiers in the transformer era, analyzing their trade-offs in accuracy, sample efficiency, calibration, and robustness across different architectures and training paradigms.

DetailsMotivation: The motivation is to extend the classical comparison between discriminative and generative classifiers (dating back to Efron's work on logistic regression vs discriminant analysis) to modern transformer architectures, as the trade-offs between sample complexity and asymptotic error remain unexplored in contemporary deep learning settings.

Method: The authors conduct a comprehensive evaluation of modern generative and discriminative architectures including Auto-regressive modeling, Masked Language Modeling, Discrete Diffusion, and Encoders for text classification tasks. They analyze multiple dimensions beyond just accuracy.

Result: The study reveals that the classical ‘two regimes’ phenomenon (generative classifiers having lower sample complexity but higher asymptotic error) manifests distinctly across different architectures and training paradigms. The analysis covers sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios.

Conclusion: The findings provide practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations, bridging classical statistical theory with modern deep learning practice.

Abstract: The comparison between discriminative and generative classifiers has intrigued researchers since Efron’s seminal analysis of logistic regression versus discriminant analysis. While early theoretical work established that generative classifiers exhibit lower sample complexity but higher asymptotic error in simple linear settings, these trade-offs remain unexplored in the transformer era. We present the first comprehensive evaluation of modern generative and discriminative architectures (Auto-regressive modeling, Masked Language Modeling, Discrete Diffusion, and Encoders) for text classification. Our study reveals that the classical ‘two regimes’ phenomenon manifests distinctly across different architectures and training paradigms. Beyond accuracy, we analyze sample efficiency, calibration, noise robustness, and ordinality across diverse scenarios. Our findings offer practical guidance for selecting the most suitable modeling approach based on real-world constraints such as latency and data limitations.

[328] Bayesian BiLO: Bilevel Local Operator Learning for Efficient Uncertainty Quantification of Bayesian PDE Inverse Problems with Low-Rank Adaptation

Ray Zirui Zhang, Christopher E. Miles, Xiaohui Xie, John S. Lowengrub

Main category: cs.LG

TL;DR: B-BiLO: Bilevel Local Operator Learning framework for Bayesian inference in PDEs that combines Hamiltonian Monte Carlo sampling with neural operator fine-tuning via LoRA for efficient uncertainty quantification without synthetic data or adjoints.

DetailsMotivation: Uncertainty quantification in PDE inverse problems is essential for applications like medical imaging and clinical insights. Scientific machine learning enables data-driven learning while preserving physical structure, but existing methods face challenges with scalability, computational cost, and high-dimensional sampling.

Method: Bilevel Local Operator Learning (B-BiLO): Upper level uses Hamiltonian Monte Carlo to sample parameters from posterior; lower level fine-tunes neural network via low-rank adaptation (LoRA) to approximate solution operator locally. This avoids synthetic data, adjoint equations, and high-dimensional weight space sampling by optimizing weights deterministically.

Result: The framework enables efficient gradient-based sampling; the authors analyze errors from approximate lower-level optimization and establish their impact on posterior accuracy. Numerical experiments across PDE models (including tumor growth) demonstrate accurate and efficient uncertainty quantification.

Conclusion: B-BiLO provides an effective framework for Bayesian inference in PDEs that combines the strengths of scientific machine learning with efficient sampling techniques, enabling scalable uncertainty quantification for emerging imaging technologies and clinical applications.

Abstract: Uncertainty quantification in PDE inverse problems is essential in many applications. Scientific machine learning and AI enable data-driven learning of model components while preserving physical structure, and provide the scalability and adaptability needed for emerging imaging technologies and clinical insights. We develop a Bilevel Local Operator Learning framework for Bayesian inference in PDEs (B-BiLO). At the upper level, we sample parameters from the posterior via Hamiltonian Monte Carlo, while at the lower level we fine-tune a neural network via low-rank adaptation (LoRA) to approximate the solution operator locally. B-BiLO enables efficient gradient-based sampling without synthetic data or adjoint equations and avoids sampling in high-dimensional weight space, as in Bayesian neural networks, by optimizing weights deterministically. We analyze errors from approximate lower-level optimization and establish their impact on posterior accuracy. Numerical experiments across PDE models, including tumor growth, demonstrate that B-BiLO achieves accurate and efficient uncertainty quantification.

[329] Confidence-gated training for efficient early-exit neural networks

Saad Mokssit, Ouassim Karrakchou, Alejandro Mousist, Mounir Ghogho

Main category: cs.LG

TL;DR: CGT (Confidence-Gated Training) improves early-exit neural networks by conditionally propagating gradients from deeper exits only when earlier exits fail, aligning training with inference to reduce gradient interference and improve efficiency.

DetailsMotivation: Early-exit neural networks reduce inference cost but suffer from gradient interference during joint training, where deeper classifiers dominate optimization, causing shallow exits to underperform and reducing overall efficiency.

Method: Confidence-Gated Training (CGT) conditionally propagates gradients from deeper exits only when preceding exits fail to make confident predictions. This aligns training with the inference-time policy where shallow classifiers act as primary decision points and deeper layers handle harder inputs.
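
A minimal PyTorch sketch of the gating idea for a two-exit model; the max-softmax confidence measure and threshold tau are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def cgt_loss(shallow_logits, deep_logits, targets, tau=0.8):
    loss_shallow = F.cross_entropy(shallow_logits, targets)
    with torch.no_grad():
        conf = shallow_logits.softmax(dim=-1).max(dim=-1).values
        gate = (conf < tau).float()          # 1 where the shallow exit fails
    per_sample_deep = F.cross_entropy(deep_logits, targets, reduction="none")
    # The deep exit trains only on the gated (hard) samples, matching inference.
    loss_deep = (gate * per_sample_deep).sum() / gate.sum().clamp(min=1.0)
    return loss_shallow + loss_deep
```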

Result: Experiments on Indian Pines and Fashion-MNIST benchmarks show CGT lowers average inference cost while improving overall accuracy, mitigating overthinking and improving early-exit performance.

Conclusion: CGT offers a practical solution for deploying deep models in resource-constrained environments by better aligning training with inference dynamics, reducing gradient interference, and improving both efficiency and accuracy in early-exit networks.

Abstract: Early-exit neural networks reduce inference cost by enabling confident predictions at intermediate layers. However, joint training often leads to gradient interference, with deeper classifiers dominating optimization. We propose Confidence-Gated Training (CGT), a paradigm that conditionally propagates gradients from deeper exits only when preceding exits fail. This encourages shallow classifiers to act as primary decision points while reserving deeper layers for harder inputs. By aligning training with the inference-time policy, CGT mitigates overthinking, improves early-exit accuracy, and preserves efficiency. Experiments on the Indian Pines and Fashion-MNIST benchmarks show that CGT lowers average inference cost while improving overall accuracy, offering a practical solution for deploying deep models in resource-constrained environments.

[330] MO-GRPO: Mitigating Reward Hacking of Group Relative Policy Optimization on Multi-Objective Problems

Yuki Ichihara, Yuu Jinnai, Tetsuro Morimura, Mitsuki Sakamoto, Ryota Mitsuhashi, Eiji Uchibe

Main category: cs.LG

TL;DR: MO-GRPO extends GRPO with automatic reward normalization for multi-objective RL, preventing reward hacking by ensuring balanced contributions from all objectives.

DetailsMotivation: GRPO requires accurate reward models which are often unavailable in real-world tasks, and is vulnerable to reward hacking in multi-objective settings where it may optimize one objective at the expense of others.

Method: MO-GRPO extends GRPO with a simple normalization method that automatically reweights reward functions according to their value variances, ensuring all rewards contribute evenly to the loss while preserving preference order.
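
As we read it, the normalization divides each reward component by its within-group standard deviation before combining, so every objective contributes at the same scale; a short sketch (any clipping or the paper's exact advantage estimator is not shown):

```python
import torch

def mo_grpo_advantage(rewards, eps=1e-8):
    # rewards: (group_size, n_objectives) raw scores for one prompt's rollout group
    std = rewards.std(dim=0, keepdim=True)          # per-objective scale
    combined = (rewards / (std + eps)).sum(dim=-1)  # scale-free combination
    return combined - combined.mean()               # GRPO group baseline
```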

Result: MO-GRPO achieves stable learning by evenly distributing correlations among reward components, outperforming GRPO across four domains: multi-armed bandits, simulated control tasks, WMT machine translation benchmarks, and instruction following tasks.

Conclusion: MO-GRPO is a promising algorithm for multi-objective reinforcement learning that eliminates manual reward scaling and prevents reward hacking through automatic normalization.

Abstract: Group Relative Policy Optimization (GRPO) has been shown to be an effective algorithm when an accurate reward model is available. However, such a highly reliable reward model is not available in many real-world tasks. In this paper, we particularly focus on multi-objective settings, in which we identify that GRPO is vulnerable to reward hacking, optimizing only one of the objectives at the cost of the others. To address this issue, we propose MO-GRPO, an extension of GRPO with a simple normalization method to reweight the reward functions automatically according to the variances of their values. We first show analytically that MO-GRPO ensures that all reward functions contribute evenly to the loss function while preserving the order of preferences, eliminating the need for manual tuning of the reward functions’ scales. Then, we evaluate MO-GRPO experimentally in four domains: (i) the multi-armed bandits problem, (ii) simulated control task (Mo-Gymnasium), (iii) machine translation tasks on the WMT benchmark (En-Ja, En-Zh), and (iv) instruction following task. MO-GRPO achieves stable learning by evenly distributing correlations among the components of rewards, outperforming GRPO, showing MO-GRPO to be a promising algorithm for multi-objective reinforcement learning problems.

[331] SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts

Bingshuai Liu, Ante Wang, Zijun Min, Liang Yao, Haibo Zhang, Yang Liu, Anxiang Zeng, Jinsong Su

Main category: cs.LG

TL;DR: SPEC-RL accelerates RL training for LLMs by reusing overlapping trajectory segments from previous epochs via speculative decoding, reducing rollout time 2-3x without quality loss.

DetailsMotivation: Current RL training for LLMs is bottlenecked by expensive rollout computation. Existing acceleration methods have limitations: parallelization has diminishing returns, objective/data modifications introduce bias, and replay buffers overlook redundancy across iterations. The key insight is that rollouts from consecutive epochs often share overlapping segments, wasting computation.

Method: SPEC-RL integrates speculative decoding with RL rollout process. It reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism. This avoids redundant generation while ensuring policy consistency. The approach is purely a rollout-stage enhancement that works with mainstream RL algorithms like PPO, GRPO, and DAPO.
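
A toy sketch of the prefix-reuse step, using greedy agreement with the current policy as a simplified stand-in for the paper's verify rule:

```python
import numpy as np

def reuse_prefix(old_tokens, policy_logits):
    # old_tokens: (T,) ids from the previous epoch's rollout
    # policy_logits: (T, V) current-policy logits at those positions
    agree = policy_logits.argmax(axis=-1) == old_tokens
    n_keep = len(old_tokens) if agree.all() else int(np.argmin(agree))
    return old_tokens[:n_keep]  # generation resumes after the verified prefix
```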

Result: Experiments on diverse benchmarks (AIME24, MATH-500, OlympiadBench, MMLU-STEM, etc.) show SPEC-RL reduces rollout time by 2-3x without compromising policy quality. The method demonstrates effectiveness across various math reasoning and generalization tasks.

Conclusion: SPEC-RL provides a general and practical solution to scale RL with verifiable rewards for large reasoning models. By eliminating redundant computation through speculative decoding, it offers significant speedup while maintaining policy quality, making RL training more efficient for LLMs.

Abstract: Large Language Models (LLMs) increasingly rely on reinforcement learning with verifiable rewards (RLVR) to elicit reliable chain-of-thought reasoning. However, the training process remains bottlenecked by the computationally expensive rollout stage. Existing acceleration methods, such as parallelization, objective- and data-driven modifications, and replay buffers, either incur diminishing returns, introduce bias, or overlook redundancy across iterations. We identify that rollouts from consecutive training epochs frequently share a large portion of overlapping segments, wasting computation. To address this, we propose SPEC-RL, a novel framework that integrates SPECulative decoding with the RL rollout process. SPEC-RL reuses prior trajectory segments as speculative prefixes and extends them via a draft-and-verify mechanism, avoiding redundant generation while ensuring policy consistency. Experiments on diverse math reasoning and generalization benchmarks, including AIME24, MATH-500, OlympiadBench, MMLU-STEM, and others, demonstrate that SPEC-RL reduces rollout time by 2-3x without compromising policy quality. As a purely rollout-stage enhancement, SPEC-RL integrates seamlessly with mainstream algorithms (e.g., PPO, GRPO, DAPO), offering a general and practical path to scale RLVR for large reasoning models. Our code is available at https://github.com/ShopeeLLM/Spec-RL

[332] Sample-Efficient Differentially Private Fine-Tuning via Gradient Matrix Denoising

Ali Dadsetan, Frank Rudzicz

Main category: cs.LG

TL;DR: A post-processing algorithm using random matrix theory to denoise DP-SGD gradients, improving sample efficiency in private LLM fine-tuning without compromising privacy.

DetailsMotivation: DP-SGD's added noise increases gradient matrix entropy, disrupts low-rank structure, and slows optimization, creating sample efficiency challenges in private LLM fine-tuning.

Method: Proposes a post-processing algorithm leveraging random matrix theory to denoise gradients, restore low-rank structure, and improve alignment with original signal during DP-SGD fine-tuning.
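
To illustrate the flavor of such a post-processing step, here is a sketch that hard-thresholds the singular values of a privatized gradient matrix at a Marchenko-Pastur-style noise edge; the threshold constant and the noise scale sigma are assumptions, not the paper's exact recovery procedure:

```python
import numpy as np

def denoise_gradient(G_noisy, sigma):
    m, n = G_noisy.shape
    U, s, Vt = np.linalg.svd(G_noisy, full_matrices=False)
    bulk_edge = sigma * (np.sqrt(m) + np.sqrt(n))  # edge of the noise spectrum
    s_clean = np.where(s > bulk_edge, s, 0.0)      # hard-threshold singular values
    return (U * s_clean) @ Vt                      # low-rank denoised gradient
```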

Result: Applied to DP-SGD fine-tuning of RoBERTa on GLUE tasks, the method improves sample efficiency compared to state-of-the-art approaches and substantially reduces training time when optimal performance isn’t required.

Conclusion: Matrix recovery techniques can enhance utility of private language model training without compromising privacy guarantees, demonstrating practical improvements in sample efficiency.

Abstract: We address the challenge of sample efficiency in differentially private fine-tuning of large language models (LLMs) using DP-SGD. While DP-SGD provides strong privacy guarantees, the added noise significantly increases the entropy of gradient matrices, disrupting their low-rank structure and slowing optimization. We propose a post-processing algorithm that leverages random matrix theory to denoise gradients, restore low-rank structure, and improve alignment with the original signal. Applied to DP-SGD fine-tuning of RoBERTa on GLUE tasks, our method improves sample efficiency compared to state-of-the-art approaches, substantially reducing training time when optimal performance is not required. This work demonstrates that matrix recovery techniques can enhance the utility of private language model training without compromising privacy guarantees.

[333] Topological Signatures of ReLU Neural Network Activation Patterns

Vicente Bosca, Tatum Rask, Sunia Tanweer, Andrew R. Tawfeek, Branden Stone

Main category: cs.LG

TL;DR: This paper analyzes topological properties of ReLU neural networks, showing that Fiedler partitions correlate with decision boundaries in classification, and homology patterns in regression correlate with training loss.

DetailsMotivation: To understand the topological structure of ReLU neural networks and how activation patterns relate to network behavior and decision-making processes.

Method: Analyze polytope decomposition of feature space induced by ReLU networks, investigate Fiedler partitions of dual graphs for binary classification, and compute homology of cellular decomposition for regression tasks.

Result: Fiedler partitions appear to correlate with decision boundaries in binary classification, and similar patterns emerge between training loss and polyhedral cell-count during training in regression tasks.

Conclusion: Topological signatures of ReLU activation patterns provide insights into network behavior, with Fiedler partitions revealing decision boundaries and homology patterns reflecting training dynamics.

Abstract: This paper explores the topological signatures of ReLU neural network activation patterns. We consider feedforward neural networks with ReLU activation functions and analyze the polytope decomposition of the feature space induced by the network. Mainly, we investigate the Fiedler partition of the dual graph and show that it appears to correlate with the decision boundary in the case of binary classification. Additionally, we compute the homology of the cellular decomposition in a regression task to draw out similar patterns in behavior between the training loss and the polyhedral cell count as the model is trained.

[334] Normalized Conditional Mutual Information Surrogate Loss for Deep Neural Classifiers

Linfeng Ye, Zhixiang Chi, Konstantinos N. Plataniotis, En-hui Yang

Main category: cs.LG

TL;DR: Proposes normalized conditional mutual information (NCMI) as a drop-in replacement for cross-entropy loss in DNN classifiers, achieving significant accuracy improvements across multiple benchmarks.

DetailsMotivation: Cross-entropy is the de facto standard loss for training DNN classifiers, but there may be better information-theoretic alternatives that can improve model performance.

Method: Introduces NCMI as a novel information-theoretic surrogate loss, observes its inverse relationship with accuracy, and develops an alternating algorithm to efficiently minimize NCMI during training.

Result: NCMI-trained models outperform state-of-the-art losses substantially: 2.77% top-1 accuracy improvement on ImageNet with ResNet-50, 8.6% macro-F1 improvement on CAMELYON-17, with consistent gains across architectures and batch sizes.

Conclusion: NCMI is a practical, competitive alternative to cross-entropy that offers significant performance improvements at comparable computational cost, making it a viable drop-in replacement for training DNN classifiers.

Abstract: In this paper, we propose a novel information-theoretic surrogate loss, normalized conditional mutual information (NCMI), as a drop-in alternative to the de facto cross-entropy (CE) loss for training deep neural network (DNN) based classifiers. We first observe that the model’s NCMI is inversely proportional to its accuracy. Building on this insight, we introduce an alternating algorithm to efficiently minimize the NCMI. Across image recognition and whole-slide imaging (WSI) subtyping benchmarks, NCMI-trained models surpass state-of-the-art losses by substantial margins at a computational cost comparable to that of CE. Notably, on ImageNet, NCMI yields a 2.77% top-1 accuracy improvement with ResNet-50 compared to CE; on CAMELYON-17, replacing CE with NCMI improves the macro-F1 by 8.6% over the strongest baseline. Gains are consistent across various architectures and batch sizes, suggesting that NCMI is a practical and competitive alternative to CE.

[335] Low-dimensional semi-supervised latent Bayesian optimization for designing antimicrobial peptides

Jyler Menard, R. A. Mansbach

Main category: cs.LG

TL;DR: The paper investigates using dimensionally-reduced latent spaces in deep generative models for antimicrobial peptide design to improve interpretability, optimization efficiency, and organization with physicochemical properties.

DetailsMotivation: Deep generative models like variational autoencoders are valuable for peptide design but suffer from lack of interpretability and rigorous quantification of latent space quality as a search space for antimicrobial peptide discovery.

Method: The study investigates three aspects: (1) whether dimensionally-reduced latent spaces facilitate optimization, (2) how organizing latent spaces with physicochemical properties improves antimicrobial activity optimization efficiency, and (3) the interpretability of these spaces.

Result: Dimensionally-reduced latent spaces are more interpretable and can be advantageous, while latent spaces can be organized with different physicochemical properties even at different percentages of available labels.

Conclusion: This work lays crucial groundwork for biophysically-motivated peptide design procedures by improving interpretability and optimization efficiency in antimicrobial peptide discovery using deep generative models.

Abstract: Antimicrobial peptides (AMPs) are a promising class of therapeutics to treat bacterial infections. Discovering and designing such peptides is difficult because of the vast number of possible sequences of amino acids. Deep generative models, such as variational autoencoders, have shown value in peptide design due to their ability to model sequence space with a continuous-valued latent space. Although such models have already been used to great effect in biomolecular design, they still suffer from a lack of interpretability and rigorous quantification of latent space quality as a search space. We investigate (1) whether searching through a dimensionally-reduced variant of the latent design space may facilitate optimization, (2) how organizing latent spaces with physicochemical properties may improve the efficiency of optimizing antimicrobial activity, and (3) the interpretability of the spaces. We find that employing a dimensionally-reduced version of the latent space is more interpretable and can be advantageous, while we can organize the latent space with different physicochemical properties even at different percentages of available labels. This work lays crucial groundwork for biophysically-motivated peptide design procedures.

[336] MoEMeta: Mixture-of-Experts Meta Learning for Few-Shot Relational Learning

Han Wu, Jie Yin

Main category: cs.LG

TL;DR: MoEMeta: A meta-learning framework using mixture-of-experts to disentangle global relational knowledge from local task-specific contexts for improved few-shot KG relational learning.

DetailsMotivation: Existing meta-learning approaches for few-shot KG relational learning have two key limitations: (1) they learn relation meta-knowledge in isolation without capturing common relational patterns across tasks, and (2) they struggle to effectively incorporate local, task-specific contexts crucial for rapid adaptation.

Method: Proposes MoEMeta framework with two key innovations: (1) a mixture-of-experts (MoE) model that learns globally shared relational prototypes to enhance generalization, and (2) a task-tailored adaptation mechanism that captures local contexts for fast task-specific adaptation.

Result: Extensive experiments on three KG benchmarks show MoEMeta consistently outperforms existing baselines and achieves state-of-the-art performance in few-shot relational learning.

Conclusion: MoEMeta advances few-shot KG relational learning by effectively balancing global generalization with local adaptability through disentangled knowledge representation and task-specific adaptation mechanisms.

Abstract: Few-shot knowledge graph relational learning seeks to perform reasoning over relations given only a limited number of training examples. While existing approaches largely adopt a meta-learning framework for enabling fast adaptation to new relations, they suffer from two key pitfalls. First, they learn relation meta-knowledge in isolation, failing to capture common relational patterns shared across tasks. Second, they struggle to effectively incorporate local, task-specific contexts crucial for rapid adaptation. To address these limitations, we propose MoEMeta, a novel meta-learning framework that disentangles globally shared knowledge from task-specific contexts to enable both effective model generalization and rapid adaptation. MoEMeta introduces two key innovations: (i) a mixture-of-experts (MoE) model that learns globally shared relational prototypes to enhance generalization, and (ii) a task-tailored adaptation mechanism that captures local contexts for fast task-specific adaptation. By balancing global generalization with local adaptability, MoEMeta significantly advances few-shot relational learning. Extensive experiments and analyses on three KG benchmarks show that MoEMeta consistently outperforms existing baselines, achieving state-of-the-art performance.

[337] Amortized Variational Inference for Partial-Label Learning: A Probabilistic Approach to Label Disambiguation

Tobias Fuchs, Nadja Klein

Main category: cs.LG

TL;DR: A novel probabilistic framework for partial-label learning that uses amortized variational inference to directly approximate the true label posterior, achieving state-of-the-art performance in accuracy and efficiency.

DetailsMotivation: Real-world data is often noisy with conflicting labels (e.g., in crowdsourcing). Partial-label learning addresses this by training classifiers when each instance has multiple candidate labels with only one correct. Existing methods either are computationally intensive (early PLL) or rely on heuristic approaches (recent deep learning methods).

Method: A probabilistic framework that directly approximates the posterior distribution over true labels using amortized variational inference. Neural networks predict variational parameters from input data, enabling efficient inference while remaining architecture-agnostic.
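
The amortized posterior can be pictured as a candidate-masked softmax over class logits; a minimal sketch (the paper's variational objective and training procedure are richer):

```python
import torch

def candidate_posterior(logits, candidate_mask):
    # logits: (batch, n_classes); candidate_mask: (batch, n_classes) in {0, 1}
    masked = logits.masked_fill(candidate_mask == 0, float("-inf"))
    return masked.softmax(dim=-1)  # q(y | x), zero outside the candidate set
```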

Result: Theoretical analysis and extensive experiments on synthetic and real-world datasets demonstrate state-of-the-art performance in both accuracy and efficiency.

Conclusion: The method successfully combines the expressiveness of deep learning with the rigor of probabilistic modeling, providing an efficient and accurate solution for partial-label learning problems.

Abstract: Real-world data is frequently noisy and ambiguous. In crowdsourcing, for example, human annotators may assign conflicting class labels to the same instances. Partial-label learning (PLL) addresses this challenge by training classifiers when each instance is associated with a set of candidate labels, only one of which is correct. While early PLL methods approximate the true label posterior, they are often computationally intensive. Recent deep learning approaches improve scalability but rely on surrogate losses and heuristic label refinement. We introduce a novel probabilistic framework that directly approximates the posterior distribution over true labels using amortized variational inference. Our method employs neural networks to predict variational parameters from input data, enabling efficient inference. This approach combines the expressiveness of deep learning with the rigor of probabilistic modeling, while remaining architecture-agnostic. Theoretical analysis and extensive experiments on synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance in both accuracy and efficiency.

[338] Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

Abdelaziz Bounhar, Hadi Abdine, Evan Dufraisse, Ahmad Chamma, Amr Mohamed, Dani Bouch, Michalis Vazirgiannis, Guokan Shang

Main category: cs.LG

TL;DR: RLVR training that filters out easy problems causes LLMs to become verbose. Retaining and up-weighting moderately easy problems acts as implicit length regularization, achieving baseline accuracy with nearly twice shorter solutions.

DetailsMotivation: Standard RLVR pipelines filter out easy problems for training efficiency, causing models to train primarily on harder problems with longer reasoning chains. This leads to models conflating "thinking longer" with "thinking better," resulting in excessive verbosity and higher inference costs.

Method: Retain and modestly up-weight moderately easy problems during RLVR training. This acts as an implicit length regularizer by exposing the model to solvable short-chain tasks, constraining output distribution and preventing runaway verbosity without explicit length penalization.
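
The curation rule can be pictured as a per-problem training weight keyed to the empirical pass rate; the thresholds and up-weight factor below are illustrative choices, not the paper's values:

```python
def sample_weight(pass_rate, easy_lo=0.6, easy_hi=0.95, up_weight=1.5):
    if pass_rate >= easy_hi:   # trivially solved: still filtered out
        return 0.0
    if pass_rate >= easy_lo:   # moderately easy: retained and up-weighted
        return up_weight
    return 1.0                 # hard problems keep their standard weight
```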

Result: Experiments on Qwen3-4B-Thinking-2507 achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The model learns to solve harder problems without inflating output length, demonstrating “emergent brevity for free.”

Conclusion: Retaining easy problems in RLVR training provides effective implicit length regularization, reducing verbosity while maintaining accuracy. This approach offers a cost-effective solution to the verbosity problem in reasoning LLMs without explicit length constraints.

Abstract: Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out easy'' problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a \textbf{model that conflates thinking longer’’ with ``thinking better’’}. In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is \textbf{\emph{emergent brevity for free}}: the model learns to solve harder problems without inflating the output length, \textbf{ despite the absence of any explicit length penalization}. RLVR experiments using this approach on \textit{Qwen3-4B-Thinking-2507} (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available at \href{https://github.com/MBZUAI-Paris/Frugal-AI}{GitHub}, with datasets and models on \href{https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc}{Hugging Face}.

[339] The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Tiberiu Musat

Main category: cs.LG

TL;DR: Grokking phenomenon explained as constrained optimization: gradient descent minimizes weight norm on zero-loss manifold after memorization, with closed-form dynamics derived for two-layer networks.

DetailsMotivation: To understand the puzzling grokking phenomenon where neural networks achieve full generalization only after substantial delay following complete memorization, and to uncover the precise underlying dynamics beyond previous links to representation learning and weight decay.

Method: Proposes viewing post-memorization learning through constrained optimization lens: gradient descent minimizes weight norm on zero-loss manifold. Formally proves this in limit of infinitesimal learning rates and weight decay. Introduces approximation decoupling parameter subsets, derives closed-form expression for first-layer dynamics in two-layer networks.
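
In symbols, our paraphrase of the limiting claim (notation ours, not the paper's): on the zero-loss manifold, the weight-decay force is projected onto the tangent space, and training converges to the minimum-norm point of that manifold:

```latex
\dot{\theta} = -\lambda\, P_{T_{\theta}\mathcal{M}}\,\theta,
\qquad
\theta^{*} = \arg\min_{\theta \in \mathcal{M}} \|\theta\|_{2}^{2},
\qquad
\mathcal{M} = \{\theta : L(\theta) = 0\}.
```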

Result: Experiments confirm that simulating training using predicted gradients reproduces both delayed generalization and representation learning characteristic of grokking, validating the theoretical framework.

Conclusion: Grokking can be understood as gradient descent performing constrained optimization to minimize weight norm on zero-loss manifold after memorization, providing formal explanation for delayed generalization phenomenon.

Abstract: Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.

[340] DynaGen: Unifying Temporal Knowledge Graph Reasoning with Dynamic Subgraphs and Generative Regularization

Jiawei Shen, Jia Zhu, Hanghui Guo, Weijie Shi, Guoqing Ma, Yidan Liang, Jingjiang Liu, Hao Chen, Shimin Di

Main category: cs.LG

TL;DR: DynaGen is a unified method for Temporal Knowledge Graph Reasoning that addresses interpolation and extrapolation tasks through entity-centric subgraph construction with dual-branch GNN for context, and conditional diffusion for learning evolutionary principles.

DetailsMotivation: Existing TKGR methods face two critical challenges: limited contextual modeling in interpolation (embedding temporal info into individual facts) and cognitive generalization bias in extrapolation (leveraging sequence models over graph snapshots). These limitations hinder effective reasoning across both historical and future temporal positions.

Method: DynaGen uses a unified approach: 1) For interpolation: dynamically constructs entity-centric subgraphs and processes them with a synergistic dual-branch GNN encoder to capture evolving structural context. 2) For extrapolation: applies a conditional diffusion process that forces learning of underlying evolutionary principles rather than superficial patterns.

Result: Extensive experiments on six benchmark datasets show state-of-the-art performance. On average, compared to second-best models, DynaGen improves Mean Reciprocal Rank (MRR) by 2.61 points for interpolation and 1.45 points for extrapolation.

Conclusion: DynaGen effectively addresses both interpolation and extrapolation challenges in TKGR through context-aware subgraph modeling and principled evolutionary learning, demonstrating superior performance across diverse temporal reasoning tasks.

Abstract: Temporal Knowledge Graph Reasoning (TKGR) aims to complete missing factual elements along the timeline. Depending on the temporal position of the query, the task is categorized into interpolation and extrapolation. Existing interpolation methods typically embed temporal information into individual facts to complete missing historical knowledge, while extrapolation techniques often leverage sequence models over graph snapshots to identify recurring patterns for future event prediction. These methods face two critical challenges: limited contextual modeling in interpolation and cognitive generalization bias in extrapolation. To address these, we propose a unified method for TKGR, dubbed DynaGen. For interpolation, DynaGen dynamically constructs entity-centric subgraphs and processes them with a synergistic dual-branch GNN encoder to capture evolving structural context. For extrapolation, it applies a conditional diffusion process, which forces the model to learn underlying evolutionary principles rather than just superficial patterns, enhancing its ability to predict unseen future events. Extensive experiments on six benchmark datasets show DynaGen achieves state-of-the-art performance. On average, compared to the second-best models, DynaGen improves the Mean Reciprocal Rank (MRR) score by 2.61 points for interpolation and 1.45 points for extrapolation.

[341] Parametric Expensive Multi-Objective Optimization via Generative Solution Modeling

Tingyang Wei, Jiao Liu, Abhishek Gupta, Chin Chun Ooi, Puay Siew Tan, Yew-Soon Ong

Main category: cs.LG

TL;DR: A novel parametric multi-objective Bayesian optimizer that learns an inverse model to directly predict optimized solutions for any task-preference query without expensive re-evaluation, leveraging inter-task synergies.

DetailsMotivation: Real-world applications require solving families of expensive multi-objective optimization problems (P-EMOPs) under varying conditions. Current methods require separate expensive evaluations for each task instance, but the continuous task parameter space contains infinite distinct problems, demanding a solution that can predict optimized solutions for any task without re-evaluation.

Method: Introduces a parametric multi-objective Bayesian optimizer that alternates between (1) generative solution sampling via conditional generative models and (2) acquisition-driven search leveraging inter-task synergies. Uses task-aware Gaussian processes to theoretically justify faster convergence through inter-task synergies.

Result: The approach enables effective optimization across multiple tasks and achieves direct solution prediction for unseen parameterized EMOPs without additional expensive evaluations. Empirical studies in synthetic and real-world benchmarks verify the effectiveness of the proposed parametric optimizer.

Conclusion: The proposed method successfully addresses the challenge of P-EMOPs by learning an inverse model that can predict optimized solutions for any task-preference query without expensive re-evaluation, leveraging inter-task synergies for faster convergence and better performance.

Abstract: Many real-world applications require solving families of expensive multi-objective optimization problems (EMOPs) under varying operational conditions. This can be formulated as parametric expensive multi-objective optimization problems (P-EMOPs) where each task parameter defines a distinct optimization instance. Current multi-objective Bayesian optimization methods have been widely used for finding finite sets of Pareto optimal solutions for individual tasks. However, P-EMOPs present a fundamental challenge: the continuous task parameter space can contain infinite distinct problems, each requiring separate expensive evaluations. This demands learning an inverse model that can directly predict optimized solutions for any task-preference query without expensive re-evaluation. This paper introduces a novel parametric multi-objective Bayesian optimizer that learns this inverse model by alternating between (1) generative solution sampling via conditional generative models and (2) acquisition-driven search leveraging inter-task synergies. This approach enables effective optimization across multiple tasks and ultimately achieves direct solution prediction for unseen parameterized EMOPs without additional expensive evaluations. We theoretically justify the faster convergence by leveraging inter-task synergies through task-aware Gaussian processes. Building on this, empirical studies on synthetic and real-world benchmarks further verify the effectiveness of the proposed parametric optimizer.

[342] From Small to Large: Generalization Bounds for Transformers on Variable-Size Inputs

Anastasiia Alokhina, Pan Li

Main category: cs.LG

TL;DR: Transformers show size generalization - extrapolating from small to large token sets. The paper provides theoretical bounds for this phenomenon in geometric data, relating error to sampling density and intrinsic dimensionality.

DetailsMotivation: Transformers empirically demonstrate size generalization (extrapolating from small to large token sets) across various domains, but this capability lacks rigorous theoretical characterization. The paper aims to provide theoretical foundations for this phenomenon in geometric data.

Method: Developed a theoretical framework analyzing Transformers on geometric data represented as discrete samples from continuous sources (e.g., point clouds from manifolds, graphs from graphons). Proved error bounds between Transformer outputs for discrete samples and their continuous-domain equivalents, focusing on Transformers with stable positional encodings.

Result: Proved that the error bound is determined by sampling density and intrinsic dimensionality of the data manifold. Experiments on graphs and point clouds of various sizes confirmed the tightness of the theoretical bound.

Conclusion: The paper provides a rigorous theoretical characterization of Transformers’ size generalization capability for geometric data, showing how error scales with sampling density and intrinsic dimensionality, with experimental validation.

Abstract: Transformers exhibit a notable property of size generalization, demonstrating an ability to extrapolate from smaller token sets to significantly longer ones. This behavior has been documented across diverse applications, including point clouds, graphs, and natural language. Despite its empirical success, this capability still lacks a rigorous theoretical characterization. In this paper, we develop a theoretical framework to analyze this phenomenon for geometric data, which we represent as discrete samples from a continuous source (e.g., point clouds from manifolds, graphs from graphons). Our core contribution is a bound on the error between the Transformer’s output for a discrete sample and its continuous-domain equivalent. We prove that for Transformers with stable positional encodings, this bound is determined by the sampling density and the intrinsic dimensionality of the data manifold. Experiments on graphs and point clouds of various sizes confirm the tightness of our theoretical bound.

[343] ReCast: Reliability-aware Codebook Assisted Lightweight Time Series Forecasting

Xiang Ma, Taihua Chen, Pengcheng Wang, Xuemei Li, Caiming Zhang

Main category: cs.LG

TL;DR: ReCast is a lightweight time series forecasting framework that uses a learnable codebook to capture recurring local patterns, with dual-path architecture and reliability-aware updates for robustness to distribution shifts.

DetailsMotivation: Conventional global decomposition methods fail for real-world time series with local, complex, dynamic patterns, and their high complexity limits real-time applicability in resource-constrained environments.

Method: ReCast encodes local patterns into discrete embeddings via patch-wise quantization using a learnable codebook. It employs dual-path architecture: quantization path for regular structures and residual path for irregular fluctuations. Uses reliability-aware codebook update with weighted corrections derived from multiple reliability factors fused via distributionally robust optimization.
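
A minimal sketch of the quantization path: snap each patch to its nearest codeword and route the remainder to the residual path (codebook learning and the reliability-aware update are omitted; shapes are assumptions):

```python
import torch

def quantize_patches(series, codebook, patch_len):
    # series: (batch, T) with T divisible by patch_len; codebook: (K, patch_len)
    patches = series.unfold(dimension=1, size=patch_len, step=patch_len)
    dists = torch.cdist(patches, codebook.unsqueeze(0))  # (batch, n_patch, K)
    idx = dists.argmin(dim=-1)                           # nearest codeword ids
    quantized = codebook[idx]                            # regular structure
    residual = patches - quantized                       # for the residual path
    return quantized, residual, idx
```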

Result: Extensive experiments show ReCast outperforms state-of-the-art models in accuracy, efficiency, and adaptability to distribution shifts.

Conclusion: ReCast provides a lightweight, robust forecasting framework that effectively handles local patterns and distribution shifts through codebook-based quantization and reliability-aware updates.

Abstract: Time series forecasting is crucial for applications in various domains. Conventional methods often rely on global decomposition into trend, seasonal, and residual components, which become ineffective for real-world series dominated by local, complex, and highly dynamic patterns. Moreover, the high model complexity of such approaches limits their applicability in real-time or resource-constrained environments. In this work, we propose a novel \textbf{RE}liability-aware \textbf{C}odebook-\textbf{AS}sisted \textbf{T}ime series forecasting framework (\textbf{ReCast}) that enables lightweight and robust prediction by exploiting recurring local shapes. ReCast encodes local patterns into discrete embeddings through patch-wise quantization using a learnable codebook, thereby compactly capturing stable regular structures. To compensate for residual variations not preserved by quantization, ReCast employs a dual-path architecture comprising a quantization path for efficient modeling of regular structures and a residual path for reconstructing irregular fluctuations. A central contribution of ReCast is a reliability-aware codebook update strategy, which incrementally refines the codebook via weighted corrections. These correction weights are derived by fusing multiple reliability factors from complementary perspectives by a distributionally robust optimization (DRO) scheme, ensuring adaptability to non-stationarity and robustness to distribution shifts. Extensive experiments demonstrate that ReCast outperforms state-of-the-art (SOTA) models in accuracy, efficiency, and adaptability to distribution shifts.

[344] SCOPE: Sequential Causal Optimization of Process Interventions

Jakob De Moor, Hans Weytjens, Johannes De Smedt, Jochen De Weerdt

Main category: cs.LG

TL;DR: SCOPE is a Prescriptive Process Monitoring approach that learns aligned sequential intervention recommendations using backward induction and causal learners, outperforming existing methods that treat interventions independently or require process approximations.

DetailsMotivation: Existing PresPM approaches fail to handle sequences of interventions properly - they either focus on single interventions, treat multiple interventions independently ignoring temporal interactions, or require simulation/data augmentation that creates reality gaps and bias.

Method: SCOPE uses backward induction to estimate intervention effects, propagating impact from final decision point back to first. It leverages causal learners to use observational data directly without needing process approximations for reinforcement learning.
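
A rough sketch of the backward-induction loop, with an off-the-shelf regressor standing in for the causal learners; the feature construction and effect estimators here are simplifications of SCOPE's:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def scope_backward_induction(states, actions, outcome, n_decisions, action_space):
    # states[t]: (n, d) case features at decision t; actions[t]: (n,) logged actions
    value = outcome.copy()                   # KPI observed at the end of each case
    policies = [None] * n_decisions
    for t in reversed(range(n_decisions)):
        X = np.column_stack([states[t], actions[t]])
        model = GradientBoostingRegressor().fit(X, value)
        policies[t] = model
        # Evaluate every candidate action and keep the per-case best value.
        preds = np.stack([
            model.predict(np.column_stack([states[t], np.full(len(value), a)]))
            for a in action_space], axis=1)
        value = preds.max(axis=1)            # propagate the optimal value backward
    return policies
```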

Result: Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing KPIs.

Conclusion: SCOPE effectively addresses sequential intervention alignment in PresPM, and the novel semi-synthetic setup provides a reusable benchmark for future sequential PresPM research.

Abstract: Prescriptive Process Monitoring (PresPM) recommends interventions during business processes to optimize key performance indicators (KPIs). In realistic settings, interventions are rarely isolated: organizations need to align sequences of interventions to jointly steer the outcome of a case. Existing PresPM approaches fall short in this respect. Many focus on a single intervention decision, while others treat multiple interventions independently, ignoring how they interact over time. Methods that do address these dependencies depend either on simulation or data augmentation to approximate the process to train a Reinforcement Learning (RL) agent, which can create a reality gap and introduce bias. We introduce SCOPE, a PresPM approach that learns aligned sequential intervention recommendations. SCOPE employs backward induction to estimate the effect of each candidate intervention action, propagating its impact from the final decision point back to the first. By leveraging causal learners, our method can utilize observational data directly, unlike methods that require constructing process approximations for reinforcement learning. Experiments on both an existing synthetic dataset and a new semi-synthetic dataset show that SCOPE consistently outperforms state-of-the-art PresPM techniques in optimizing the KPI. The novel semi-synthetic setup, based on a real-life event log, is provided as a reusable benchmark for future work on sequential PresPM.

[345] Improving Matrix Exponential for Generative AI Flows: A Taylor-Based Approach Beyond Paterson–Stockmeyer

Jorge Sastre, Daniel Faronbi, José Miguel Alonso, Peter Traver, Javier Ibáñez, Nuria Lloret

Main category: cs.LG

TL;DR: Optimized Taylor-based algorithm for matrix exponential with dynamic parameter selection, outperforming Padé methods for generative AI applications.

DetailsMotivation: Matrix exponential is crucial for scientific computing and generative AI, but traditional Padé methods are being surpassed by newer Taylor-based approaches that offer better accuracy and efficiency for high-throughput AI applications.

Method: Developed an optimized Taylor-based algorithm with rigorous error analysis and dynamic selection strategy for Taylor order and scaling factor to minimize computation under error tolerance constraints.
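
For orientation, a naive Taylor-with-scaling-and-squaring baseline in NumPy; the paper's actual contribution, a polynomial evaluation scheme cheaper than Paterson-Stockmeyer plus dynamic order/scaling selection, is not reproduced here:

```python
import numpy as np

def expm_taylor(A, order=16):
    norm = np.linalg.norm(A, 1)
    s = int(max(0, np.ceil(np.log2(norm)))) if norm > 0 else 0  # scaling steps
    B = A / (2.0 ** s)                       # now ||B|| <= 1, series converges fast
    E = np.eye(A.shape[0])
    for k in range(order, 0, -1):            # Horner evaluation of the series
        E = np.eye(A.shape[0]) + (B @ E) / k
    for _ in range(s):                       # undo the scaling by squaring
        E = E @ E
    return E
```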

Result: Significant acceleration and high numerical stability compared to state-of-the-art implementations, making it highly efficient for large-scale generative modeling.

Conclusion: The proposed Taylor-based method establishes itself as a superior alternative to traditional Padé approximants for matrix exponential computation in generative AI workflows.

Abstract: The matrix exponential is a fundamental operator in scientific computing and system simulation, with applications ranging from control theory and quantum mechanics to modern generative machine learning. While Padé approximants combined with scaling and squaring have long served as the standard, recent Taylor-based methods, which utilize polynomial evaluation schemes that surpass the classical Paterson–Stockmeyer technique, offer superior accuracy and reduced computational complexity. This paper presents an optimized Taylor-based algorithm for the matrix exponential, specifically designed for the high-throughput requirements of generative AI flows. We provide a rigorous error analysis and develop a dynamic selection strategy for the Taylor order and scaling factor to minimize computational effort under a prescribed error tolerance. Extensive numerical experiments demonstrate that our approach provides significant acceleration and maintains high numerical stability compared to existing state-of-the-art implementations. These results establish the proposed method as a highly efficient tool for large-scale generative modeling.

[346] Colorful Pinball: Density-Weighted Quantile Regression for Conditional Guarantee of Conformal Prediction

Qianyi Chen, Bo Li

Main category: cs.LG

TL;DR: A new method improves conditional coverage in conformal prediction by optimizing quantile regression with a density-weighted pinball loss, achieving better performance than standard approaches.

DetailsMotivation: Standard conformal prediction provides marginal coverage guarantees but struggles with reliable conditional coverage for specific inputs. While exact distribution-free conditional coverage is impossible with finite samples, there's a need to improve conditional coverage performance beyond existing relaxed approaches.

Method: The authors propose refining quantile regression components by minimizing mean squared error of conditional coverage. They derive a density-weighted pinball loss using Taylor expansion, where weights are the conditional density of conformity scores at true quantiles. A three-headed quantile network estimates these weights via finite differences using auxiliary quantile levels at 1-α±δ, then fine-tunes the central quantile by optimizing the weighted loss.
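
A sketch of the weighted loss under the stated finite-difference construction; the head ordering and the weight normalization are our assumptions:

```python
import torch

def pinball(pred, y, level):
    diff = y - pred
    return torch.maximum(level * diff, (level - 1.0) * diff)

def weighted_pinball_loss(q_lo, q_mid, q_hi, y, alpha=0.1, delta=0.05):
    # q_lo/q_mid/q_hi: heads at quantile levels 1-α-δ, 1-α, 1-α+δ
    with torch.no_grad():
        # Finite-difference estimate of the conditional density at the quantile:
        # f(Q(1-α)) ≈ 2δ / (Q(1-α+δ) - Q(1-α-δ)).
        density = (2.0 * delta) / (q_hi - q_lo).clamp(min=1e-6)
        weight = density / density.mean()    # normalize the loss scale
    return (weight * pinball(q_mid, y, 1.0 - alpha)).mean()
```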

Result: Theoretical analysis provides exact non-asymptotic guarantees characterizing the excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance compared to standard conformal methods.

Conclusion: The proposed approach successfully addresses the challenge of improving conditional coverage in conformal prediction by directly optimizing quantile regression with a novel density-weighted objective, offering both theoretical guarantees and practical performance gains.

Abstract: While conformal prediction provides robust marginal coverage guarantees, achieving reliable conditional coverage for specific inputs remains challenging. Although exact distribution-free conditional coverage is impossible with finite samples, recent work has focused on improving the conditional coverage of standard conformal procedures. Distinct from approaches that target relaxed notions of conditional coverage, we directly minimize the mean squared error of conditional coverage by refining the quantile regression components that underpin many conformal methods. Leveraging a Taylor expansion, we derive a sharp surrogate objective for quantile regression: a density-weighted pinball loss, where the weights are given by the conditional density of the conformity score evaluated at the true quantile. We propose a three-headed quantile network that estimates these weights via finite differences using auxiliary quantile levels at 1-α±δ, subsequently fine-tuning the central quantile by optimizing the weighted loss. We provide a theoretical analysis with exact non-asymptotic guarantees characterizing the resulting excess risk. Extensive experiments on diverse high-dimensional real-world datasets demonstrate remarkable improvements in conditional coverage performance.

[347] ACDZero: MCTS Agent for Mastering Automated Cyber Defense

Yu Li, Sizhe Tang, Rongqian Chen, Fei Xu Yu, Guangyu Jiang, Mahdi Imani, Nathaniel D. Bastian, Tian Lan

Main category: cs.LG

TL;DR: The paper proposes a Monte Carlo Tree Search (MCTS) based planning approach with graph neural networks for automated cyber defense, improving sample efficiency and performance over RL baselines in complex network scenarios.

DetailsMotivation: Existing deep reinforcement learning approaches for automated cyber defense face difficult exploration in complex networks with large decision/state spaces, requiring expensive amounts of samples. There's a need for more sample-efficient defense policies.

Method: Frames automated cyber defense as a context-based partially observable Markov decision problem and proposes a planning-centric defense policy based on Monte Carlo Tree Search (MCTS). Uses graph neural networks to embed network observations as attributed graphs for permutation-invariant reasoning. Combines learned graph embeddings and priors over graph-edit actions with MCTS, integrating model-free generalization and policy distillation with look-ahead planning.

Result: The search-guided, graph-embedding-based planning approach improves defense reward and robustness relative to state-of-the-art RL baselines on CAGE Challenge 4 scenarios involving diverse network structures and adversary behaviors.

Conclusion: The proposed MCTS-based planning approach with graph neural network embeddings provides a more sample-efficient and effective solution for automated cyber defense in complex network environments compared to existing reinforcement learning methods.

Abstract: Automated cyber defense (ACD) seeks to protect computer networks with minimal or no human intervention, reacting to intrusions by taking corrective actions such as isolating hosts, resetting services, deploying decoys, or updating access controls. However, existing approaches for ACD, such as deep reinforcement learning (RL), often face difficult exploration in complex networks with large decision/state spaces and thus require an expensive amount of samples. Inspired by the need to learn sample-efficient defense policies, we frame ACD in CAGE Challenge 4 (CAGE-4 / CC4) as a context-based partially observable Markov decision problem and propose a planning-centric defense policy based on Monte Carlo Tree Search (MCTS). It explicitly models the exploration-exploitation tradeoff in ACD and uses statistical sampling to guide exploration and decision making. We make novel use of graph neural networks (GNNs) to embed observations from the network as attributed graphs, to enable permutation-invariant reasoning over hosts and their relationships. To make our solution practical in complex search spaces, we guide MCTS with learned graph embeddings and priors over graph-edit actions, combining model-free generalization and policy distillation with look-ahead planning. We evaluate the resulting agent on CC4 scenarios involving diverse network structures and adversary behaviors, and show that our search-guided, graph-embedding-based planning improves defense reward and robustness relative to state-of-the-art RL baselines.

[348] VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation

Longwen Wang, Xuan’er Wu, Xiaohui Hu, Yirui Liu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li

Main category: cs.LG

TL;DR: VeRPO introduces a novel RL framework for code generation that creates dense rewards from weighted partial unit test success, eliminating the need for external reward models while maintaining verifiable execution feedback.

DetailsMotivation: Current RL approaches for code generation face challenges with sparse pass/fail rewards and problematic reward models. Sparse rewards from unit test execution limit performance gains, while learned reward models suffer from misalignment issues and high computational costs.

Method: VeRPO constructs dense rewards from weighted partial success by dynamically estimating the difficulty weight of each unit test based on execution statistics during training. It combines these partial success signals with global execution outcomes to create robust, verifiable dense rewards without external models.
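
A minimal sketch of the reward shape as we read it: rarely passed tests get heavier weights, and the dense partial-success signal is mixed with the global pass/fail outcome (the weighting rule and mixing coefficient are illustrative assumptions):

```python
def verpo_reward(passed, pass_counts, attempt_counts, eps=1e-6):
    # passed: list[bool] per unit test; pass/attempt counts: running statistics
    weights = [1.0 - pass_counts[i] / max(attempt_counts[i], 1)  # harder => heavier
               for i in range(len(passed))]
    total = sum(weights) + eps
    dense = sum(w for w, ok in zip(weights, passed) if ok) / total
    outcome = 1.0 if all(passed) else 0.0     # verifiable global signal
    return 0.5 * dense + 0.5 * outcome        # mixing coefficient is illustrative
```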

Result: VeRPO consistently outperforms outcome-driven and reward model baselines across diverse benchmarks, achieving up to +8.83% gain in pass@1 with negligible time cost (<0.02%) and zero GPU memory overhead.

Conclusion: VeRPO provides an effective solution for dense reward design in code generation RL that is fully grounded in verifiable execution feedback, eliminating the need for problematic external reward models while maintaining computational efficiency.

Abstract: Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream pass/fail outcome rewards enforce functional correctness via executing unit tests, but the resulting sparsity limits potential performance gains. While recent work has explored external Reward Models (RM) to generate richer, continuous rewards, the learned RMs suffer from reward misalignment and prohibitive computational cost. In this paper, we introduce VeRPO (Verifiable Dense Reward Policy Optimization), a novel RL framework for code generation that synthesizes robust and dense rewards fully grounded in verifiable execution feedback. The core idea of VeRPO is constructing dense rewards from weighted partial success: by dynamically estimating the difficulty weight of each unit test based on the execution statistics during training, a dense reward is derived from the sum of weights of the passed unit tests. To solidify the consistency between partial success and end-to-end functional correctness, VeRPO further integrates the dense signal with global execution outcomes, establishing a robust and dense reward paradigm relying solely on verifiable execution feedback. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost (< 0.02%) and zero GPU memory overhead.

[349] Distributed Online Convex Optimization with Efficient Communication: Improved Algorithm and Lower bounds

Sifan Yang, Wenhao Yang, Wei Jiang, Lijun Zhang

Main category: cs.LG

TL;DR: This paper improves regret bounds for distributed online convex optimization with compressed communication by proposing a novel algorithm with better dependence on compression quality and network size.

DetailsMotivation: Prior work on distributed online convex optimization with compressed communication suffers from quadratic/quartic dependence on compression quality factor (ω⁻¹) and super-linear dependence on number of learners (n), which is undesirable for practical applications.

Method: Proposes a novel algorithm with a two-level blocking update framework incorporating two key components: an online gossip strategy and an error compensation scheme, which collaborate to achieve better consensus among learners. Also extends to bandit feedback scenario using classic gradient estimators.
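
To make the two ingredients concrete, here is a minimal sketch of one learner's compressed-gossip step with error compensation, using top-k as a stand-in ω-contractive compressor; the update form is an illustrative assumption, not the paper's exact algorithm.

```python
import numpy as np

def top_k(x, k):
    """Keep the k largest-magnitude entries (a standard contractive compressor)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def gossip_step(x_i, e_i, nbr_msgs, mix_weights, k):
    """One learner's update: compress its state with error feedback, then
    average with the compressed messages received from neighbors."""
    msg_i = top_k(x_i + e_i, k)       # what learner i broadcasts
    e_i = (x_i + e_i) - msg_i         # residual carried to the next round
    msgs = [msg_i] + nbr_msgs         # own message plus received ones
    x_i = sum(w * m for w, m in zip(mix_weights, msgs))
    return x_i, e_i, msg_i
```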

Result: Achieves improved regret bounds: Õ(ω⁻¹/²ρ⁻¹n√T) for convex functions and Õ(ω⁻¹ρ⁻²n ln T) for strongly convex functions. Establishes first lower bounds for this problem, justifying optimality with respect to ω and T.

Conclusion: The proposed algorithm significantly improves regret bounds for distributed online convex optimization with compressed communication, with better dependence on compression quality and network size, and establishes fundamental lower bounds for the problem.

Abstract: We investigate distributed online convex optimization with compressed communication, where $n$ learners connected by a network collaboratively minimize a sequence of global loss functions using only local information and compressed data from neighbors. Prior work has established regret bounds of $O(\max\{ω^{-2}ρ^{-4}n^{1/2}, ω^{-4}ρ^{-8}\}n\sqrt{T})$ and $O(\max\{ω^{-2}ρ^{-4}n^{1/2}, ω^{-4}ρ^{-8}\}n\ln{T})$ for convex and strongly convex functions, respectively, where $ω\in(0,1]$ is the compression quality factor ($ω=1$ means no compression) and $ρ<1$ is the spectral gap of the communication matrix. However, these regret bounds suffer from a quadratic or even quartic dependence on $ω^{-1}$. Moreover, the super-linear dependence on $n$ is also undesirable. To overcome these limitations, we propose a novel algorithm that achieves improved regret bounds of $\tilde{O}(ω^{-1/2}ρ^{-1}n\sqrt{T})$ and $\tilde{O}(ω^{-1}ρ^{-2}n\ln{T})$ for convex and strongly convex functions, respectively. The primary idea is to design a two-level blocking update framework incorporating two novel ingredients: an online gossip strategy and an error compensation scheme, which collaborate to achieve a better consensus among learners. Furthermore, we establish the first lower bounds for this problem, justifying the optimality of our results with respect to both $ω$ and $T$. Additionally, we consider the bandit feedback scenario, and extend our method with the classic gradient estimators to enhance existing regret bounds.

cs.MA

[350] Simulation-Free PSRO: Removing Game Simulation from Policy Space Response Oracles

Yingzhuo Liu, Shuodi Liu, Weijun Luo, Liuyu Xiang, Zhaofeng He

Main category: cs.MA

TL;DR: Dynamic Window-based Simulation-Free PSRO reduces computational cost by limiting strategy window size and using Nash Clustering for strategy elimination, achieving lower exploitability than existing methods.

DetailsMotivation: PSRO is effective for Nash equilibrium approximation but suffers from high computational costs, with game simulation being the primary bottleneck. There's a need for simulation-free approaches to make PSRO more practical.

Method: Proposes Dynamic Window-based Simulation-Free PSRO that introduces a strategy window concept to replace the original strategy set. Limits the number of strategies in the window, simplifies opponent selection, and improves best response robustness. Uses Nash Clustering to select strategies for elimination to effectively limit window size.
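
A minimal sketch of the window maintenance step, using least-meta-Nash-support elimination as a stand-in for the paper's Nash Clustering step; all names are illustrative.

```python
def update_window(window, meta_nash, new_policy, max_size):
    """window: current strategies; meta_nash: their meta-game Nash weights
    (same length as window). The newly trained best response is always kept;
    if the window overflows, the least-supported old strategy is dropped."""
    window = window + [new_policy]
    if len(window) > max_size:
        drop = min(range(len(meta_nash)), key=lambda i: meta_nash[i])
        window = window[:drop] + window[drop + 1:]
    return window
```

Bounding the window this way keeps the meta-game small, which is what simplifies opponent selection at every PSRO iteration.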

Result: Experiments across various environments show the Dynamic Window mechanism significantly reduces exploitability compared to existing methods while maintaining excellent compatibility.

Conclusion: The proposed simulation-free approach with dynamic window mechanism effectively addresses PSRO’s computational bottleneck, making it more practical while maintaining or improving performance over existing methods.

Abstract: Policy Space Response Oracles (PSRO) combines game-theoretic equilibrium computation with learning and is effective in approximating Nash Equilibrium in zero-sum games. However, the computational cost of PSRO has become a significant limitation to its practical application. Our analysis shows that game simulation is the primary bottleneck in PSRO’s runtime. To address this issue, we introduce the concept of Simulation-Free PSRO and summarize existing methods that instantiate it. Additionally, we propose a novel Dynamic Window-based Simulation-Free PSRO, which introduces the concept of a strategy window to replace the original strategy set maintained in PSRO. The number of strategies in the strategy window is limited, thereby simplifying opponent strategy selection and improving the robustness of the best response. Moreover, we use Nash Clustering to select the strategy to be eliminated, ensuring that the number of strategies within the strategy window is effectively limited. Our experiments across various environments demonstrate that the Dynamic Window mechanism significantly reduces exploitability compared to existing methods, while also exhibiting excellent compatibility. Our code is available at https://github.com/enochliu98/SF-PSRO.

[351] On the Transition to an Auction-based Intelligent Parking Assignment System

Levente Alekszejenkó, Dobrowiecki Tadeusz

Main category: cs.MA

TL;DR: Auction-based parking assignment improves traffic flow and parking proximity but increases costs, motivating adoption as non-participants face even higher prices.

DetailsMotivation: To evaluate auction-based parking assignment systems by simulating voluntary adoption before mandatory implementation, examining how different market penetration rates affect traffic flow, system performance, and financial outcomes.

Method: Eclipse SUMO simulations with varying rates of participants vs. non-participants in a smartphone-based reservation system, analyzing traffic flow, auction system performance, and financial impacts.

Result: Auction-based system improves traffic flow with higher penetration rates, allows participants to park closer to preferred lots, but increases parking costs for participants. Non-participants face even higher prices, motivating them to adopt the system.

Conclusion: Auction-based parking assignment shows promise for reducing traffic congestion and improving parking efficiency, with market-driven pricing creating natural incentives for adoption despite increased costs.

Abstract: Finding a free parking space in a city has become a challenging task over the past decades. A recently proposed auction-based parking assignment can alleviate cruising for parking and also set a market-driven, demand-responsive parking price. However, the wide acceptance of such a system is far from certain. To evaluate the merits of auction-based parking assignment, we assume that drivers have access to a smartphone-based reservation system prior to its mandatory introduction and thus have the opportunity to test and experience its merits voluntarily. We set up our experiment as Eclipse SUMO simulations with different rates of participants and non-participants to check how different market penetration levels affect the traffic flow, the performance of the auction-based assignment system, and the financial outcomes. The results show that the auction-based system improves traffic flow with increasing penetration rates, allowing participants to park gradually closer to their preferred parking lots. However, it comes with a price; the system also increases parking expenditures for participants. Interestingly, non-participating drivers will face even higher parking prices. Consequently, they will be motivated to use the new system.

[352] EvidFuse: Writing-Time Evidence Learning for Consistent Text-Chart Data Reporting

Huanxiang Lin, Qianyue Wang, Jinwu Hu, Bailin Chen, Qing Du, Mingkui Tan

Main category: cs.MA

TL;DR: EvidFuse is a training-free multi-agent framework that enables simultaneous text-chart generation for data-driven reports, solving chart-text inconsistency and insight freezing problems in current LLM-based systems.

DetailsMotivation: Current LLM-based systems generate narratives and visualizations in staged pipelines (text-first-graph-second or graph-first-text-second), leading to chart-text inconsistency and insight freezing where intermediate evidence space becomes fixed, resulting in shallow and predefined analysis.

Method: EvidFuse uses two collaborating components: 1) Data-Augmented Analysis Agent with EDA-derived knowledge and raw table access, and 2) Real-Time Evidence Construction Writer that plans outlines and drafts reports while intermittently issuing fine-grained analysis requests, allowing visual evidence to be constructed exactly when needed.
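
The writing-time loop might look like the following sketch, where both agent objects are hypothetical stand-ins for the components above; the point it illustrates is that charts are requested mid-draft rather than in a prior pipeline stage.

```python
# Structural sketch only: `writer` and `analysis_agent` are hypothetical
# interfaces standing in for the two EvidFuse components described above.

def write_report(outline, writer, analysis_agent, tables):
    report = []
    for section in outline:
        draft = writer.draft(section, context=report)
        # Issue fine-grained analysis requests while drafting; each returned
        # chart becomes evidence that constrains the claims that follow it.
        while (request := writer.needs_evidence(draft)) is not None:
            chart = analysis_agent.analyze(request, tables)
            draft = writer.revise(draft, evidence=chart)
        report.append(draft)
    return report
```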

Result: Experiments show EvidFuse attains top rank in both LLM-as-a-judge and human evaluations on chart quality, chart-text alignment, and report-level usefulness.

Conclusion: EvidFuse enables writing-time text-chart interleaved generation, allowing visual evidence to be constructed and incorporated exactly when narrative requires it, directly constraining subsequent claims and enabling on-demand expansion of evidence space.

Abstract: Data-driven reports communicate decision-relevant insights by tightly interleaving narrative text with charts grounded in underlying tables. However, current LLM-based systems typically generate narratives and visualizations in staged pipelines, following either a text-first-graph-second or a graph-first-text-second paradigm. These designs often lead to chart-text inconsistency and insight freezing, where the intermediate evidence space becomes fixed and the model can no longer retrieve or construct new visual evidence as the narrative evolves, resulting in shallow and predefined analysis. To address the limitations, we propose EvidFuse, a training-free multi-agent framework that enables writing-time text-chart interleaved generation for data-driven reports. EvidFuse decouples visualization analysis from long-form drafting via two collaborating components: a Data-Augmented Analysis Agent, equipped with Exploratory Data Analysis (EDA)-derived knowledge and access to raw tables, and a Real-Time Evidence Construction Writer that plans an outline and drafts the report while intermittently issuing fine-grained analysis requests. This design allows visual evidence to be constructed and incorporated exactly when the narrative requires it, directly constraining subsequent claims and enabling on-demand expansion of the evidence space. Experiments demonstrate that EvidFuse attains the top rank in both LLM-as-a-judge and human evaluations on chart quality, chart-text alignment, and report-level usefulness.

[353] How Exploration Breaks Cooperation in Shared-Policy Multi-Agent Reinforcement Learning

Yi-Ning Weng, Hsuan-Wei Lee

Main category: cs.MA

TL;DR: Shared-policy DQN in multi-agent reinforcement learning systematically collapses cooperation in social dilemmas due to exploration-driven representational bias, not reward misalignment or insufficient training.

DetailsMotivation: Multi-agent reinforcement learning often uses parameter sharing for scalability, but the paper investigates why this approach can systematically undermine cooperation in social dilemmas where cooperative equilibria should be stable and payoff-dominant.

Method: The authors conduct controlled experiments with shared Deep Q-Network (DQN) learning in dynamic social dilemmas. They systematically test across different network sizes, exploration schedules, and payoff structures to isolate the failure mechanism.
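
The parameter-sharing manipulation at the heart of these experiments reduces to whether all agents point at one set of Q-network weights; a minimal sketch (sizes illustrative, not the authors' training code):

```python
import torch.nn as nn

def make_q_net(obs_dim=32, n_actions=4):
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

n_agents = 8
shared = make_q_net()
shared_policies = [shared] * n_agents  # parameter sharing: every agent's
                                       # gradients update the same weights
independent = [make_q_net() for _ in range(n_agents)]  # no coupling
```

Under sharing, one agent's exploratory updates shift the representation used by all agents, which is the propagation channel the paper identifies.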

Result: Shared DQN converges to stable but persistently low-cooperation regimes. The collapse is caused by a representational failure where exploration-driven updates bias the shared representation toward defection responses, which then propagate across agents and suppress cooperative learning. The failure disappears when parameter sharing is removed or when agents maintain independent representations.

Conclusion: The paper identifies a fundamental failure mode of shared-policy MARL where scalable learning architectures can systematically undermine cooperation. The findings provide concrete guidance for designing multi-agent learning systems in social and economic environments where collective behavior is critical.

Abstract: Multi-agent reinforcement learning in dynamic social dilemmas commonly relies on parameter sharing to enable scalability. We show that in shared-policy Deep Q-Network learning, standard exploration can induce a robust and systematic collapse of cooperation even in environments where fully cooperative equilibria are stable and payoff dominant. Through controlled experiments, we demonstrate that shared DQN converges to stable but persistently low-cooperation regimes. This collapse is not caused by reward misalignment, noise, or insufficient training, but by a representational failure arising from partial observability combined with parameter coupling across heterogeneous agent states. Exploration-driven updates bias the shared representation toward locally dominant defection responses, which then propagate across agents and suppress cooperative learning. We confirm that the failure persists across network sizes, exploration schedules, and payoff structures, and disappears when parameter sharing is removed or when agents maintain independent representations. These results identify a fundamental failure mode of shared-policy MARL and establish structural conditions under which scalable learning architectures can systematically undermine cooperation. Our findings provide concrete guidance for the design of multi-agent learning systems in social and economic environments where collective behavior is critical.

[354] Conformity Dynamics in LLM Multi-Agent Systems: The Roles of Topology and Self-Social Weighting

Chen Han, Jin Tan, Bohan Yu, Wenzhen Zheng, Xijin Tang

Main category: cs.MA

TL;DR: LLM-based multi-agent systems show that network topology critically shapes conformity dynamics in collective decision-making, with centralized structures enabling fast decisions but being vulnerable to hub competence, while distributed structures promote robust consensus but risk wrong-but-sure cascades.

DetailsMotivation: To systematically study how network topology shapes conformity dynamics in LLM-based multi-agent systems, particularly in misinformation detection tasks, since conformity (agents aligning with group opinions) is a fundamental but underexplored mechanism in collective decision-making.

Method: Introduced a confidence-normalized pooling rule to control the trade-off between self-reliance and social influence, enabling comparisons between two canonical decision paradigms: Centralized Aggregation and Distributed Consensus. Conducted experiments on misinformation detection tasks with different network topologies.
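
One plausible reading of the confidence-normalized pooling rule, with a self-weight controlling the self-social trade-off; the exact functional form is an assumption, not the paper's equation.

```python
import numpy as np

def pooled_belief(own_p, own_c, nbr_ps, nbr_cs, self_weight=0.5):
    """own_p / nbr_ps: probability vectors; own_c / nbr_cs: scalar confidences.
    self_weight interpolates between self-reliance and social influence."""
    social = sum(c * p for c, p in zip(nbr_cs, nbr_ps))
    social = social / (sum(nbr_cs) + 1e-9)       # confidence-normalized average
    mix = self_weight * own_c * own_p + (1 - self_weight) * social
    return mix / mix.sum()                        # renormalize to a distribution

# Example: one agent leaning "true" hears a confident neighbor leaning "false".
belief = pooled_belief(np.array([0.7, 0.3]), 0.9,
                       [np.array([0.2, 0.8])], [0.6])
```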

Result: Network topology critically governs both efficiency and robustness of collective judgments. Centralized structures enable immediate decisions but are sensitive to hub competence and exhibit same-model alignment biases. Distributed structures promote more robust consensus, while increased network connectivity speeds up convergence but also heightens the risk of wrong-but-sure cascades (agents converging on incorrect decisions with high confidence).

Conclusion: Network topology and self-social weighting jointly shape the efficiency, robustness, and failure modes of collective decision-making in LLM-based multi-agent systems, providing insights into conformity dynamics and design considerations for such systems.

Abstract: Large Language Models (LLMs) are increasingly instantiated as interacting agents in multi-agent systems (MAS), where collective decisions emerge through social interaction rather than independent reasoning. A fundamental yet underexplored mechanism in this process is conformity, the tendency of agents to align their judgments with prevailing group opinions. This paper presents a systematic study of how network topology shapes conformity dynamics in LLM-based MAS through a misinformation detection task. We introduce a confidence-normalized pooling rule that controls the trade-off between self-reliance and social influence, enabling comparisons between two canonical decision paradigms: Centralized Aggregation and Distributed Consensus. Experimental results demonstrate that network topology critically governs both the efficiency and robustness of collective judgments. Centralized structures enable immediate decisions but are sensitive to hub competence and exhibit same-model alignment biases. In contrast, distributed structures promote more robust consensus, while increased network connectivity speeds up convergence but also heightens the risk of wrong-but-sure cascades, in which agents converge on incorrect decisions with high confidence. These findings characterize the conformity dynamics in LLM-based MAS, clarifying how network topology and self-social weighting jointly shape the efficiency, robustness, and failure modes of collective decision-making.

[355] Simulating Multi-Stakeholder Decision-Making with Generative Agents in Urban Planning

Jin Gao, Hanyong Xu, Luc Dao

Main category: cs.MA

TL;DR: LLM-based multi-agent systems can simulate urban planning stakeholder discussions, but face ethical risks; integrating demographic data improves decision diversity and stability.

DetailsMotivation: Urban planning consensus is hindered by complex negotiations, power dynamics, and competing interests. LLM-based multi-agent systems offer promise for simulating stakeholder discussions but introduce ethical risks like misrepresentation and biases.

Method: Develop multi-generative agent systems with varying levels of real-world survey data and demographic detail. Test agents under altruism-driven and interest-driven value frameworks using a real-world urban rezoning challenge. Evaluate demographic factors (race, gender, age) on collective decision-making.

Result: Integrating demographic and life-value data enhances diversity and stability of agent outputs. Communication among agents improves collective reasoning quality. Provides predictive framework for anticipating stakeholder reactions.

Conclusion: Simulated multi-agent approach enables iterative refinement of urban planning proposals before public release, fostering more equitable and cost-effective decisions while addressing ethical concerns.

Abstract: Reaching consensus in urban planning is a complex process often hindered by prolonged negotiations, trade-offs, power dynamics, and competing stakeholder interests, resulting in inefficiencies and inequities. Advances in large language models (LLMs), with their increasing capabilities in knowledge transfer, reasoning, and planning, have enabled the development of multi-generative agent systems, offering a promising approach to simulating discussions and interactions among diverse stakeholders on contentious topics. However, applying such systems also carries significant societal and ethical risks, including misrepresentation, privacy concerns, and biases stemming from opinion convergence among agents, hallucinations caused by insufficient or biased prompts, and the inherent limitations of foundation models. To evaluate the influence of these factors, we incorporate varying levels of real-world survey data and demographic detail to test agents’ performance under two decision-making value frameworks: altruism-driven and interest-driven, using a real-world urban rezoning challenge. This approach evaluates the influence of demographic factors such as race, gender, and age on collective decision-making in the design of multi-generative agent systems. Our experimental results reveal that integrating demographic and life-value data enhances the diversity and stability of agent outputs. In addition, communication among generated agents improves the quality of collective reasoning. These findings provide a predictive framework for decision-makers to anticipate stakeholder reactions, including concerns, objections, and support. By enabling iterative refinement of proposals before public release, the simulated approach fosters more equitable and cost-effective decisions in urban planning.

cs.MM

[356] Meaning over Motion: A Semantic-First Approach to 360° Viewport Prediction

Arman Nik Khah, Arvin Bahreini, Ravi Prakash

Main category: cs.MM

TL;DR: Novel 360° video streaming framework uses semantic intent prediction to avoid “Saccade Trap” - reduces stalls by ≥20% and bandwidth by ≥18%.

DetailsMotivation: Current viewport prediction fails during rapid, meaning-driven attention shifts (Saccade Trap), causing rebuffering when engagement is highest. Existing methods treat users as passive physical objects rather than cognitive agents.

Method: Semantically-Adaptive Conformal Tiling with Associative Lookahead: Server-side semantic reasoning generates lightweight association graphs, guiding client-side controller. Creates personalized Multi-Modal Prediction Sets that tighten safety margins during stable fixation while pre-fetching semantically linked non-adjacent tiles.
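
A sketch of assembling the prediction set: cover 1 − α of the predicted tile mass, tighten α during stable fixation, then append semantically linked non-adjacent tiles. The adaptation rule and the graph lookup are illustrative assumptions.

```python
import numpy as np

def tile_prediction_set(tile_probs, assoc_graph, current_tile,
                        fixation_stable, alpha_base=0.10, alpha_tight=0.05):
    """tile_probs: predicted viewing probabilities per tile;
    assoc_graph: dict mapping a tile to semantically linked tile ids."""
    alpha = alpha_tight if fixation_stable else alpha_base
    order = np.argsort(tile_probs)[::-1]
    cum, chosen = 0.0, []
    for t in order:                     # smallest set covering 1 - alpha mass
        chosen.append(int(t))
        cum += tile_probs[t]
        if cum >= 1 - alpha:
            break
    # Associative lookahead: pre-fetch tiles linked to the fixated object,
    # even when they are not spatially adjacent to the current viewport.
    chosen += [t for t in assoc_graph.get(current_tile, []) if t not in chosen]
    return chosen
```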

Result: Trace-driven evaluation on 360-AV-HM dataset shows successful mitigation of Saccade Trap: reduces stall duration by ≥20% and lowers effective bandwidth consumption by ≥18% compared to state-of-the-art trajectory-based baselines.

Conclusion: Integrating cognitive intent into network control through semantic reasoning and associative lookahead effectively converts fixation periods into preparation phases, overcoming limitations of purely physical prediction models.

Abstract: Ultra-high-resolution 360-degree video streaming is severely constrained by the massive bandwidth required to deliver immersive experiences. Current viewport prediction techniques predominantly rely on kinematics or low-level visual saliency, treating users as passive physical objects governed by inertia. This theoretical limitation leads to the “Saccade Trap” – a critical failure mode where predictors fail to anticipate rapid, meaning-driven shifts in attention, causing rebuffering stalls exactly when user engagement is highest. To resolve this, we propose Semantically-Adaptive Conformal Tiling with Associative Lookahead, a novel framework that integrates cognitive intent into network control. Unlike “one-size-fits-all” approaches, our method utilizes an architectural inversion strategy: heavy semantic reasoning is offloaded to the server to generate lightweight association graphs, which guide a low-latency client-side controller. We construct a personalized Multi-Modal Prediction Set that dynamically tightens safety margins during stable fixation to maximize efficiency, while simultaneously pre-fetching non-adjacent tiles containing semantically linked objects (Associative Lookahead). This mechanism effectively converts the “calm” of fixation into a preparation phase for the next interaction. Trace-driven evaluation on the 360-AV-HM dataset demonstrates that this approach successfully mitigates the Saccade Trap, reducing stall duration by $\ge$ 20% and lowering effective bandwidth consumption by $\ge$ 18% compared to state-of-the-art trajectory-based baselines.

[357] TF-Mamba: Text-enhanced Fusion Mamba with Missing Modalities for Robust Multimodal Sentiment Analysis

Xiang Li, Xianfu Cheng, Dezhuang Miao, Xiaoming Zhang, Zhoujun Li

Main category: cs.MM

TL;DR: TF-Mamba: A novel efficient Text-enhanced Fusion Mamba framework for robust multimodal sentiment analysis with missing modalities, using text-aware enhancement and Mamba-based fusion for better performance and efficiency.

DetailsMotivation: Current Transformer-based methods for MSA with missing modalities have quadratic complexity that hinders efficient long-range modeling and multimodal fusion, creating a need for more efficient approaches.

Method: Proposes TF-Mamba with three key components: 1) Text-aware Modality Enhancement (TME) module to align/enrich non-text modalities and reconstruct missing text semantics, 2) Text-based Context Mamba (TC-Mamba) to capture intra-modal contextual dependencies under text collaboration, and 3) Text-guided Query Mamba (TQ-Mamba) to query text-guided multimodal information and learn joint representations.

Result: Extensive experiments on three MSA datasets demonstrate the effectiveness and efficiency of the proposed method under missing modality scenarios.

Conclusion: TF-Mamba provides an efficient and effective solution for robust multimodal sentiment analysis with missing modalities, overcoming the limitations of Transformer-based methods while maintaining strong performance.

Abstract: Multimodal Sentiment Analysis (MSA) with missing modalities has attracted increasing attention recently. While current Transformer-based methods leverage dense text information to maintain model robustness, their quadratic complexity hinders efficient long-range modeling and multimodal fusion. To this end, we propose a novel and efficient Text-enhanced Fusion Mamba (TF-Mamba) framework for robust MSA with missing modalities. Specifically, a Text-aware Modality Enhancement (TME) module aligns and enriches non-text modalities, while reconstructing the missing text semantics. Moreover, we develop Text-based Context Mamba (TC-Mamba) to capture intra-modal contextual dependencies under text collaboration. Finally, Text-guided Query Mamba (TQ-Mamba) queries text-guided multimodal information and learns joint representations for sentiment prediction. Extensive experiments on three MSA datasets demonstrate the effectiveness and efficiency of the proposed method under missing modality scenarios. Our code is available at https://github.com/codemous/TF-Mamba.

[358] Transforming Video Subjective Testing with Training, Engagement, and Real-Time Feedback

Kumar Rahul, Sriram Sethuraman, Andrew Segall, Yixu Chen

Main category: cs.MM

TL;DR: Proposed framework improves subjective video quality assessment through automated rater training, real-time attention scoring, and an efficient pairwise-comparison procedure that recovers quality scores from fewer comparisons.

DetailsMotivation: Traditional subjective video quality assessment protocols have limitations in capturing nuanced perceptual differences and ensuring reliable user input. Need better methods to improve rater training, maintain attention, and reduce the number of comparisons needed.

Method: Three-phase approach: 1) Automated training quiz to teach quality indicators and verify readiness; 2) Real-time attention scoring using “golden” video pairs with penalties for lapses; 3) Efficient chain-based pairwise comparison procedure yielding quality scores in JOD units.
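
A minimal sketch of how a golden-pair attention score with penalties might be computed; the penalty schedule is an illustrative assumption, not the paper's formula.

```python
def attention_score(golden_results, penalty=2.0):
    """golden_results: list of booleans, True when the rater picked the known
    better video in a hidden 'golden' pair. Lapses are penalized more heavily
    than correct answers are rewarded, so sustained attention pays off."""
    correct = sum(golden_results)
    lapses = len(golden_results) - correct
    return max(0.0, correct - penalty * lapses) / max(1, len(golden_results))
```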

Result: Experiments with 80 participants across three groups showed that the training quiz significantly improves data quality (golden-unit accuracy, reduced tie rate), while real-time feedback further improves data quality and yields the most monotonic quality ratings, reducing non-monotonic cases on the high-quality part of the R-Q curve.

Conclusion: The integrated framework with training, quiz, and testing with feedback improves subjective video quality assessment reliability, reduces required comparisons, and helps train better objective video quality metrics by addressing viewer preferences for slightly compressed less-grainy content.

Abstract: Subjective video quality assessment is crucial for optimizing streaming and compression, yet traditional protocols face limitations in capturing nuanced perceptual differences and ensuring reliable user input. We propose an integrated framework that enhances rater training, enforces attention through real-time scoring, and streamlines pairwise comparisons to recover quality scores with fewer comparisons. Participants first undergo an automated training quiz to learn key video quality indicators (e.g., compression artifacts) and verify their readiness. During the test, a real-time attention scoring mechanism, using “golden” video pairs, monitors and reinforces rater focus by applying penalties for lapses. An efficient chain-based pairwise comparison procedure is then employed, yielding quality scores in Just-Objectionable-Differences (JOD) units. Experiments comparing three groups (no training, training without feedback, and training with feedback) with 80 participants demonstrate that the training quiz significantly improves data quality in terms of golden-unit accuracy and reduces the tie rate, while real-time feedback further improves data quality and yields the most monotonic quality ratings. The new three-phase approach (training, quiz, and testing with feedback) can significantly reduce non-monotonic cases on the high-quality part of the R-Q curve, where viewers typically prefer slightly compressed, less-grainy content, and can help train a better objective video quality metric.

eess.AS

[359] Discriminative-Generative Target Speaker Extraction with Decoder-Only Language Models

Bang Zeng, Beilong Tang, Wang Xiang, Ming Li

Main category: eess.AS

TL;DR: LauraTSE is a two-stage discriminative-generative framework for target speaker extraction that combines discriminative front-end robustness with generative back-end quality enhancement for better speech naturalness.

DetailsMotivation: Existing discriminative TSE approaches effectively suppress interfering speakers but struggle with producing speech of high perceptual quality and naturalness. Purely generative approaches suffer from hallucinations, content drift, and limited controllability in complex acoustic scenarios.

Method: Two-stage framework: 1) Discriminative front-end robustly extracts target speaker’s speech for stable intermediate representations; 2) Generative back-end operates in neural audio codec representation space to reconstruct fine-grained details and enhance quality. Investigates collaborative training strategies including front-end freezing/fine-tuning, auxiliary SI-SDR loss, and auto-regressive/non-auto-regressive inference.
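
The auxiliary SI-SDR loss mentioned above has a standard form, sketched here; its exact placement in the authors' training pipeline is not specified by this summary.

```python
import torch

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB; est and ref are 1-D waveform tensors.
    Projects the estimate onto the reference, then compares the target
    component against the residual, so overall gain does not matter."""
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (torch.dot(est, ref) / (ref.pow(2).sum() + eps)) * ref
    noise = est - proj
    return 10 * torch.log10(proj.pow(2).sum() / (noise.pow(2).sum() + eps))
```

Used as a loss, the negated SI-SDR of the front-end output keeps the discriminative stage anchored to the target speaker while the generative back-end refines perceptual quality.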

Result: The proposed framework achieves a more favorable trade-off among speech quality, intelligibility, and speaker consistency compared to existing approaches.

Conclusion: The discriminative-generative TSE framework effectively combines the robustness and controllability of discriminative models with the superior naturalness and quality enhancement capabilities of generative models, addressing limitations of both paradigms.

Abstract: Target speaker extraction (TSE) aims to recover the speech signal of a desired speaker from a mixed audio recording, given a short enrollment utterance. Most existing TSE approaches are based on discriminative modeling paradigms. Although effective at suppressing interfering speakers, these methods often struggle to produce speech with high perceptual quality and naturalness. To address this limitation, we first propose LauraTSE, a generative TSE model built upon an auto-regressive decoder-only language model. However, purely generative approaches may suffer from hallucinations, content drift, and limited controllability, which may undermine their reliability in complex acoustic scenarios. To overcome these challenges, we further introduce a discriminative-generative TSE framework. In this framework, a discriminative front-end is employed to robustly extract the target speaker’s speech, yielding stable and controllable intermediate representations. A generative back-end then operates in the neural audio codec representation space to reconstruct fine-grained speech details and enhance perceptual quality. This two-stage design effectively combines the robustness and controllability of discriminative models with the superior naturalness and quality enhancement capabilities of generative models. Moreover, we systematically investigate collaborative training strategies for the proposed framework, including freezing or fine-tuning the front-end, incorporating an auxiliary SI-SDR loss, and exploring both auto-regressive and non-auto-regressive inference mechanisms. Experimental results demonstrate that the proposed framework achieves a more favorable trade-off among speech quality, intelligibility, and speaker consistency.

eess.IV

[360] DYRECT Computed Tomography: DYnamic Reconstruction of Events on a Continuous Timescale

Wannes Goethals, Tom Bultreys, Steffen Berg, Matthieu N. Boone, Jan Aelterman

Main category: eess.IV

TL;DR: DYRECT is a novel 4D μCT reconstruction technique that uses event-based representation with three volumes (initial attenuation, final attenuation, transition times) instead of traditional frame-based approaches, achieving higher temporal resolution with less data.

DetailsMotivation: Conventional 4D μCT uses frame-based reconstruction which limits temporal resolution, inflates data volume, requires costly post-processing, and ignores temporal correlations in sample structure evolution.

Method: DYRECT estimates individual attenuation evolution profiles for each sample position, creating an event-based representation with three volumes: initial attenuation, final attenuation, and transition times (continuous timescale). Uses iterative reconstruction of transition times and attenuation volumes.
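
The event-based representation is compact enough to sketch directly: any time point is synthesized from the three volumes, so no frame sequence is stored. A hard transition is shown below; a smooth transition profile would be a drop-in replacement.

```python
import numpy as np

def attenuation_at(t, mu0, mu1, t_trans):
    """mu0, mu1, t_trans: arrays of identical (voxel grid) shape.
    Each voxel switches from mu0 to mu1 at its own transition time."""
    return np.where(t < t_trans, mu0, mu1)

# Example on a small grid with random transition times (illustrative values).
mu0 = np.zeros((64, 64, 64))
mu1 = np.ones_like(mu0)
t_trans = np.random.uniform(0.0, 1.0, mu0.shape)
frame = attenuation_at(0.5, mu0, mu1, t_trans)  # sample state at t = 0.5
```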

Result: Validated on synthetic ground truth and experimental data. Effectively pinpoints transition times with a time resolution corresponding to less than a tenth of the projections needed for traditional μCT time frames.

Conclusion: DYRECT provides memory-efficient event-based 4D μCT reconstruction with superior temporal resolution compared to conventional frame-based approaches, enabling more efficient analysis of dynamic processes.

Abstract: Time-resolved high-resolution X-ray Computed Tomography (4D $μ$CT) is an imaging technique that offers insight into the evolution of dynamic processes inside materials that are opaque to visible light. Conventional tomographic reconstruction techniques are based on recording a sequence of 3D images that represent the sample state at different moments in time. This frame-based approach limits the temporal resolution compared to dynamic radiography experiments due to the time needed to make CT scans. Moreover, it leads to an inflation of the amount of data and thus to costly post-processing computations to quantify the dynamic behaviour from the sequence of time frames, thereby often ignoring the temporal correlations of the sample structure. Our proposed 4D $μ$CT reconstruction technique, named DYRECT, estimates individual attenuation evolution profiles for each position in the sample. This leads to a novel memory-efficient event-based representation of the sample, using as little as three image volumes: its initial attenuation, its final attenuation and the transition times. This third volume represents local events on a continuous timescale instead of the discrete global time frames. We propose a method to iteratively reconstruct the transition times and the attenuation volumes. The dynamic reconstruction technique was validated on synthetic ground truth data and experimental data, and was found to effectively pinpoint the transition times in the synthetic dataset with a time resolution corresponding to less than a tenth of the number of projections required to reconstruct traditional $μ$CT time frames.

[361] InnerGS: Internal Scenes Reconstruction and Segmentation via Factorized 3D Gaussian Splatting

Shuxin Liang, Yihan Xiao, Wenlu Tang

Main category: eess.IV

TL;DR: InnerGS: A 3D Gaussian Splatting approach for reconstructing internal scenes from sparse sliced data, enabling text-guided segmentation via language features without camera poses.

DetailsMotivation: Existing 3D Gaussian Splatting work focuses mainly on external surfaces, but internal scene reconstruction is crucial for applications requiring deep understanding of object interiors, especially in medical contexts.

Method: Models continuous volumetric density through inner 3D Gaussian distribution, reconstructs internal structures from sparse sliced data, integrates language features for text-guided segmentation, and works without camera poses.
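
A sketch of the core representation: volumetric density evaluated as a weighted sum of anisotropic 3D Gaussians (left unnormalized, as is common in splatting); all parameters here are illustrative, not the released implementation.

```python
import numpy as np

def density(points, weights, means, covs):
    """points: (N, 3) query positions; weights: (K,); means: (K, 3);
    covs: (K, 3, 3) per-Gaussian covariance matrices."""
    out = np.zeros(len(points))
    for w, mu, cov in zip(weights, means, covs):
        d = points - mu
        inv = np.linalg.inv(cov)
        # Unnormalized anisotropic Gaussian via the Mahalanobis distance.
        out += w * np.exp(-0.5 * np.einsum('ni,ij,nj->n', d, inv, d))
    return out
```

Fitting the weights, means, and covariances to sparse slices is what yields the smooth interior reconstruction, since the Gaussians interpolate density continuously between slice planes.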

Result: Effectively reconstructs smooth and detailed internal structures, demonstrates potential for downstream tasks like segmentation, and enables text-guided segmentation of medical scenes via natural language queries.

Conclusion: The approach is plug-and-play, compatible with any data modalities, eliminates need for camera poses, and provides a framework for internal scene reconstruction with applications in medical imaging and beyond.

Abstract: 3D Gaussian Splatting (3DGS) has recently gained popularity for efficient scene rendering by representing scenes as explicit sets of anisotropic 3D Gaussians. However, most existing work focuses primarily on modeling external surfaces. In this work, we target the reconstruction of internal scenes, which is crucial for applications that require a deep understanding of an object’s interior. By directly modeling a continuous volumetric density through the inner 3D Gaussian distribution, our model effectively reconstructs smooth and detailed internal structures from sparse sliced data. Beyond high-fidelity reconstruction, we further demonstrate the framework’s potential for downstream tasks such as segmentation. By integrating language features, we extend our approach to enable text-guided segmentation of medical scenes via natural language queries. Our approach eliminates the need for camera poses, is plug-and-play, and is inherently compatible with any data modalities. We provide a CUDA implementation at: https://github.com/Shuxin-Liang/InnerGS.

Last updated: 2026-01-21